2026-04-20 Paper Digest

Automated digest of 10 arXiv papers on agent / LLM / AI infra submitted in the last 24h, analysed with Claude Code.

1. First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

arXiv: 2604.18038 · cs.CY · relevance score 27

This study evaluates racial bias in five LLMs across synthetic patient-case generation and differential diagnosis tasks, finding that all five deviate from US epidemiological distributions. Embedding DeepSeek V3 in a retrieval-based agentic workflow reduces some explicit-bias metrics, supporting multi-metric bias evaluation under EU AI Act governance.
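
As a rough illustration of what one distributional bias metric can look like (not the paper's actual metric suite), the sketch below compares a model's generated demographic distribution against a reference epidemiological prior via total variation distance; the reference numbers and labels here are made up:

```python
from collections import Counter

def total_variation(p: dict, q: dict) -> float:
    """TV distance between two categorical distributions over the same keys."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical reference prior, e.g. condition-specific US epidemiology.
reference = {"white": 0.60, "black": 0.13, "hispanic": 0.18, "asian": 0.06, "other": 0.03}

# Race labels extracted from synthetic patient cases generated by an LLM.
generated = Counter(["white"] * 81 + ["black"] * 5 + ["hispanic"] * 9 + ["asian"] * 5)
total = sum(generated.values())
generated_dist = {k: v / total for k, v in generated.items()}

print(f"TV distance from reference: {total_variation(generated_dist, reference):.3f}")
```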

Read detailed analysis →


2. MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

arXiv: 2604.18509 · cs.CL · relevance score 26

MASS-RAG proposes a multi-agent collaborative retrieval-augmented generation framework that splits evidence processing across three role-based agents (summarization, extraction, and reasoning) and then integrates their outputs in a synthesis stage, improving answer quality over noisy and heterogeneous contexts.
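
A minimal sketch of the role split, assuming a generic chat-completion client behind a hypothetical `call_llm` stub; the prompts and the exact division of labour are illustrative, not MASS-RAG's actual design:

```python
# Role-based RAG pipeline sketch in the spirit of MASS-RAG.

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def summarize(passage: str) -> str:
    return call_llm("Summarize the passage in 2-3 sentences.", passage)

def extract(passage: str, question: str) -> str:
    return call_llm("Extract spans relevant to the question; say NONE if none.",
                    f"Question: {question}\nPassage: {passage}")

def reason(question: str, evidence: list[str]) -> str:
    return call_llm("Reason step by step over the evidence.",
                    f"Question: {question}\nEvidence:\n" + "\n".join(evidence))

def answer(question: str, passages: list[str]) -> str:
    # Each retrieved passage goes through all three role agents; a final
    # synthesis call reconciles their (possibly noisy) outputs.
    summaries = [summarize(p) for p in passages]
    spans = [extract(p, question) for p in passages]
    chain = reason(question, summaries + spans)
    return call_llm("Synthesize a final answer from the analyses.",
                    f"Question: {question}\nSummaries: {summaries}\n"
                    f"Extracted: {spans}\nReasoning: {chain}")
```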

Read detailed analysis →


3. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arXiv: 2604.18401 · cs.CL · relevance score 26

StepPO argues that agentic RL for LLMs should move from token-level to step-level MDPs, treating each agent step (rather than each token) as the action unit and performing credit assignment at that granularity. The paper is a position piece with preliminary experiments.
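
A minimal sketch of what step-level credit assignment can look like: one advantage per agent step, broadcast to every token inside that step. The reward placement, mean baseline, and step lengths are assumptions, not StepPO's exact objective:

```python
import torch

def step_level_advantages(step_rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Discounted return-to-go per agent step (one scalar per step, not per token)."""
    returns = torch.zeros_like(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns - returns.mean()  # simple mean baseline; the paper may differ

# Tokens in the same step share credit: the step-level (vs token-level) MDP view.
step_rewards = torch.tensor([0.0, 0.0, 1.0])   # e.g. only the final step scores
tokens_per_step = [12, 7, 30]                  # hypothetical step lengths
adv = step_level_advantages(step_rewards)
token_adv = torch.cat([a.repeat(n) for a, n in zip(adv, tokens_per_step)])
print(token_adv.shape)  # torch.Size([49])
```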

Read detailed analysis →


4. AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

arXiv: 2604.18137 · cs.AR · relevance score 25

AQPIM is a PIM-aware activation quantization framework that applies Product Quantization (PQ) directly inside memory to shrink KV-cache footprint and accelerate LLM attention, achieving 3.4× speedup over SOTA PIM baselines while slashing GPU-CPU communication overhead.
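
A minimal software sketch of product quantization applied to a KV vector, showing the compression arithmetic but none of AQPIM's in-memory hardware mapping; the dimensions, codebook size, and random stand-in codebooks are all assumptions:

```python
import numpy as np

d, m, k = 128, 8, 256            # head dim, sub-vectors, centroids per sub-space
sub = d // m
rng = np.random.default_rng(0)
# Stand-in for trained (e.g. k-means) codebooks, one per sub-space.
codebooks = rng.standard_normal((m, k, sub)).astype(np.float32)

def pq_encode(x: np.ndarray) -> np.ndarray:
    """Map a (d,) vector to m uint8 codes: nearest centroid per sub-space."""
    parts = x.reshape(m, sub)
    dists = ((codebooks - parts[:, None, :]) ** 2).sum(-1)   # (m, k)
    return dists.argmin(-1).astype(np.uint8)

def pq_decode(codes: np.ndarray) -> np.ndarray:
    return codebooks[np.arange(m), codes].reshape(d)

key = rng.standard_normal(d).astype(np.float32)
codes = pq_encode(key)           # 8 bytes instead of 512 (64x smaller)
approx = pq_decode(codes)
print(codes.nbytes, key.nbytes)  # 8 512
```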

Read detailed analysis →


5. Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

arXiv: 2604.18364 · cs.AI · relevance score 24

The paper introduces ManimTrainer (SFT + GRPO with fused code/visual rewards) and ManimAgent (renderer-in-the-loop inference with API-doc augmentation) for text-to-code-to-video Manim animation generation. A Qwen 3 Coder 30B variant reaches 94% render success and 85.7% visual similarity, beating GPT-4.1.
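
A minimal sketch of a renderer-in-the-loop retry cycle, assuming hypothetical `generate_code` and `retrieve_api_docs` stubs; the Manim CLI invocation is real, but its flags vary by version, and this is not necessarily ManimAgent's exact loop:

```python
import pathlib
import subprocess
import tempfile

def generate_code(prompt: str, feedback: str = "") -> str:
    raise NotImplementedError("call your code LLM here")

def retrieve_api_docs(error: str) -> str:
    raise NotImplementedError("look up relevant Manim API documentation here")

def render_loop(prompt: str, scene: str = "MyScene", max_tries: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_tries):
        code = generate_code(prompt, feedback)
        path = pathlib.Path(tempfile.mkdtemp()) / "scene.py"
        path.write_text(code)
        # -ql = low-quality preview render (Manim Community CLI)
        proc = subprocess.run(["manim", "-ql", str(path), scene],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return code                      # render succeeded
        # Feed renderer errors plus retrieved docs back into the next attempt.
        feedback = proc.stderr + "\n" + retrieve_api_docs(proc.stderr)
    return None
```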

Read detailed analysis →


6. HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

arXiv: 2604.18529 · cs.PF · relevance score 22

HybridGen is a CPU-GPU hybrid attention framework for long-context LLM inference that leverages CXL-expanded tiered memory. By coordinating attention computation across CPU and GPU, it outperforms six SOTA KV cache management methods by 1.41×–3.2× while preserving accuracy.
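
The numerics that make splitting attention across devices exact can be shown with the standard log-sum-exp merge of partial softmax results (flash-decoding style); HybridGen's actual scheduling is more involved, and the GPU/CPU partition below is illustrative:

```python
import torch

def partial_attn(q, k, v):
    """Attention over one KV partition; returns output plus log-sum-exp weight."""
    scores = q @ k.T / k.shape[-1] ** 0.5           # (1, n)
    lse = torch.logsumexp(scores, dim=-1)           # (1,)
    return torch.softmax(scores, dim=-1) @ v, lse   # (1, d), (1,)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial softmax results exactly via their normalizers."""
    lse = torch.logaddexp(lse_a, lse_b)
    return (torch.exp(lse_a - lse)[:, None] * out_a
            + torch.exp(lse_b - lse)[:, None] * out_b)

torch.manual_seed(0)
d = 64
q = torch.randn(1, d)
k, v = torch.randn(100, d), torch.randn(100, d)
# Pretend the first 70 KV rows sit on the GPU and the rest in CXL/CPU memory.
o_gpu, l_gpu = partial_attn(q, k[:70], v[:70])
o_cpu, l_cpu = partial_attn(q, k[70:], v[70:])
full, _ = partial_attn(q, k, v)
print(torch.allclose(merge(o_gpu, l_gpu, o_cpu, l_cpu), full, atol=1e-5))  # True
```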

Read detailed analysis →


7. Unlocking the Edge Deployment and On-Device Acceleration of a Multi-LoRA-Enabled One-for-All Foundational LLM

arXiv: 2604.18655 · cs.DC · relevance score 20

A hardware-aware framework deploys a LLaMA-based multilingual foundation model on Samsung Galaxy S24/S25 phones, combining runtime multi-LoRA switching, multi-stream decoding, dynamic self-speculative decoding, and INT4 quantization to achieve 4–6× memory and latency improvements across 9 languages and 8 tasks.
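
A minimal sketch of runtime multi-LoRA switching, where selecting an adapter is just a lookup over resident low-rank pairs on top of one frozen base weight; the rank, scaling, and task names are illustrative assumptions:

```python
import torch

d_in, d_out, rank = 512, 512, 8
torch.manual_seed(0)
W = torch.randn(d_out, d_in)                     # frozen base weight, loaded once
adapters = {
    task: (torch.randn(rank, d_in) * 0.01,       # A (down-projection)
           torch.randn(d_out, rank) * 0.01)      # B (up-projection)
    for task in ("translate_de", "summarize", "qa")
}

def lora_forward(x: torch.Tensor, task: str, scale: float = 2.0) -> torch.Tensor:
    A, B = adapters[task]                        # switching = a dict lookup, no reload
    return x @ W.T + scale * (x @ A.T) @ B.T

x = torch.randn(1, d_in)
print(lora_forward(x, "summarize").shape)        # torch.Size([1, 512])
```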

Read detailed analysis →


8. River-LLM: Large Language Model Seamless Exit Based on KV Share

arXiv: 2604.18396 · cs.CL · relevance score 20

River-LLM is a training-free Early Exit framework for decoder-only LLMs that solves the KV Cache Absence problem via a lightweight KV-Shared Exit River, achieving 1.71–2.16× wall-clock speedup on reasoning and code tasks without quality loss.
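
A minimal sketch of the cache-hole problem and one KV-share style fix, where a token that exits early still contributes (approximate) KV entries to the layers it skipped by reusing its exit-layer hidden state; River-LLM's actual mechanism may differ from this reading of the abstract:

```python
import torch

n_layers, d = 4, 32
torch.manual_seed(0)
Wk = [torch.randn(d, d) for _ in range(n_layers)]   # per-layer key projections
Wv = [torch.randn(d, d) for _ in range(n_layers)]   # per-layer value projections
k_cache = [[] for _ in range(n_layers)]
v_cache = [[] for _ in range(n_layers)]

def append_token(hidden_per_layer: list[torch.Tensor], exit_layer: int) -> None:
    """hidden_per_layer[i] is the token's state after layer i, computed only
    up to exit_layer; deeper layers were skipped by the early exit."""
    h_exit = hidden_per_layer[exit_layer]
    for i in range(n_layers):
        # Skipped layers reuse the shared exit-layer state instead of leaving
        # a hole that would break attention for later tokens.
        h = hidden_per_layer[i] if i <= exit_layer else h_exit
        k_cache[i].append(h @ Wk[i])
        v_cache[i].append(h @ Wv[i])

# Token exits after layer 1; layers 2-3 still receive approximate KV entries.
append_token([torch.randn(d) for _ in range(2)], exit_layer=1)
print(len(k_cache[3]))  # 1 -> no cache hole for deeper layers
```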

Read detailed analysis →


9. Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

arXiv: 2604.18170 · cs.CL · relevance score 20

Copy-as-Decode reframes LLM text/code editing as grammar-constrained decoding over two primitives (<copy> and <gen>), letting copy spans be filled via a single parallel-prefill forward pass instead of N autoregressive decode steps, yielding large theoretical speedups without end-to-end training.
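
A minimal sketch of the two-primitive edit executor, with hypothetical model stubs and a made-up op format; the point is that a copy span costs one prefill forward rather than one decode step per token:

```python
def decode_one_token(context: list[int]) -> int:
    raise NotImplementedError("one autoregressive model step")

def prefill(context: list[int], new_tokens: list[int]) -> None:
    # A single forward over `new_tokens` to extend the KV cache; no sampling is
    # needed because copied tokens are already determined by the source.
    pass  # stand-in: a real implementation runs the model once over the span

def apply_edit(source: list[int], ops) -> list[int]:
    out: list[int] = []
    for op in ops:
        if op[0] == "copy":                 # ("copy", start, end): known tokens
            span = source[op[1]:op[2]]
            prefill(out, span)              # 1 forward pass, not len(span) steps
            out.extend(span)
        else:                               # ("gen", n_tokens): new tokens
            for _ in range(op[1]):
                out.append(decode_one_token(out))
    return out
```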

Read detailed analysis →


10. ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

arXiv: 2604.18789 · cs.AI · relevance score 19

ARES is a red-teaming framework that exposes joint failures of both the core LLM and its reward model in RLHF, then repairs the system in two stages, first fine-tuning the RM and then optimising the policy, yielding safer models without sacrificing capability.
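
A high-level sketch of the red-team-then-repair loop, with every model-touching function left as a hypothetical stub; only the two-stage ordering (fix the reward model first, then re-optimise the policy) is taken from the abstract:

```python
def red_team(policy, reward_model) -> list[dict]:
    raise NotImplementedError("search for prompts where policy AND RM jointly fail")

def finetune_rm(reward_model, failures):
    raise NotImplementedError("stage 1: relabel failures, fine-tune the RM")

def optimize_policy(policy, reward_model):
    raise NotImplementedError("stage 2: RLHF against the repaired RM")

def ares_round(policy, reward_model):
    failures = red_team(policy, reward_model)
    reward_model = finetune_rm(reward_model, failures)   # repair the judge first,
    policy = optimize_policy(policy, reward_model)       # then repair the player
    return policy, reward_model
```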

Read detailed analysis →