2026-04-20 Paper Digest
Automated digest of 10 arXiv papers on agent / LLM / AI infra submitted in the last 24h, analysed with Claude Code.
1. First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows
arXiv: 2604.18038 · cs.CY · relevance score 27
This study evaluates racial bias in five LLMs across synthetic patient-case generation and differential diagnosis tasks, finding that all five deviate from US epidemiological distributions. Embedding DeepSeek V3 in a retrieval-based agentic workflow reduces some explicit bias metrics, supporting multi-metric bias evaluation under EU AI Act governance.
2. MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation
arXiv: 2604.18509 · cs.CL · relevance score 26
MASS-RAG proposes a multi-agent collaborative retrieval-augmented generation framework that splits evidence processing among three role-specialised agents (summarisation, extraction, and reasoning), whose outputs a synthesis stage then integrates, improving answer quality over noisy and heterogeneous contexts.
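The role split above can be sketched as a minimal pipeline. This is an illustrative sketch, not the paper's implementation: the role names follow the digest, while `call_llm` and the prompt formats are stand-in assumptions for real model calls.

```python
# Hypothetical sketch of a role-based RAG pipeline in the spirit of MASS-RAG.
# call_llm is a stub; a real system would dispatch each role to an LLM.

def call_llm(role: str, prompt: str) -> str:
    """Stub LLM call that just tags its input with the agent role."""
    return f"[{role}] {prompt[:40]}"

def mass_rag_answer(question: str, passages: list[str]) -> str:
    # Each retrieved passage is processed by three specialised agents.
    summaries = [call_llm("summarizer", p) for p in passages]
    facts = [call_llm("extractor", p) for p in passages]
    chains = [call_llm("reasoner", f"{question}\n{p}") for p in passages]
    # A synthesis stage merges the role outputs into one grounded answer.
    evidence = "\n".join(summaries + facts + chains)
    return call_llm("synthesizer", f"{question}\n{evidence}")

answer = mass_rag_answer("Who proved X?", ["Doc A ...", "Doc B ..."])
```

The point of the structure is that noisy passages are filtered through several independent views before synthesis, rather than being concatenated raw into one prompt.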
3. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
arXiv: 2604.18401 · cs.CL · relevance score 26
StepPO argues that agentic RL for LLMs should move from token-level to step-level MDPs, treating each agent step (rather than each token) as the action unit and performing credit assignment at that granularity. The paper is a position piece with preliminary experiments.
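The contrast between the two granularities can be made concrete. The sketch below is an assumption-laden illustration of step-level credit assignment in general, not StepPO's algorithm: the discounted-return-minus-baseline advantage and the broadcast of one advantage to all tokens in a step are choices made for clarity.

```python
# Step-level credit assignment: one advantage per agent step, shared by all
# tokens inside that step, instead of one advantage per token.

def step_advantages(step_rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Discounted return per step minus a mean baseline (illustrative)."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

def broadcast_to_tokens(advantages: list[float], tokens_per_step: list[int]) -> list[float]:
    """Every token inside a step inherits that step's advantage."""
    out = []
    for adv, n in zip(advantages, tokens_per_step):
        out.extend([adv] * n)
    return out

# Three agent steps; only the last one is rewarded (e.g. task success).
adv = step_advantages([0.0, 0.0, 1.0])
token_adv = broadcast_to_tokens(adv, [5, 3, 4])
```

Under this scheme the policy gradient sees three action-level signals instead of twelve token-level ones, which is the shift in granularity the paper advocates.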
4. AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
arXiv: 2604.18137 · cs.AR · relevance score 25
AQPIM is a PIM-aware activation quantization framework that applies Product Quantization (PQ) directly inside memory to shrink KV-cache footprint and accelerate LLM attention, achieving 3.4× speedup over SOTA PIM baselines while slashing GPU-CPU communication overhead.
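For readers unfamiliar with Product Quantization, a minimal sketch of PQ applied to a cached key vector follows. This shows the generic technique only; the codebooks here are random, and the sizes, codebook training, and the in-memory execution that AQPIM actually contributes are not modelled.

```python
# Illustrative Product Quantization (PQ) of a KV-cache entry: a 64-d vector
# is stored as 8 small codebook indices instead of 64 floats.
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub, n_codes = 64, 8, 16           # 8 subvectors, 16 centroids each
sub = dim // n_sub
codebooks = rng.normal(size=(n_sub, n_codes, sub))  # untrained, for demo only

def pq_encode(v: np.ndarray) -> np.ndarray:
    """Replace each subvector with the index of its nearest centroid."""
    parts = v.reshape(n_sub, sub)
    return np.array([int(np.argmin(((cb - p) ** 2).sum(axis=1)))
                     for cb, p in zip(codebooks, parts)])

def pq_decode(codes: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate vector by concatenating chosen centroids."""
    return np.concatenate([codebooks[i, c] for i, c in enumerate(codes)])

key = rng.normal(size=dim)
codes = pq_encode(key)      # 8 indices (4 bits each here) replace 64 floats
approx = pq_decode(codes)   # lossy reconstruction used at attention time
```

The footprint win is the point: with 16 centroids per subvector, each entry needs 8 × 4 bits of codes plus shared codebooks, versus 64 full-precision activations.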
5. Training and Agentic Inference Strategies for LLM-based Manim Animation Generation
arXiv: 2604.18364 · cs.AI · relevance score 24
The paper introduces ManimTrainer (SFT + GRPO with fused code/visual rewards) and ManimAgent (renderer-in-the-loop inference with API-doc augmentation) for text-to-code-to-video Manim animation. A Qwen 3 Coder 30B variant hits 94% render success and 85.7% visual similarity, beating GPT-4.1.
6. HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
arXiv: 2604.18529 · cs.PF · relevance score 22
HybridGen is a CPU-GPU hybrid attention framework for long-context LLM inference that leverages CXL-expanded tiered memory. By coordinating attention computation across CPU and GPU, it outperforms six SOTA KV cache management methods by 1.41–3.2× while preserving accuracy.
7. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
arXiv: 2604.18655 · cs.DC · relevance score 20
A hardware-aware framework deploys a LLaMA-based multilingual foundation model on Samsung Galaxy S24/S25 phones, combining runtime multi-LoRA switching, multi-stream decoding, dynamic self-speculative decoding, and INT4 quantization to achieve 4–6× memory/latency improvements across 9 languages and 8 tasks.
8. River-LLM: Large Language Model Seamless Exit Based on KV Share
arXiv: 2604.18396 · cs.CL · relevance score 20
River-LLM is a training-free Early Exit framework for decoder-only LLMs that solves the KV Cache Absence problem via a lightweight KV-Shared Exit River, achieving 1.71–2.16× wall-clock speedup on reasoning and code tasks without quality loss.
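The "KV Cache Absence" problem is easy to state with a toy model: when a token exits at an early layer, the layers above it never compute that token's KV entry, so later full-depth tokens have nothing to attend to there. The fill strategy below (deeper layers reusing the exit layer's state) is a simplified assumption standing in for the paper's KV-Shared Exit River, not its actual mechanism.

```python
# Toy model of early exit in a decoder: each layer's KV cache must hold one
# entry per token, including tokens that exited before reaching that layer.

N_LAYERS = 4

def decode_token(hidden: int, exit_layer: int, kv_cache: dict) -> int:
    """Run layers up to exit_layer; deeper layers record a shared copy of the
    exit layer's state instead of computing their own (the 'fill')."""
    state = hidden
    for layer in range(N_LAYERS):
        if layer <= exit_layer:
            state = state + 1          # stand-in for a real transformer layer
        kv_cache[layer].append(state)  # every layer still gets a KV entry
    return state

cache = {layer: [] for layer in range(N_LAYERS)}
decode_token(0, exit_layer=1, kv_cache=cache)   # token 1 exits after layer 1
decode_token(0, exit_layer=3, kv_cache=cache)   # token 2 runs full depth
```

After this, every layer's cache has an entry for both tokens, so the full-depth token can attend to the early-exited one; the open question a real method must answer is how good those shared entries are.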
9. Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
arXiv: 2604.18170 · cs.CL · relevance score 20
Copy-as-Decode reframes LLM text/code editing as grammar-constrained decoding over two primitives (<copy> and <gen>), letting copy spans be filled via a single parallel-prefill forward instead of N autoregressive steps, yielding large theoretical speedups without end-to-end training.
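The two-primitive edit representation can be sketched directly. This is a hedged illustration of the idea, not the paper's format: the op tuples and the way `gen` text is supplied are assumptions, and a real system would decode the `gen` spans autoregressively while validating the whole sequence against the grammar.

```python
# An edit as a sequence of <copy> and <gen> primitives: copy spans are filled
# verbatim from the source in one shot, so only <gen> spans need per-token
# autoregressive decoding.

def apply_edit(source: str, ops: list[tuple]) -> str:
    out = []
    for kind, arg in ops:
        if kind == "copy":
            start, end = arg
            # One slice materialises the whole span at once, mirroring how a
            # single parallel-prefill forward pass covers N copied tokens.
            out.append(source[start:end])
        elif kind == "gen":
            out.append(arg)  # stands in for autoregressively decoded text
    return "".join(out)

src = "def add(a, b):\n    return a + b\n"
ops = [
    ("copy", (0, 15)),                   # keep the signature line
    ("gen", "    # sum two values\n"),   # newly generated comment
    ("copy", (15, len(src))),            # keep the body unchanged
]
edited = apply_edit(src, ops)
```

The speedup claim follows from the op mix: here two of three spans cost one forward pass total instead of one decoding step per token, and for typical edits copy spans dominate.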
10. ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
arXiv: 2604.18789 · cs.AI · relevance score 19
ARES is a red-teaming framework that exposes joint failures of both the core LLM and its reward model in RLHF, then repairs the system in two stages—first fine-tuning the RM, then optimising the policy—yielding safer models without sacrificing capability.