2026-04-20 Paper Digest

Automated summaries of 10 arXiv papers on agents, LLMs, and AI infrastructure, with analyses generated by Claude Code.

1. First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

arXiv: 2604.18038 · cs.CY · relevance score 27

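Using the EU AI Act as a governance lens, the paper evaluates racial bias in 5 mainstream LLMs on synthetic case generation and differential diagnosis, and finds that a retrieval-based agentic workflow can mitigate the explicit bias of DeepSeek V3.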

2. MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

arXiv: 2604.18509 · cs.CL · relevance score 26

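MASS-RAG splits the processing of retrieved evidence across multiple agents (summarization, extraction, reasoning) and then produces the answer in a separate synthesis stage; when evidence is scattered across sources, it consistently outperforms single-pass RAG baselines.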
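
A minimal sketch of the role split described above, assuming each agent is a plain function from (question, evidence passages) to text. `Pipeline`, `summarize_agent`, `extract_agent`, `reason_agent`, and `synthesize` are hypothetical toy stand-ins for LLM calls, not the MASS-RAG interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent signature: (question, evidence passages) -> text.
Agent = Callable[[str, List[str]], str]


@dataclass
class Pipeline:
    agents: Dict[str, Agent]                            # role name -> agent
    synthesizer: Callable[[str, Dict[str, str]], str]   # merges role outputs into one answer

    def answer(self, question: str, evidence: List[str]) -> str:
        # Every role processes the same retrieved evidence independently.
        role_outputs = {role: agent(question, evidence) for role, agent in self.agents.items()}
        # Only the synthesis stage writes the final answer.
        return self.synthesizer(question, role_outputs)


# Toy stand-ins for LLM calls (hypothetical, for illustration only).
def summarize_agent(q: str, ev: List[str]) -> str:
    # Keep the first sentence of each passage.
    return " ".join(p.split(".")[0] for p in ev)


def extract_agent(q: str, ev: List[str]) -> str:
    # Keep passages that share at least one word with the question.
    words = {w.lower().strip(".,?!") for w in q.split()}
    return " | ".join(p for p in ev if words & {w.lower().strip(".,?!") for w in p.split()})


def reason_agent(q: str, ev: List[str]) -> str:
    return f"{len(ev)} passages considered"


def synthesize(q: str, outputs: Dict[str, str]) -> str:
    return "\n".join(f"[{role}] {text}" for role, text in outputs.items())
```

In a real system each toy function would be an LLM call with a role-specific prompt; the structure only shows that every role sees the same evidence and a single synthesis step produces the answer.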

3. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arXiv: 2604.18401 · cs.CL · relevance score 26

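StepPO argues that agentic RL should move from a token-level MDP to a step-level MDP, treating the step as the action granularity of an LLM agent, and proposes a matching step-level credit assignment scheme so that policy optimization aligns with the agent's actual decisions.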
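
One simple way to picture step-level credit assignment, assuming the environment returns a scalar reward per agent step and using the mean return as a crude baseline (both are assumptions; this is not StepPO's estimator). The advantage is computed once per step and then shared by every token the agent emitted within that step, rather than assigned token by token. `step_advantages` and `broadcast_to_tokens` are hypothetical names.

```python
from typing import List


def step_advantages(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go per step, minus the mean return as a simple baseline."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


def broadcast_to_tokens(step_adv: List[float], tokens_per_step: List[int]) -> List[float]:
    """Every token generated inside a step receives that step's advantage."""
    out: List[float] = []
    for adv, n_tokens in zip(step_adv, tokens_per_step):
        out.extend([adv] * n_tokens)
    return out
```

For example, `step_advantages([0.0, 0.0, 1.0], gamma=0.5)` yields returns of 0.25, 0.5, and 1.0 before the baseline is subtracted, so later steps closer to the reward get larger advantages.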

4. AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

arXiv: 2604.18137 · cs.AR · relevance score 25

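AQPIM compresses LLM activations and the KV cache inside processing-in-memory (PIM) hardware using Product Quantization, breaking through the PIM capacity wall and achieving a 3.4× speedup over state-of-the-art PIM designs.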
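
For readers unfamiliar with Product Quantization, a minimal NumPy sketch of the generic technique: split each vector into `m` subvectors, learn a small k-means codebook per subspace, and store only one centroid index per subvector. The function names and parameters are illustrative assumptions; the sketch does not model how AQPIM lays out codebooks or computes inside PIM.

```python
import numpy as np


def train_pq(x: np.ndarray, m: int, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Learn one k-centroid codebook per subspace with plain k-means; shape (m, k, d // m)."""
    n, d = x.shape
    if d % m != 0:
        raise ValueError("vector dimension must split evenly into m subvectors")
    ds = d // m
    rng = np.random.default_rng(seed)
    codebooks = np.empty((m, k, ds), dtype=x.dtype)
    for j in range(m):
        sub = x[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, size=k, replace=False)].copy()  # init from random rows
        for _ in range(iters):
            # Assignment step: nearest centroid for every subvector.
            assign = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
            # Update step: move each non-empty centroid to the mean of its members.
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(0)
        codebooks[j] = cent
    return codebooks


def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Replace each subvector by the index of its nearest centroid (one byte when k <= 256)."""
    m, k, ds = codebooks.shape
    if k > 256:
        raise ValueError("uint8 codes need k <= 256")
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = x[:, j * ds:(j + 1) * ds]
        codes[:, j] = ((sub[:, None, :] - codebooks[j][None]) ** 2).sum(-1).argmin(1)
    return codes


def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Lossy reconstruction: concatenate the selected centroid of every subspace."""
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(codebooks.shape[0])], axis=1)
```

With 64-dimensional float32 rows, `m=8` and `k=16`, each row shrinks from 256 bytes to 8 one-byte codes (codebooks excluded), at the cost of an approximate reconstruction via `pq_decode`.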

5. Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

arXiv: 2604.18364 · cs.AI · relevance score 24

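The paper proposes two training and inference pipelines, ManimTrainer (SFT + GRPO) and ManimAgent (RITL / RITL-DOC), and presents the first systematic study of LLM-based Manim animation generation as a text-to-code-to-video task.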
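
GRPO avoids a learned value model by normalizing rewards within a group of completions sampled for the same prompt. A minimal sketch of that group-relative advantage, assuming a hypothetical binary reward (for example, 1 if a generated Manim script renders without error); the reward design is an assumption, not ManimTrainer's.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r_i - mean) / (std + eps) over one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]` the advantages are approximately `[1, -1, -1, 1]`; a group in which every sample gets the same reward contributes zero advantage and therefore no gradient signal.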

6. HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

arXiv: 2604.18529 · cs.PF · relevance score 22

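HybridGen proposes a CPU-GPU collaborative attention framework that, combined with CXL-expanded memory, targets long-context LLM inference and achieves average speedups of 1.41×–3.2× over six KV cache management baselines.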
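
The summary does not say how the CPU and GPU halves of attention are combined. One standard, exact way to split attention across two KV partitions (for instance GPU-resident recent tokens and CPU/CXL-resident older ones) is to compute each partition's softmax-weighted output together with its log-sum-exp and then merge the two. A NumPy sketch under that assumption; the device placement is only simulated, and the function names are hypothetical rather than HybridGen's kernels.

```python
import numpy as np


def partial_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Softmax attention of one query over one KV partition, plus the log-sum-exp of its scores."""
    scores = k @ q / np.sqrt(q.shape[-1])   # (n,)
    mx = scores.max()
    w = np.exp(scores - mx)                 # numerically stable exponentials
    out = (w @ v) / w.sum()                 # (d_v,)
    lse = mx + np.log(w.sum())
    return out, lse


def merge_partials(out_a: np.ndarray, lse_a: float, out_b: np.ndarray, lse_b: float) -> np.ndarray:
    """Exactly combine two partial results into attention over the concatenated KV cache."""
    m = max(lse_a, lse_b)
    wa, wb = np.exp(lse_a - m), np.exp(lse_b - m)
    return (wa * out_a + wb * out_b) / (wa + wb)
```

Because each side only returns a `d_v`-sized vector and one scalar, the partition held in host memory does not need to be copied to the GPU for the result to match full attention.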

7. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

arXiv: 2604.18655 · cs.DC · relevance score 20

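An on-device LLM deployment framework for Samsung Galaxy S24/S25: multiple LoRA adapters share a single frozen inference graph, combined with multi-stream decoding and DS2D self-speculative decoding, yielding 4–6× improvements in memory and latency.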
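
A minimal sketch of the standard LoRA formulation, y = xW + (α/r)·xAB, with one frozen base weight shared by several adapters that are selected per request. The class name, adapter names, and `alpha` value are hypothetical; the sketch does not reproduce the paper's single frozen inference graph on Galaxy devices.

```python
from typing import Dict, Optional, Tuple

import numpy as np


class MultiLoRALinear:
    """One frozen base weight shared by all adapters; each adapter adds only a low-rank delta."""

    def __init__(self, w_base: np.ndarray, alpha: float = 16.0):
        self.w = w_base                       # (d_in, d_out), frozen and shared
        self.alpha = alpha
        self.adapters: Dict[str, Tuple[np.ndarray, np.ndarray]] = {}  # name -> (A, B)

    def add_adapter(self, name: str, a: np.ndarray, b: np.ndarray) -> None:
        # A: (d_in, r), B: (r, d_out)
        self.adapters[name] = (a, b)

    def __call__(self, x: np.ndarray, adapter: Optional[str] = None) -> np.ndarray:
        y = x @ self.w                        # shared base path
        if adapter is not None:
            a, b = self.adapters[adapter]
            r = a.shape[1]
            y = y + (x @ a) @ b * (self.alpha / r)  # per-adapter low-rank update
        return y
```

Each extra adapter costs only r·(d_in + d_out) parameters instead of another d_in·d_out base matrix, which is what makes keeping many task adapters resident on a phone plausible.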

8. River-LLM: Large Language Model Seamless Exit Based on KV Share

arXiv: 2604.18396 · cs.CL · relevance score 20

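River-LLM proposes a training-free, token-level early-exit framework that uses a KV-Shared Exit River to solve the problem of missing KV cache entries in decoder-only architectures, achieving a 1.71–2.16× inference speedup.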
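
The KV cache problem shows up in a toy sketch of token-level early exit: once a token exits at layer i, the deeper layers never compute its key/value entries, yet later tokens will need to attend to them there. This sketch fills the gap by copying the exit layer's state, a naive placeholder chosen only for illustration; the summary does not detail how River-LLM's KV-Shared Exit River works. The confidence threshold, the use of the hidden state as a stand-in KV entry, and all names are assumptions.

```python
from typing import Callable, List, Tuple

import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()


def decode_token(
    h: np.ndarray,
    layers: List[Callable[[np.ndarray], np.ndarray]],
    exit_head: Callable[[np.ndarray], np.ndarray],
    kv_caches: List[list],
    threshold: float = 0.9,
) -> Tuple[np.ndarray, int]:
    """Run layers until the exit head is confident; return (logits, exit layer index)."""
    n = len(layers)
    for i, layer in enumerate(layers):
        h = layer(h)
        kv_caches[i].append(h.copy())         # toy: hidden state stands in for this layer's KV
        logits = exit_head(h)
        if softmax(logits).max() >= threshold or i == n - 1:
            # Skipped deeper layers would otherwise have no KV entry for this token.
            for j in range(i + 1, n):
                kv_caches[j].append(h.copy())  # naive fill: reuse the exit layer's state
            return logits, i
    raise ValueError("layers must be non-empty")
```

Without the fill loop, the caches of layers above the exit point would be shorter than the sequence, which is the gap the paper's KV-sharing design addresses.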

9. Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

arXiv: 2604.18170 · cs.CL · relevance score 20

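Copy-as-Decode recasts LLM editing as grammar-constrained decoding over two primitives, <copy> and <gen>, so that copied spans go through parallel prefill instead of token-by-token autoregressive decoding; on Qwen2.5 it reports up to a 303× kernel speedup and a 13× end-to-end upper bound.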
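
A minimal sketch of the bookkeeping behind the copy/gen split, assuming an edit is encoded as a list of hypothetical tuples: `("copy", start, end)` for a half-open span of source tokens and `("gen", tokens)` for new text. It models only why copy spans are cheap (appended as one block, as a parallel prefill would process them) while each generated token costs a decoding step; the grammar constraint and the kernel are not modeled.

```python
from typing import List, Tuple, Union

Copy = Tuple[str, int, int]   # ("copy", start, end): half-open span of source tokens
Gen = Tuple[str, List[str]]   # ("gen", tokens): newly generated tokens
Op = Union[Copy, Gen]


def apply_edit(source: List[str], ops: List[Op]) -> Tuple[List[str], int, int]:
    """Rebuild the edited sequence; count autoregressive steps vs. tokens taken by bulk copy."""
    out: List[str] = []
    gen_steps = 0
    copied = 0
    for op in ops:
        if op[0] == "copy":
            _, start, end = op
            if not 0 <= start <= end <= len(source):
                raise ValueError(f"copy span out of range: {op}")
            out.extend(source[start:end])   # whole span appended at once (prefill-style)
            copied += end - start
        elif op[0] == "gen":
            for tok in op[1]:
                out.append(tok)             # one decoding step per generated token
                gen_steps += 1
        else:
            raise ValueError(f"unknown op: {op[0]}")
    return out, gen_steps, copied
```

For a small edit such as changing one literal in an eight-token source, only one token is decoded autoregressively and the other seven are copied, which is the ratio that drives the reported speedups.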

10. ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

arXiv: 2604.18789 · cs.AI · relevance score 19

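ARES proposes an adaptive red-teaming framework that attacks the policy and the reward model simultaneously, then uses two-stage fine-tuning to repair the "systemic weaknesses" that arise from the coupling between the two.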

April 27, 2026 ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
April 27, 2026 Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
April 27, 2026 River-LLM: Large Language Model Seamless Exit Based on KV Share
April 27, 2026 Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
April 27, 2026 HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
April 27, 2026 Training and Agentic Inference Strategies for LLM-based Manim Animation Generation
April 27, 2026 AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
April 27, 2026 StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
April 27, 2026 MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation
April 27, 2026 First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows