<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-20 Paper Digest on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/posts/2026-04-20/</link>
    <description>Recent content in 2026-04-20 Paper Digest on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 05:28:42 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/posts/2026-04-20/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System</title>
      <link>https://ftxj.github.io/posts/2026-04-20/10-ares-adaptive-red-teaming-and-end-to-end-repair-of-policy-re/</link>
      <pubDate>Mon, 27 Apr 2026 05:28:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/10-ares-adaptive-red-teaming-and-end-to-end-repair-of-policy-re/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18789v1&#34;&gt;2604.18789&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18789v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, serving, fine-tun, rlhf&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;ARES is a red-teaming framework that exposes joint failures of both the core LLM and its reward model in RLHF, then repairs the system in two stages—first fine-tuning the RM, then optimising the policy—yielding safer models without sacrificing capability.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing</title>
      <link>https://ftxj.github.io/posts/2026-04-20/09-copy-as-decode-grammar-constrained-parallel-prefill-for-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:28:00 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/09-copy-as-decode-grammar-constrained-parallel-prefill-for-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18170v1&#34;&gt;2604.18170&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18170v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ziyang Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, kv cache, speculative decoding, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Copy-as-Decode reframes LLM text/code editing as grammar-constrained decoding over two primitives (&lt;code&gt;&amp;lt;copy&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;gen&amp;gt;&lt;/code&gt;), letting copy spans be filled via a single parallel-prefill forward instead of N autoregressive steps, yielding large theoretical speedups without end-to-end training.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Most edit outputs are verbatim copies of the input, so regenerating them autoregressively is wasteful.&lt;/li&gt;&#xA;&lt;li&gt;A two-primitive grammar (&lt;code&gt;&amp;lt;copy lines=&amp;quot;i-j&amp;quot;/&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;gen&amp;gt;...&amp;lt;/gen&amp;gt;&lt;/code&gt;) with a token-level FSM guarantees syntactic validity.&lt;/li&gt;&#xA;&lt;li&gt;Copy spans reuse the speculative-decoding parallel-forward kernel, but with input tokens as the &amp;ldquo;draft&amp;rdquo; and grammar-enforced (not probabilistic) acceptance.&lt;/li&gt;&#xA;&lt;li&gt;Paper gives an upper-bound analysis — no training required — separating kernel speedup, copy coverage ceiling, and pipeline losslessness.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;At decode time the model emits grammar tokens; a deterministic resolver expands &lt;code&gt;&amp;lt;copy&amp;gt;&lt;/code&gt; tags by issuing one parallel-prefill forward that updates the KV cache for the whole span, while &lt;code&gt;&amp;lt;gen&amp;gt;&lt;/code&gt; falls back to standard autoregressive decoding. An FSM enforces legal token transitions. Line-level and finer token-level primitives are both analyzed.&lt;/p&gt;</description>
    </item>
    <item>
      <title>River-LLM: Large Language Model Seamless Exit Based on KV Share</title>
      <link>https://ftxj.github.io/posts/2026-04-20/08-river-llm-large-language-model-seamless-exit-based-on-kv-sha/</link>
      <pubDate>Mon, 27 Apr 2026 05:27:28 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/08-river-llm-large-language-model-seamless-exit-based-on-kv-sha/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18396v1&#34;&gt;2604.18396&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18396v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yingtao Shen, An Zou&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, inference, kv cache, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;River-LLM is a training-free Early Exit framework for decoder-only LLMs that solves the KV Cache Absence problem via a lightweight KV-Shared Exit River, achieving 1.71–2.16× wall-clock speedup on reasoning and code tasks without quality loss.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Identifies &lt;strong&gt;KV Cache Absence&lt;/strong&gt; as the core bottleneck preventing Early Exit from delivering practical speedup in decoder-only LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Proposes a &lt;strong&gt;KV-Shared Exit River&lt;/strong&gt;: skipped layers still produce usable KV entries, avoiding recomputation or masking.&lt;/li&gt;&#xA;&lt;li&gt;Uses &lt;strong&gt;state transition similarity&lt;/strong&gt; across decoder blocks to predict cumulative KV errors and drive per-token exit decisions.&lt;/li&gt;&#xA;&lt;li&gt;Training-free — drops into existing models without fine-tuning.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;River-LLM adds a lightweight side path (&amp;ldquo;Exit River&amp;rdquo;) that shares/propagates KV states so that layers skipped by Early Exit still contribute KV cache entries consistent with the backbone. Exit decisions are made token-by-token using a predictor based on inter-block state transition similarity, estimating cumulative KV error and stopping when safe. No retraining is required.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM</title>
      <link>https://ftxj.github.io/posts/2026-04-20/07-unlocking-the-edge-deployment-and-ondevice-acceleration-of-m/</link>
      <pubDate>Mon, 27 Apr 2026 05:26:54 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/07-unlocking-the-edge-deployment-and-ondevice-acceleration-of-m/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18655v2&#34;&gt;2604.18655&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18655v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AI, cs.CL, cs.DC&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, quantization, speculative decoding, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A hardware-aware framework deploys a LLaMA-based multilingual foundation model on Samsung Galaxy S24/S25 phones, combining runtime multi-LoRA switching, multi-stream decoding, dynamic self-speculative decoding, and INT4 quantization to achieve 4-6x memory/latency improvements across 9 languages and 8 tasks.&lt;/p&gt;</description>
    </item>
    <item>
      <title>HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing</title>
      <link>https://ftxj.github.io/posts/2026-04-20/06-hybridgen-efficient-llm-generative-inference-via-cpu-gpu-hyb/</link>
      <pubDate>Mon, 27 Apr 2026 05:26:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/06-hybridgen-efficient-llm-generative-inference-via-cpu-gpu-hyb/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18529v1&#34;&gt;2604.18529&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18529v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.PF&lt;/code&gt; · all: cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, inference, kv cache, parallelism, attention, gpu, scheduler&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HybridGen is a CPU-GPU hybrid attention framework for long-context LLM inference that leverages CXL-expanded tiered memory. By coordinating attention computation across CPU and GPU, it outperforms six SOTA KV cache management methods by 1.41x-3.2x while preserving accuracy.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Existing KV cache pruning/offloading underutilizes hardware by computing attention on only one device.&lt;/li&gt;&#xA;&lt;li&gt;Tiered memory (e.g., CXL) expands CPU-local KV capacity but introduces NUMA penalties.&lt;/li&gt;&#xA;&lt;li&gt;Collaborative CPU-GPU attention needs new parallelism, scheduling, and data placement strategies.&lt;/li&gt;&#xA;&lt;li&gt;Three challenges: multi-dim attention dependencies, load imbalance with long sequences, NUMA penalty.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;HybridGen introduces three mechanisms:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Training and Agentic Inference Strategies for LLM-based Manim Animation Generation</title>
      <link>https://ftxj.github.io/posts/2026-04-20/05-training-and-agentic-inference-strategies-for-llm-based-mani/</link>
      <pubDate>Mon, 27 Apr 2026 05:25:48 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/05-training-and-agentic-inference-strategies-for-llm-based-mani/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18364v1&#34;&gt;2604.18364&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18364v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.GR, cs.MA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, inference, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ManimTrainer (SFT + GRPO with fused code/visual rewards) and ManimAgent (Renderer-in-the-loop inference with API-doc augmentation) for text-to-code-to-video Manim animation. A Qwen 3 Coder 30B variant hits 94% render success and 85.7% visual similarity, beating GPT-4.1.&lt;/p&gt;</description>
    </item>
    <item>
      <title>AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization</title>
      <link>https://ftxj.github.io/posts/2026-04-20/04-aqpim-breaking-the-pim-capacity-wall-for-llms-with-in-memory/</link>
      <pubDate>Mon, 27 Apr 2026 05:24:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/04-aqpim-breaking-the-pim-capacity-wall-for-llms-with-in-memory/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18137v1&#34;&gt;2604.18137&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18137v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AI, cs.AR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, kv cache, quantization, attention, transformer, gpu, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;AQPIM is a PIM-aware activation quantization framework that applies Product Quantization (PQ) directly inside memory to shrink KV-cache footprint and accelerate LLM attention, achieving 3.4× speedup over SOTA PIM baselines while slashing GPU-CPU communication overhead.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Activation (KV cache) memory, not just weights, is the real PIM capacity wall for long-context LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Clustering-based vector quantization (specifically PQ) aligns with activation statistics and PIM&amp;rsquo;s internal bandwidth.&lt;/li&gt;&#xA;&lt;li&gt;Quantization performed &lt;em&gt;inside&lt;/em&gt; memory enables direct compute on compressed data.&lt;/li&gt;&#xA;&lt;li&gt;Algorithmic tweaks restore PQ accuracy for modern LLMs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;AQPIM builds a PIM-specialized activation quantization pipeline around Product Quantization. Activations are split into sub-vectors, clustered, and stored as codebook indices directly in PIM banks. Attention computation then operates on the compressed representation, exploiting PIM&amp;rsquo;s high internal bandwidth. Several (unspecified) algorithmic optimizations mitigate PQ&amp;rsquo;s accuracy loss on LLM activations.&lt;/p&gt;</description>
    </item>
    <item>
      <title>StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning</title>
      <link>https://ftxj.github.io/posts/2026-04-20/03-steppo-step-aligned-policy-optimization-for-agentic-reinforc/</link>
      <pubDate>Mon, 27 Apr 2026 05:24:17 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/03-steppo-step-aligned-policy-optimization-for-agentic-reinforc/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18401v1&#34;&gt;2604.18401&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18401v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, tool use, reasoning, post-train, rlhf&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;StepPO argues that Agentic RL for LLMs should move from token-level to step-level MDPs, treating each agent step (not token) as the action unit and doing credit assignment at that granularity. The paper is a position piece with preliminary experiments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
      <link>https://ftxj.github.io/posts/2026-04-20/02-mass-rag-multi-agent-synthesis-retrieval-augmented-generatio/</link>
      <pubDate>Mon, 27 Apr 2026 05:23:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/02-mass-rag-multi-agent-synthesis-retrieval-augmented-generatio/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18509v2&#34;&gt;2604.18509&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18509v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, retrieval, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;MASS-RAG is a multi-agent collaborative retrieval-augmented generation framework that splits evidence processing across three role-specialized agents (summarization, extraction, reasoning) and integrates their outputs in a synthesis stage, improving answer quality over noisy and heterogeneous retrieved contexts.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;A single generation pass struggles to reconcile noisy, incomplete, and heterogeneous retrieved evidence.&lt;/li&gt;&#xA;&lt;li&gt;Decouples RAG into role-specialized agents: summarization, extraction, and reasoning.&lt;/li&gt;&#xA;&lt;li&gt;A dedicated synthesis stage fuses the multi-view intermediate evidence before generating the final answer.&lt;/li&gt;&#xA;&lt;li&gt;Multiple intermediate evidence views make it easier to compare and integrate complementary information.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Architecture: retrieval → three specialized agents run in parallel (evidence summarization / evidence extraction / reasoning) → a synthesis agent aggregates their intermediate outputs → the final answer is generated.&lt;/li&gt;&#xA;&lt;li&gt;Each agent produces intermediate representations of the same retrieved documents at a different granularity, exposing multiple evidence paths.&lt;/li&gt;&#xA;&lt;li&gt;The synthesis stage acts as an arbiter that compares and integrates complementary or conflicting evidence.&lt;/li&gt;&#xA;&lt;li&gt;The abstract does not specify prompt templates, inter-agent communication protocols, or backbone models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;experiments&#34;&gt;Experiments&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Four RAG benchmarks (names not disclosed).&lt;/li&gt;&#xA;&lt;li&gt;Compared against strong RAG baselines (unnamed).&lt;/li&gt;&#xA;&lt;li&gt;Evaluation focuses on cases where evidence is scattered across multiple retrieved contexts.&lt;/li&gt;&#xA;&lt;li&gt;The abstract omits dataset sizes, retriever settings, and evaluation metrics.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Claims to &amp;quot;consistently&amp;quot; outperform strong baselines on all four benchmarks.&lt;/li&gt;&#xA;&lt;li&gt;Gains are larger when evidence is scattered across contexts.&lt;/li&gt;&#xA;&lt;li&gt;The abstract reports no numerical gains, so the size of the improvement cannot be independently verified.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;why-it-matters&#34;&gt;Why It Matters&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Offers a composable agentic RAG pattern for noisy or long-tail retrieval results.&lt;/li&gt;&#xA;&lt;li&gt;Gives practitioners a template for explicit role division and an evidence-fusion layer inside a RAG pipeline.&lt;/li&gt;&#xA;&lt;li&gt;Relevant to engineers building high-reliability knowledge QA and enterprise RAG systems.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;connections-to-prior-work&#34;&gt;Connections to Prior Work&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Self-RAG and Chain-of-Note: explicit evidence processing and annotation.&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent LLM collaboration (AutoGen, MetaGPT, Debate): role-specialized agent coordination.&lt;/li&gt;&#xA;&lt;li&gt;Robust RAG methods such as CRAG and RA-DIT: handling noisy or low-quality retrieval.&lt;/li&gt;&#xA;&lt;li&gt;Map-reduce / hierarchical summarization for long contexts.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;What inference cost and latency does the multi-agent setup add? Is it worth N times the tokens of a single call?&lt;/li&gt;&#xA;&lt;li&gt;Do the agents share one backbone LLM, and do they need dedicated fine-tuning?&lt;/li&gt;&#xA;&lt;li&gt;How does the synthesis stage resolve conflicting evidence between agents? Is there explicit voting or confidence scoring?&lt;/li&gt;&#xA;&lt;li&gt;How robust is it under adversarial or highly redundant retrieval?&lt;/li&gt;&#xA;&lt;li&gt;Does it retain an advantage over stronger single-model long-context reasoning (e.g., Gemini or Claude long-context windows)?&lt;/li&gt;&#xA;&lt;/ul&gt;</description>
    </item>
    <item>
      <title>First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-20/01-first-do-no-harm-with-llms-mitigating-racial-bias-via-agenti/</link>
      <pubDate>Mon, 27 Apr 2026 05:23:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/01-first-do-no-harm-with-llms-mitigating-racial-bias-via-agenti/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18038v1&#34;&gt;2604.18038&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18038v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sihao Xing, Zaur Gouliev&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CY&lt;/code&gt; · all: cs.AI, cs.CY&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, retrieval, reasoning, attention, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This study evaluates racial bias in five LLMs across synthetic patient-case generation and differential diagnosis tasks, finding that all five deviate from US epidemiological distributions. Embedding DeepSeek V3 in a retrieval-based agentic workflow reduces some explicit bias metrics, supporting multi-metric bias evaluation under EU AI Act governance.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
