2026-04-28 Paper Digest on JXIN's Home

AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents

Tue, 28 Apr 2026 14:35:36 +0000

Authors: Hojoon Kim, Yuheng Wu, Thierry Tambe

Affiliations: Stanford University, Harvard University

Primary category: cs.LG · All: cs.AI, cs.CL, cs.LG

Matched keywords: large language model, llm, agent, agentic, multi-agent, rag, latency

TL;DR

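AgenticCache exploits the "plan locality" of embodied tasks: a 2-gram plan cache plus a background asynchronous LLM updater lets agents avoid calling the LLM at every step. Across four multi-agent benchmarks it reports, on average, success rate +22%, latency -65%, and token usage -50%.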

Motivation

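LLM-driven embodied agents currently use a synchronous plan-act loop: before every action the agent waits for the LLM to return a new plan. As a result, Figure 2 shows that more than 70% of runtime across benchmarks is spent on LLM planning queries. For long-horizon multi-agent simulations that need thousands of steps (TDW-MAT, BEHAVIOR-1K, and others), this means the GPT-5 baseline takes 41.34 hours and costs 40.5 USD on TDW-MAT (Table 2), which is hard to bear for any team that wants to run large-scale evaluation or deploy in practice.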
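
To make the "2-gram plan cache plus background updater" idea from the TL;DR concrete, here is a minimal sketch. It is not the paper's implementation: the cache key (the previous plan step only), the `llm_plan` stand-in, and the refresh policy are all assumptions made for illustration, and the digest does not describe how AgenticCache validates or invalidates cached plans.

```python
import threading
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


class PlanCache:
    """2-gram cache: previous plan step -> counts of the steps that followed it."""

    def __init__(self):
        self._next = defaultdict(Counter)
        self._lock = threading.Lock()

    def update(self, prev_step, next_step):
        with self._lock:
            self._next[prev_step][next_step] += 1

    def lookup(self, prev_step):
        """Return the most frequent follower of prev_step, or None on a miss."""
        with self._lock:
            counts = self._next.get(prev_step)
            return counts.most_common(1)[0][0] if counts else None


def llm_plan(prev_step):
    """Hypothetical stand-in for a slow LLM planning call."""
    script = {
        "start": "goto_kitchen",
        "goto_kitchen": "pick_cup",
        "pick_cup": "goto_table",
        "goto_table": "place_cup",
    }
    return script.get(prev_step, "idle")


def refresh(cache, prev_step):
    """Background job: ask the planner and fold its answer into the cache."""
    cache.update(prev_step, llm_plan(prev_step))


def run_episode(cache, n_steps, executor):
    prev, trace, jobs = "start", [], []
    for _ in range(n_steps):
        step, source = cache.lookup(prev), "cache"
        if step is None:
            # Cache miss: block on the planner, as the synchronous loop would.
            step, source = llm_plan(prev), "llm"
        # Refresh the cache off the critical path (assumed policy: every step).
        jobs.append(executor.submit(refresh, cache, prev))
        trace.append((step, source))
        prev = step
    for job in jobs:
        job.result()  # wait so the next episode sees every update
    return trace


if __name__ == "__main__":
    cache = PlanCache()
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(run_episode(cache, 4, pool))  # first pass: planner on every step
        print(run_episode(cache, 4, pool))  # second pass: served from the cache
```

In this toy setup the second episode never blocks on the planner, which is the effect the synchronous plan-act loop cannot provide. A real system would also need a way to detect stale cache entries.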

BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

Tue, 28 Apr 2026 14:27:15 +0000

arXiv: 2604.24273 · PDF

Authors: Md. Ashiq Ul Islam Sajid, Mohammad Sakib Mahmood, Md. Tareq Hasan, Md Abdur Rahim, Rafat Ara, Md. Arafat Hossain

Affiliations: N/A

Primary category: cs.LG · All: cs.LG

Matched keywords: large language model, llm, agent, rag, inference, quantization, latency

Automatic analysis unavailable (claude CLI timeout). Showing the original abstract.

Abstract

The deployment of intelligent reinforcement learning (RL) agents on resource-constrained edge devices remains a fundamental challenge due to the substantial memory, computational, and energy requirements of modern deep learning systems. While large language models (LLMs) have emerged as powerful architectures for decision-making agents, their multi-billion parameter scale confines them to cloud-based deployment, raising concerns about latency, privacy, and connectivity dependence. We introduce BitRL, a framework for building RL agents using 1-bit quantized language models that enables practical on-device learning and inference under severe resource constraints. Leveraging the BitNet b1.58 architecture with ternary weights (-1, 0, +1) and an optimized inference stack, BitRL achieves 10-16x memory reduction and 3-5x energy efficiency improvements over full-precision baselines while maintaining 85-98 percent of task performance across benchmarks. We provide theoretical analysis of quantization as structured parameter perturbation, derive convergence bounds for quantized policy gradients under frozen-backbone architectures, and identify the exploration-stability trade-off in extreme quantization. Our framework systematically integrates 1-bit quantized language models with reinforcement learning for edge deployment and demonstrates effectiveness on commodity hardware.
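
As a rough illustration of the ternary weights (-1, 0, +1) mentioned in the abstract, the sketch below applies absmean quantization, the scheme described for BitNet b1.58: divide by the mean absolute weight, round, and clip to [-1, 1]. Whether BitRL uses exactly this procedure is an assumption, since the abstract gives no details. The sample weights are made up, and the compression ratio only counts raw weight bits.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]


def ideal_compression_vs_fp16():
    """A ternary value carries log2(3) bits, versus 16 bits for FP16."""
    return 16 / math.log2(3)


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, -1.2]  # made-up example weights
    q, s = ternary_quantize(w)
    print(q, round(s, 4))
    print([round(v, 4) for v in dequantize(q, s)])
    print(round(ideal_compression_vs_fp16(), 2))
```

The information-theoretic ratio against FP16 comes out at roughly 10x, at the low end of the 10-16x range quoted in the abstract. The abstract does not say which baseline precision or packing format the reported figures assume.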

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Tue, 28 Apr 2026 14:18:44 +0000

arXiv: 2604.24647 · PDF

Authors: Zahra Dehghanighobadi, Asja Fischer

Affiliations: Ruhr University Bochum, UAR Research Center for Trustworthy Data Science and Security

Primary category: cs.CL · All: cs.AI, cs.CL

Matched keywords: large language model, llm, reasoning, inference, kv cache, attention

TL;DR

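DepthKV observes that Transformer layers differ markedly in their sensitivity to KV cache pruning. Using representation metrics such as InfoNCE, it allocates a fixed global budget non-uniformly across layers and consistently outperforms uniform pruning on summarization, QA, and mathematical reasoning tasks.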

Motivation

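The bottleneck of long-context LLM inference has shifted from compute to GPU memory: the KV cache grows linearly with sequence length, and serving very long documents (3K–10K tokens in this paper's experiments) can saturate GPU HBM during the prefill stage. Existing post-training KV pruning methods (H2O, StreamingLLM, SnapKV, FastGen) almost all assume that every Transformer layer is equally important and prune each layer at the same ratio. This is simple to engineer, but the authors argue it is wrong. Skean et al. (2025) already showed that intermediate-layer representations matter more, but nobody had verified whether this extends to KV pruning. Today, teams that need long-document summarization, multi-hop QA, or long-chain mathematical reasoning under a fixed memory budget must either make do with uniform pruning or fall back to the full KV cache and shrink the batch size. The authors' claim: if you can identify which layers collapse when pruned, you can redistribute the same global budget across layers and get a free quality gain, with no architecture changes and no retraining.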
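
The core mechanism, redistributing a fixed global KV budget across layers according to how sensitive each layer is, can be sketched as follows. This is only an illustration under stated assumptions: the per-layer `sensitivity` scores are hypothetical inputs (the paper derives its scores from representation metrics such as InfoNCE, which are not reproduced here), and the proportional split with a per-layer floor is an assumed rule, not necessarily the one DepthKV uses.

```python
def allocate_layer_budgets(sensitivity, total_budget, min_per_layer=1):
    """Split total_budget KV entries across layers in proportion to sensitivity.

    The returned budgets always sum to exactly total_budget.
    """
    n_layers = len(sensitivity)
    if total_budget < n_layers * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer floor")
    score_sum = sum(sensitivity)
    if score_sum <= 0:
        raise ValueError("sensitivity scores must have a positive sum")

    spare = total_budget - n_layers * min_per_layer
    raw = [spare * s / score_sum for s in sensitivity]
    budgets = [min_per_layer + int(r) for r in raw]

    # Hand out entries lost to truncation, largest fractional part first.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n_layers), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Hypothetical scores for a 4-layer model; layer 2 is the most sensitive.
    scores = [1.0, 3.0, 4.0, 2.0]
    budgets = allocate_layer_budgets(scores, total_budget=400)
    print(budgets, sum(budgets))  # uniform pruning would give 100 per layer
```

Each layer would then keep only its top entries up to its own budget, using whatever token-importance score the underlying pruning method provides.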

Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

Tue, 28 Apr 2026 14:12:35 +0000

arXiv: 2604.24512 · PDF

Authors: Dahlia Shehata, Ming Li

Affiliations: University of Waterloo

Primary category: cs.AI · All: cs.AI

Matched keywords: llm, agent, agentic, retrieval, reasoning, attention, transformer

Automatic analysis unavailable (claude CLI timeout). Showing the original abstract.

Abstract

As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder-only autoregressive Transformers. This phenomenon, a behavioral manifestation of Information Over-squashing, occurs when the cumulative probabilistic weight of historical context overrides mid-task updates, causing agents to remain anchored to obsolete constraints despite explicit contradictory instructions. We propose Self-Synthesizing Reasoning Protocols (SSRP), a metacognitive framework that implements a discrete separation between high-level architectural planning (Architect) and turn-by-turn procedural execution (Executive). We evaluate SSRP across 9K trajectories using the MultiWOZ 2.2 dataset and the Aggregate Pivot Accuracy (APA), a novel metric we validate by mapping its scores to the U-shaped ‘Lost in the Middle’ curve. We present 3 experimental tiers: a shallow recency-based retrieval pilot, a high-entropy SOP, and a semantic hijacked 3-hop Multi-Fact Synthesis task. Our results empirically locate the Attention Stability Boundary, where stateless Vanilla ReAct baselines for GPT 5.4 collapse to 0.1% success while SSRP achieves a 715X Resilience Lift. We demonstrate statistically significant gains across Gemini 3.1 Pro, Claude Sonnet 4.6 and DeepSeek V3.2. Audits confirm SSRP necessity by proving attentional lapse via a recursive reflexion baseline (100% success); decoupling the latch from positional bias through equidistant stress testing (90% accuracy); and formalizing SSRP via the Information Bottleneck principle and granularity ablations. Procedural Integrity audit (98.8% adherence) reveals a Grounding Paradox where high-stability models fail by refusing to hallucinate under retrieval-reasoning contamination.

Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

Tue, 28 Apr 2026 14:04:02 +0000

arXiv: 2604.24062 · PDF

Authors: Liangru Xiang, Yuxi Ma, Zhihao Cao, Yixin Zhu, Song-Chun Zhu

Affiliations: Tsinghua University, Peking University, State Key Laboratory of General Artificial Intelligence, Beijing Key Laboratory of Behavior and Mental Health

Primary category: cs.AI · All: cs.AI

Matched keywords: large language model, llm, agent, rag, reasoning

TL;DR

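Using the OpenLock paradigm to compare humans with GPT-5.2, Claude-4.5, Gemini-3-Flash, and DeepSeek-V3.2, the authors find that models match or exceed humans within a single environment, but cross-environment transfer of causal structure only takes effect after the model first "grounds" itself in the new environment, which shows up as delayed transfer.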

Motivation

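Humans can extract an abstract causal structure (Common Cause / Common Effect) from a single interaction and transfer it immediately to a new environment, whereas classical RL agents fail catastrophically on OpenLock (Edmonds et al. 2018 [4]). LLMs and VLMs are strong on static reasoning benchmarks, but their ability to actively induce a latent causal graph through interaction and then transfer it across contexts has never been characterized systematically. The pain point the authors care about: current industry discussion of agent reasoning and VLM reasoning assumes that large models already have human-like structural abstraction, yet nobody has verified whether that holds in settings that require active trial and error and sequential discovery. For teams that use LLMs/VLMs as an agent runtime (planning, tool calling, generalization to new scenarios), if the model can really only "ground first, then generalize" rather than "generalize first, then act", product-level stability assumptions break. The authors argue the question is worth studying now because the latest generation of multimodal models (GPT-5.2, Claude-4.5, Gemini-3, DeepSeek-V3.2) is strong enough to beat the human baseline under text-only conditions, which makes it possible to cleanly separate "cannot solve it" from "can solve it but abstracts differently".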

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Tue, 28 Apr 2026 13:56:28 +0000

arXiv: 2604.24443 · PDF

Authors: Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang

Affiliations: The Chinese University of Hong Kong, Shenzhen, Rice University, City University of Hong Kong, Fudan University

Primary category: cs.AI · All: cs.AI

Matched keywords: agent, agentic, multi-agent, reasoning, inference

TL;DR

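PhysNote lets a VLM externalize and evolve its physical reasoning knowledge through self-generated "Knowledge Notes". Combined with spatiotemporal canonicalization and iterative verification by InfoAgent, it reaches 56.68% accuracy on the PhysBench test set.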

Motivation

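VLMs do well on textbook-style physics problems but often fail on multi-frame dynamic scenes: a large-scale evaluation of 75 VLMs on PhysBench shows that most models reach only about 40% accuracy on physical reasoning tasks, far below humans, and the gap does not shrink with larger model size, more training data, or more input frames. The authors attribute the failures to two root causes: (1) spatiotemporal identity drift: objects "swap identities" between consecutive frames and the causal chain is cut by hallucinated transitions, for example a post-collision trajectory that cannot be represented coherently; (2) volatility of reasoning insights: the model occasionally produces correct physical reasoning, but the insight evaporates with the context window once inference ends, so the next similar problem starts from scratch, like "a physicist with a goldfish's memory".

Today this problem affects scenarios that need precise physical inference, such as embodied agents, robot manipulation, and autonomous navigation. Current workarounds are "reason-act-observe" frameworks such as PhysAgent that bolt on tools like SAM and Depth Anything, or PCBs, which fine-tunes a small VLM to produce auxiliary descriptions. Either the reasoning chain is discarded after use or the approach relies on costly parameter fine-tuning, and both lack the ability to evolve knowledge autonomously. The authors argue that the workflow of a human physicist (taking notes, accumulating heuristics, reflecting on and correcting mistakes) can be externalized into a continuously updated structured knowledge base that neither touches the base model's weights nor depends on specialized vision tools.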

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

Tue, 28 Apr 2026 13:47:10 +0000

arXiv: 2604.24003 · PDF

Authors: Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu

Affiliations: UNC Chapel Hill, Advanced Micro Devices, Inc

Primary category: cs.CL · All: cs.CL, cs.LG

Matched keywords: large language model, llm, rag, reasoning, inference, post-train

TL;DR

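In GRPO post-training with a short 4K context, the method uses step-level confidence derived from token log-probs to zero-mask advantages inside each rollout, which stabilizes training and shortens reasoning length.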

Motivation

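Existing efficient-reasoning methods (L1, LAPO, ThinkPrune, and others) bundle a length-aware reward with short-context post-training: the base model is trained with a 16K–24K context while post-training is forced down to 4K, yet nobody had separately quantified how much compression the short context itself contributes. The authors run an ablation: pure GRPO with no length reward, post-training DeepScaleR-1.5B at a 4K context. Output length is compressed to the same level as LAPO/ThinkPrune or even shorter (Fig 2a), which indicates that the context window itself is a strong compression signal that was previously misattributed to length-reward design. The cost is unstable training: accuracy fluctuates and degrades late in training (Fig 2b), and policy entropy collapses quickly. The authors quantify the cause: when the base model's 8K rollouts are hard-truncated to 4K and re-scored with the same verifier, about 29% of the originally correct rollouts become verifier-failed, and most of them have only lost the final boxed answer or the closing derivation. Standard GRPO spreads the negative advantage evenly over the intermediate steps of these rollouts, steps that were in fact correct, which causes credit misassignment. This pain point directly affects every team doing RL-based efficient reasoning under a short context.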
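
The credit-assignment problem and the zero-masking remedy can be sketched in a few lines. The group-normalized advantage is the standard GRPO form. Everything about the mask is an assumption made for illustration: using the geometric-mean token probability as step confidence, the 0.8 threshold, and masking only high-confidence steps inside negatively scored rollouts. The digest does not spell out the paper's actual selection rule.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (reward - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_confidence(token_logprobs):
    """Geometric-mean token probability of one reasoning step."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def mask_step_advantages(advantage, steps_logprobs, threshold=0.8):
    """Per-step advantages for one rollout.

    Assumed rule: in a rollout with negative advantage, steps the policy was
    confident about get advantage 0, so they are not penalized for a failure
    that may only come from truncation. Other steps keep the rollout advantage.
    """
    per_step = []
    for logprobs in steps_logprobs:
        confident = step_confidence(logprobs) >= threshold
        per_step.append(0.0 if (advantage < 0 and confident) else advantage)
    return per_step


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # verifier outcomes for a group of 4 rollouts
    advantages = grpo_advantages(rewards)
    # Made-up log-probs for a failed rollout: a confident step, then an unsure one.
    failed_rollout_steps = [[-0.05, -0.10], [-1.20, -2.00]]
    print(mask_step_advantages(advantages[1], failed_rollout_steps))
```

With these made-up numbers the confident first step is masked to zero while the low-confidence second step still receives the negative advantage.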

The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Tue, 28 Apr 2026 13:39:37 +0000

arXiv: 2604.24698 · PDF

Authors: Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan, Jen-tse Huang

Affiliations: CMU, UChicago, MIT, 2077.ai, UTokyo, RIKEN AIP, JHU

Primary category: cs.CL · All: cs.CL

Matched keywords: large language model, llm, agent, multi-agent, rag, reasoning

Automatic analysis unavailable (claude CLI timeout). Showing the original abstract.

Abstract

Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term Persona Collapse: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations. We release our toolkit and data to support population-level evaluation of LLMs.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Tue, 28 Apr 2026 13:31:07 +0000

arXiv: 2604.24715 · PDF

Authors: Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum

Affiliations: AMD

Primary category: cs.CL · All: cs.CL, cs.LG

Matched keywords: llm, reasoning, inference, serving, kv-cache, attention, transformer, post-train

Automatic analysis unavailable (claude CLI timeout). Showing the original abstract.

Abstract

Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32× through efficient post-training and reduces KV-cache memory by more than 90%, enabling up to 2M-token prefill and decoding in our vLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
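
To give a feel for why KV-cache memory dominates at these context lengths, the sketch below computes the size of a standard attention KV cache and what a 90% cut would leave. The model shape (16 layers, 8 KV heads, head dimension 64, 2-byte elements) is a hypothetical configuration chosen for round numbers, not a figure from the paper. The calculation ignores activations, weights, and allocator overhead, so it does not by itself explain where a given baseline runs out of memory.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Size of a standard attention KV cache: one K and one V vector per token,
    per layer, per KV head."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3

if __name__ == "__main__":
    # Hypothetical small-model shape, for illustration only.
    shape = dict(n_layers=16, n_kv_heads=8, head_dim=64)
    for tokens in (64 * 1024, 2 * 1024 * 1024):
        full = kv_cache_bytes(tokens, **shape)
        reduced = full * 0.10  # what remains after a 90% reduction
        print(f"{tokens:>8} tokens: full {full / GIB:.1f} GiB, "
              f"after 90% cut {reduced / GIB:.1f} GiB")
```

Because the cache is linear in sequence length, going from 64K to 2M tokens multiplies it by 32, which is why a reduction of this magnitude matters for million-token contexts.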

FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

Tue, 28 Apr 2026 13:22:11 +0000

arXiv: 2604.24013 · PDF

Authors: Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed

Affiliations: Toronto Ascend Team, Huawei

Primary category: cs.LG · All: cs.CV, cs.DC, cs.LG

Matched keywords: large language model, llm, inference, distributed training, parallelism, gpu, throughput, latency

Automatic analysis unavailable (claude CLI timeout). Showing the original abstract.

Abstract

The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data slicing based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state of the art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed Flash-Overlap that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.
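
The abstract's central move, replacing a collective such as reduce-scatter with decomposed peer-to-peer transfers, can be illustrated with a single-process simulation of the classic ring decomposition. This is a generic textbook schedule used here as a stand-in, not the Flash-Overlap algorithm itself. The abstract does not give the paper's schedule, and the simulation performs no real communication or compute overlap.

```python
def ring_reduce_scatter(rank_chunks):
    """Simulate reduce-scatter as n-1 rounds of neighbour-to-neighbour sends.

    rank_chunks[r][c] is rank r's local contribution to chunk c.
    Returns {rank: (chunk_index, fully_reduced_value)}.
    """
    n = len(rank_chunks)
    buffers = [list(chunks) for chunks in rank_chunks]
    for step in range(n - 1):
        # Collect all sends first so the round behaves as if they were simultaneous.
        messages = []
        for rank in range(n):
            chunk = (rank - step) % n
            messages.append(((rank + 1) % n, chunk, buffers[rank][chunk]))
        # Each receiver accumulates the partial sum it got from its left neighbour.
        for dst, chunk, value in messages:
            buffers[dst][chunk] += value
    return {rank: ((rank + 1) % n, buffers[rank][(rank + 1) % n]) for rank in range(n)}


if __name__ == "__main__":
    # Three simulated ranks, three chunks each (scalars stand in for tensor slices).
    local = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_reduce_scatter(local))
```

In a real system each round moves only one chunk per rank, so computation on other chunks could proceed while that transfer is in flight. That fine-grained scheduling is the part a P2P decomposition makes possible and that this toy simulation leaves out.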