2026-04-28 Paper Digest on JXIN's Home

AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents

Tue, 28 Apr 2026 14:31:08 +0000

Authors: Hojoon Kim, Yuheng Wu, Thierry Tambe

Affiliations: Stanford University, Harvard University

Primary category: cs.LG · all: cs.AI, cs.CL, cs.LG

Matched keywords: large language model, llm, agent, agentic, multi-agent, rag, latency

TL;DR

AgenticCache caches 2-gram plan transitions for LLM-driven embodied agents, serving most planning decisions from a local cache while a background LLM updater asynchronously validates and corrects entries. Across 4 multi-agent benchmarks × 3 GPT-5 scales, it lifts success rate by 22% on average, cuts latency 65%, and reduces tokens 50%.
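To make the caching idea concrete, here is a minimal sketch of a 2-gram transition cache with a deferred validation pass. It is an illustration under stated assumptions, not the authors' implementation: the class name `TransitionCache`, the `planner` callable (a stand-in for the slow LLM planner), and the synchronous `run_updater` method (a stand-in for the asynchronous background updater) are all hypothetical.

```python
from collections import deque


class TransitionCache:
    """Toy 2-gram plan cache: maps the previous plan step to the next one."""

    def __init__(self, planner):
        self.planner = planner   # hypothetical slow planner: prev_step -> next_step
        self.table = {}          # prev_step -> cached next_step
        self.pending = deque()   # cache hits waiting for validation
        self.hits = 0
        self.misses = 0

    def next_step(self, prev_step):
        """Serve from the cache when possible; fall back to the planner on a miss."""
        if prev_step in self.table:
            self.hits += 1
            self.pending.append(prev_step)  # validate later, off the critical path
            return self.table[prev_step]
        self.misses += 1
        step = self.planner(prev_step)
        self.table[prev_step] = step
        return step

    def run_updater(self):
        """Stand-in for the background updater: re-check queued entries.

        Returns the number of cache entries that were corrected.
        """
        corrected = 0
        while self.pending:
            prev_step = self.pending.popleft()
            fresh = self.planner(prev_step)
            if self.table.get(prev_step) != fresh:
                self.table[prev_step] = fresh
                corrected += 1
        return corrected
```

In a real system the updater would run concurrently with the agent; here it is called explicitly so the control flow stays easy to follow.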

BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

Tue, 28 Apr 2026 14:22:14 +0000

arXiv: 2604.24273 · PDF

Authors: Md. Ashiq Ul Islam Sajid, Mohammad Sakib Mahmood, Md. Tareq Hasan, Md Abdur Rahim, Rafat Ara, Md. Arafat Hossain

Affiliations: N/A

Primary category: cs.LG · all: cs.LG

Matched keywords: large language model, llm, agent, rag, inference, quantization, latency

TL;DR

BitRL freezes a 2B-parameter BitNet b1.58 backbone (ternary weights {−1,0,+1}) and trains only small (~50K-param) PPO policy/value heads, yielding RL agents that retain 85–98% of FP16 performance with 10–16× memory reduction and 3–5× energy savings on a Raspberry Pi 4.
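A quick back-of-the-envelope check of the memory claim, as a sketch only. It counts weight storage and nothing else (no activations, embeddings kept in higher precision, or runtime overhead), and it treats the ~50K head parameters as a single total, which is an assumption.

```python
import math

BACKBONE_PARAMS = 2_000_000_000   # 2B-parameter backbone (from the summary)
HEAD_PARAMS = 50_000              # assumption: ~50K covers both PPO heads in total
FP16_BITS = 16
TERNARY_BITS = math.log2(3)       # ~1.585 bits of information per {-1, 0, +1} weight


def weight_gib(params, bits_per_weight):
    """Storage for `params` weights at the given bit width, in GiB."""
    return params * bits_per_weight / 8 / 2**30


fp16_gib = weight_gib(BACKBONE_PARAMS, FP16_BITS)        # about 3.7 GiB
ternary_gib = weight_gib(BACKBONE_PARAMS, TERNARY_BITS)  # about 0.37 GiB
ratio = fp16_gib / ternary_gib                           # 16 / log2(3), about 10.1
trainable_fraction = HEAD_PARAMS / (BACKBONE_PARAMS + HEAD_PARAMS)
```

The ideal weight-only ratio of roughly 10× lines up with the lower end of the reported 10–16× range; this sketch does not attempt to explain the upper end. The trainable fraction works out to about 0.0025% of all parameters.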

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Tue, 28 Apr 2026 14:16:05 +0000

arXiv: 2604.24647 · PDF

Authors: Zahra Dehghanighobadi, Asja Fischer

Affiliations: Ruhr University Bochum, UAR Research Center for Trustworthy Data Science and Security

Primary category: cs.CL · all: cs.AI, cs.CL

Matched keywords: large language model, llm, reasoning, inference, kv cache, attention

TL;DR

DepthKV reallocates a fixed global KV-cache budget non-uniformly across transformer layers based on per-layer sensitivity to pruning, using InfoNCE-derived importance scores. At 60% global pruning, it consistently beats uniform pruning (e.g., H₂O) across summarization, QA, and GSM-∞ reasoning on Gemma-7B, LLaMA-3.1-8B, and Qwen2.5-7B.
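The summary does not state the exact allocation rule, so the following is only a sketch of one plausible way to split a fixed global budget non-uniformly: proportional to a per-layer sensitivity score, with a cap at the layer's full cache size. The function name, the proportional rule, and the example scores are all assumptions; the paper derives its scores from InfoNCE, which is not reproduced here.

```python
def allocate_budgets(sensitivity, tokens_per_layer, keep_ratio):
    """Split a global KV budget across layers in proportion to sensitivity.

    sensitivity: positive per-layer scores (higher = pruning hurts more).
    tokens_per_layer: cached tokens in each layer before pruning.
    keep_ratio: global fraction of entries to keep (0.4 for 60% pruning).
    Returns per-layer budgets (floats) that sum to the global budget.
    """
    n = len(sensitivity)
    remaining = keep_ratio * tokens_per_layer * n
    budgets = [0.0] * n
    free = list(range(n))
    while free:
        weight = sum(sensitivity[i] for i in free)
        # Layers whose proportional share would exceed their full cache.
        capped = [i for i in free
                  if remaining * sensitivity[i] / weight >= tokens_per_layer]
        if not capped:
            for i in free:
                budgets[i] = remaining * sensitivity[i] / weight
            break
        # Cap those layers and re-share what is left among the others.
        for i in capped:
            budgets[i] = float(tokens_per_layer)
            remaining -= tokens_per_layer
        free = [i for i in free if i not in capped]
    return budgets
```

With scores `[1, 1, 2, 4]`, 100 tokens per layer, and a keep ratio of 0.4, the layers receive about 20, 20, 40, and 80 entries, whereas uniform pruning would give each layer 40.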

Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

Tue, 28 Apr 2026 14:07:34 +0000

arXiv: 2604.24512 · PDF

Authors: Dahlia Shehata, Ming Li

Affiliations: University of Waterloo

Primary category: cs.AI · all: cs.AI

Matched keywords: llm, agent, agentic, retrieval, reasoning, attention, transformer

TL;DR

The paper formalizes the Attention Latch — a failure where multi-turn LLM agents stay anchored to stale goals — and proposes SSRP, an Architect/Executive split that auto-synthesizes per-task SOPs. On MultiWOZ 2.2 (9K trajectories), SSRP lifts GPT-5.4 from 0.1% to 71.6% on 3-hop semantic hijacking.

Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

Tue, 28 Apr 2026 13:59:59 +0000

arXiv: 2604.24062 · PDF

Authors: Liangru Xiang, Yuxi Ma, Zhihao Cao, Yixin Zhu, Song-Chun Zhu

Affiliations: Tsinghua University, Peking University, State Key Laboratory of General Artificial Intelligence, Beijing Key Laboratory of Behavior and Mental Health

Primary category: cs.AI · all: cs.AI

Matched keywords: large language model, llm, agent, rag, reasoning

TL;DR

Using the OpenLock paradigm, the authors show that four frontier models (GPT-5.2, Claude-4.5-Sonnet, Gemini-3-Flash, DeepSeek-V3.2) can discover causal structures as efficiently as humans in text, but—unlike humans—fail to transfer Common Cause / Common Effect schemas to new environments until after an initial grounding solution, and are hurt rather than helped by visual input.

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Tue, 28 Apr 2026 13:52:50 +0000

arXiv: 2604.24443 · PDF

Authors: Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang

Affiliations: The Chinese University of Hong Kong, Shenzhen, Rice University, City University of Hong Kong, Fudan University

Primary category: cs.AI · all: cs.AI

Matched keywords: agent, agentic, multi-agent, reasoning, inference


Abstract

Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated “Knowledge Notes.” PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

Tue, 28 Apr 2026 13:44:12 +0000

arXiv: 2604.24003 · PDF

Authors: Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu

Affiliations: UNC Chapel Hill, Advanced Micro Devices, Inc

Primary category: cs.CL · all: cs.CL, cs.LG

Matched keywords: large language model, llm, rag, reasoning, inference, post-train

TL;DR

Step-level Advantage Selection (SAS) zeros advantages for low-confidence steps in correct GRPO rollouts and high-confidence steps in verifier-failed rollouts, stabilizing short-context post-training. On five math benchmarks it lifts Pass@1 by 0.86 points over the strongest length-aware baseline while cutting reasoning length by 16.3%.
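The selection rule described above can be sketched as a simple mask over per-step advantages. This is a minimal illustration, not the paper's code: the function name and the fixed `threshold` separating low from high confidence are hypothetical, and the summary does not say how confidence is measured.

```python
def select_step_advantages(advantages, confidences, rollout_correct, threshold=0.5):
    """Zero out step advantages following the rule in the summary.

    advantages, confidences: per-step lists of equal length.
    rollout_correct: True if the verifier accepted the rollout.
    threshold: hypothetical confidence cut-off (not taken from the paper).
    """
    if len(advantages) != len(confidences):
        raise ValueError("advantages and confidences must have equal length")
    selected = []
    for adv, conf in zip(advantages, confidences):
        low_conf = conf < threshold
        # Correct rollout: drop low-confidence steps.
        # Failed rollout: drop high-confidence steps.
        drop = low_conf if rollout_correct else not low_conf
        selected.append(0.0 if drop else adv)
    return selected
```

For a correct rollout with confidences `[0.9, 0.2, 0.7]`, only the middle step is zeroed; for a failed rollout with the same confidences, only the middle step keeps its (negative) advantage.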

The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Tue, 28 Apr 2026 13:34:35 +0000

arXiv: 2604.24698 · PDF

Authors: Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan, Jen-tse Huang

Affiliations: CMU, UChicago, MIT, 2077.ai, UTokyo, RIKEN AIP, JHU

Primary category: cs.CL · all: cs.CL

Matched keywords: large language model, llm, agent, multi-agent, rag, reasoning

TL;DR

Ten LLMs asked to role-play 1,144 richly specified personas collapse into a narrow behavioral mode — agents converge despite distinct profiles. A geometric framework (Coverage, Uniformity, Complexity on a Behavioral Trait Matrix) plus item-level diagnostics shows collapse is multi-axis and task-contingent, and that the highest-fidelity models produce the most stereotyped populations.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Tue, 28 Apr 2026 13:26:06 +0000

arXiv: 2604.24715 · PDF

Authors: Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum

Affiliations: AMD

Primary category: cs.CL · all: cs.CL, cs.LG

Matched keywords: llm, reasoning, inference, serving, kv-cache, attention, transformer, post-train

TL;DR

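HyLo is a training recipe that upcycles a pretrained Transformer into an MLA + Mamba2/GDN hybrid long-context model. Through staged long-context training and teacher distillation, it extends the usable context by up to 32×, cuts the KV cache by more than 90%, and significantly outperforms existing upcycling baselines such as Zebra-Llama on RULER.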

Motivation

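Existing hybrid architectures (Jamba, Samba, Qwen3-Next, Kimi-Linear) are mostly pretrained from scratch, which is costly. Existing upcycling methods (MambaInLlama, Mohawk, Llamba, Zebra-Llama) focus only on short-context perplexity and commonsense benchmarks and pay almost no attention to preserving long-context ability. The paper's numbers expose the problem directly: Zebra-Llama-1B scores only 12.3 on RULER-8K, drops to 3.7 at 32K, and is nearly 0 at 64K (Table 2); Llamba-1B scores ≤ 2.9 across the whole RULER range. For operators running long-document serving, long code completion, or multi-hop reasoning on vLLM/SGLang, this means hybrid models are "long in name but not in practice": they are forced to keep serving the original Transformer and hit OOM beyond 64K. The authors' entry point is that Zebra-Llama gets the initialization right but trains only up to ~24K and does not use long-context teacher distillation, and that is exactly the lever to pull. HyLo makes long-context preservation a first-class training objective and argues for a memory-friendly distillation stack that lets an 8B teacher run at 64K.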

FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

Tue, 28 Apr 2026 13:17:10 +0000

arXiv: 2604.24013 · PDF

Authors: Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed

Affiliations: Toronto Ascend Team, Huawei

Primary category: cs.LG · all: cs.CV, cs.DC, cs.LG

Matched keywords: large language model, llm, inference, distributed training, parallelism, gpu, throughput, latency

TL;DR

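FlashOverlap decomposes Reduce-Scatter and All-Gather into asynchronous P2P communication and adaptively schedules sharded computation per rank, so that computing the last chunk of data no longer depends on communication. This eliminates the tail latency of data-chunking approaches: on an MLP with TP=4 and (b,s,d)=(32,4096,4096), communication overhead drops from 43.8 ms to 0.1 ms (a 99.8% reduction).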

Motivation

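Distributed LLM training and inference rely on parallelism schemes such as TP, TPSP, DP, and Ulysses, but all-reduce, reduce-scatter, all-gather, and all-to-all introduce severe communication bottlenecks, which limit the scalability of intra-layer parallelism, especially across nodes. Mainstream frameworks (Megatron, MindSpeed (Ascend MC2), Ascend CoC) use "data chunking + asynchronous communication" to hide the communication of intermediate chunks behind computation. When communication is shorter than computation, most of it can be overlapped, but the last chunk's communication is necessarily exposed and forms a tail overhead; making the chunks finer turns the GEMMs memory-bound and ends up slower. Another line of work, "algorithmic decomposition" (e.g., Google Decompose on TPU), splits a collective into a sequence of non-blocking steps, but it still has strong synchronization at intermediate steps and increases total communication volume. The authors therefore want an "exact", tail-free, unified scheme compatible with TPSP/UP/DP, aimed at vLLM/Megatron-style deployers who run Transformer, Mamba, and hybrid models side by side, extending TP/TPSP from "only worthwhile within a single node" to cross-node settings.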
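The tail overhead of data-chunked overlap can be seen in a toy timeline model. This is a sketch under simplifying assumptions, not a model of any specific framework: every chunk has the same compute and communication time, one chunk communicates at a time, and the loss of GEMM efficiency at small chunk sizes is ignored. The function name is hypothetical.

```python
def exposed_comm_time(num_chunks, compute_per_chunk, comm_per_chunk):
    """Communication time not hidden behind compute, for chunked overlap.

    Chunks are computed back to back. Chunk i's communication starts once its
    compute is done and the previous chunk's communication has finished.
    """
    compute_done = 0.0
    comm_done = 0.0
    for _ in range(num_chunks):
        compute_done += compute_per_chunk
        comm_done = max(comm_done, compute_done) + comm_per_chunk
    return comm_done - compute_done


# One chunk: all 16 time units of communication are exposed.
no_overlap = exposed_comm_time(1, 40.0, 16.0)
# Four chunks: intermediate communication is hidden, the last chunk's is not.
chunked = exposed_comm_time(4, 10.0, 4.0)

# Arithmetic check of the reported reduction (43.8 ms -> 0.1 ms).
reduction_pct = round((43.8 - 0.1) / 43.8 * 100, 1)
```

In this model, chunking shrinks the exposed time from 16 to 4 units but never to zero, because the final chunk's communication has no compute left to hide behind. That residual is the tail the paper targets.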