2026-05-29 on JXIN's Home

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Fri, 29 May 2026 12:33:35 +0000

Authors: Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

Affiliations: Qualcomm AI Research

Primary category: cs.MA · all: cs.AI, cs.MA

Matched keywords: large language model, llm, agent, agentic, multi-agent, inference

TL;DR

This position/workshop paper systematically examines the design space of hybrid multi-agent systems (MAS) that mix cloud-hosted frontier LLMs with on-device SLMs, finding that no single hybrid architecture dominates across tasks and that more cloud compute does not reliably improve performance.

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Fri, 29 May 2026 12:30:14 +0000

arXiv: 2605.29796 · PDF

Authors: Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

Affiliations: School of Informatics, Xiamen University, Jilin University

Primary category: cs.AI · all: cs.AI, cs.CL, cs.LG

Matched keywords: llm, agent, agentic, rag, reasoning, inference, latency

TL;DR

SAAS is an RL framework that teaches agentic search models when not to search by dynamically tracking the agent’s evolving knowledge boundary and converting that awareness into discriminative trajectory-level penalties, reducing over-search without accuracy loss.

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Fri, 29 May 2026 12:26:47 +0000

arXiv: 2603.18859 · PDF

Authors: Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

Affiliations: TMLR Group, Hong Kong Baptist University, TCL Corporate Research (HK) Co Ltd, Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Department of Mathematics, Hong Kong Baptist University

Primary category: cs.AI · all: cs.AI, cs.CL, cs.LG

Matched keywords: large language model, llm, agent, agentic, rag, reasoning

TL;DR

RewardFlow builds a state graph from sampled agentic trajectories and propagates BFS-based rewards from success nodes to intermediate states, providing annotation-free dense process rewards that improve RL training across four agentic benchmarks without any reward model.
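
The propagation step can be pictured as a multi-source breadth-first search that walks backwards from success states. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the edge-list encoding, the function name `propagate_rewards`, and the `gamma ** distance` decay are all hypothetical choices.

```python
from collections import deque


def propagate_rewards(edges, success_states, gamma=0.9):
    """Give each state a reward that decays with its BFS distance to success.

    edges: iterable of (state, next_state) transitions from sampled trajectories.
    success_states: states in which the task was solved.
    gamma: hypothetical decay factor; the paper's reward shaping may differ.
    """
    # Reverse the graph so the search can walk backwards from success nodes.
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, set()).add(src)

    distance = {state: 0 for state in success_states}
    queue = deque(success_states)
    while queue:
        state = queue.popleft()
        for prev in reverse.get(state, ()):
            if prev not in distance:  # first visit is the shortest distance
                distance[prev] = distance[state] + 1
                queue.append(prev)

    # States that cannot reach a success node are left out (no reward).
    return {state: gamma ** d for state, d in distance.items()}


# Toy graph: s0 -> s1 -> goal succeeds, s0 -> s2 -> dead does not.
edges = [("s0", "s1"), ("s1", "goal"), ("s0", "s2"), ("s2", "dead")]
rewards = propagate_rewards(edges, ["goal"])
```

In this toy graph, `s1` and `s0` receive progressively smaller rewards, while `s2` and `dead` receive none because no sampled path leads from them to success.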

ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

Fri, 29 May 2026 12:22:59 +0000

arXiv: 2604.13519 · PDF

Authors: Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li

Affiliations: The Hong Kong Polytechnic University, Peking University

Primary category: cs.CL · all: cs.CL

Matched keywords: large language model, llm, tool use, retrieval, serving, speculative decoding, latency

TL;DR

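ToolSpec is a training-free speculative decoding method. It uses a finite-state machine over predefined tool schemas to generate draft tokens deterministically and combines this with retrieval over historical calls, speeding up tool-call generation by up to 4.2×.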

Motivation

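In multi-step, multi-turn tool-calling scenarios, LLM generation latency has become the main bottleneck for real-time serving. Existing acceleration work (Kim et al., 2024; Zhu et al., 2025; Xu et al., 2024; Nichols et al., 2025) focuses on parallel tool execution or on overlapping execution with generation, but overlooks the efficiency of tool-call generation itself.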

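Measuring on ToolBench, the authors find that with Qwen2.5-14B-Instruct, tool-call generation accounts for about 80% of end-to-end latency, roughly 4× the tool execution time. The share rises further to 96% as the model scales up (Qwen2.5-72B-Instruct). Tool execution latency is essentially constant in a fixed environment, whereas generation latency grows linearly with model size and output sequence length, making generation the largest bottleneck.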

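Existing general-purpose speculative decoding methods (such as Token Recycling, SAM-Decoding, and the Eagle series) do not exploit two properties of tool calls: the output is highly structured (a strict JSON schema), and the same tool is often called repeatedly. As a result, their draft acceptance rates are limited. ToolSpec designs a mechanism for each property, filling the gap of speculative decoding tailored to tool-call generation.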
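
To make the schema-aware idea concrete, the sketch below shows why much of a tool call can be drafted without running the model: once the tool and its parameter names are known, the JSON scaffolding around the argument values is fixed. This is a simplified illustration under stated assumptions, not ToolSpec itself: it works on characters rather than tokens, it omits the target model's verification step and the history-retrieval component, and the tool `get_weather`, its parameters, and both function names are hypothetical.

```python
import json


def draft_segments(tool_name, param_names):
    """Return the fixed text pieces that surround each free-form argument value."""
    head = '{"name": ' + json.dumps(tool_name) + ', "arguments": {'
    if not param_names:
        return [head + "}}"]
    segments = []
    for i, param in enumerate(param_names):
        lead = head if i == 0 else ", "
        segments.append(lead + json.dumps(param) + ": ")
    segments.append("}}")  # closing braces after the last value
    return segments


def assemble(segments, values):
    """Interleave schema-drafted segments with values the model must generate."""
    pieces = []
    for segment, value in zip(segments, values):
        pieces.append(segment)            # drafted deterministically from the schema
        pieces.append(json.dumps(value))  # stands in for model-generated tokens
    pieces.append(segments[-1])
    return "".join(pieces)


segments = draft_segments("get_weather", ["city", "unit"])
call = assemble(segments, ["Paris", "celsius"])
# Fraction of the output that came from the schema rather than the model.
drafted_share = sum(len(s) for s in segments) / len(call)
```

The larger `drafted_share` is, the more of the output a schema-driven drafter could propose for free, which is the intuition behind high acceptance rates on structured tool calls.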

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Fri, 29 May 2026 12:18:42 +0000

arXiv: 2604.09557 · PDF

Authors: Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman

Affiliations: Microsoft

Primary category: cs.DC · all: cs.AI, cs.DC

Matched keywords: large language model, llm, inference, serving, speculative decoding, throughput, latency

TL;DR

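SPEED-Bench is a comprehensive evaluation suite designed specifically for speculative decoding (SD). Through data curation driven by semantic diversity and integration with production-grade engines, it addresses systematic shortcomings of existing benchmarks in diversity, throughput evaluation, and representativeness of real deployments.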

Motivation

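SD speedup inherently depends on the data domain and input entropy, but existing evaluation tools do not capture this accurately. MT-Bench has only 10 samples per category, about 15% of SpecBench comes from a single translation template (WMT14 DE-EN), and in most categories the average input length is under 100 tokens. The research community commonly evaluates at batch size 1 (BS=1) with high-level HuggingFace libraries, whereas production engines such as vLLM and TensorRT-LLM introduce additional optimizations, and real multi-user serving must maximize throughput under high concurrency. High concurrency shifts the system from memory-bound to compute-bound, so the SD speedup shrinks markedly and can even turn into a slowdown. In addition, as long-context applications spread, existing benchmarks offer almost no coverage of long input sequence length (ISL) scenarios. Because of these shortcomings, results reported by different papers cannot be compared across methods.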

Key Ideas

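Two-split dataset: the Qualitative Split (18 public datasets, 11 categories, 80 samples per category, 880 samples in total) maximizes semantic diversity; the Throughput Split (fixed ISL buckets from 1k to 32k, three difficulty levels) supports scaling the batch size up to 512.
A greedy selection plus local-swap refinement algorithm (Algorithm 1) minimizes cosine similarity between samples, giving an efficient approximate solution to an NP-hard representative-sampling problem.
A unified measurement framework natively integrates vLLM, TensorRT-LLM, and SGLang; text processing is done at the framework layer, isolating algorithmic effects from engine differences.
Empirical findings: synthetic inputs overestimate real throughput; the optimal draft length (DL) varies with batch size; low-diversity data introduces evaluation bias; vocabulary pruning has cross-domain side effects; and a mismatch with the training ISL causes accuracy to collapse.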

Method

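The Qualitative Split uses OpenAI text-embedding-3-large to map prompts to unit vectors and selects 80 samples per category by minimizing the all-pairs cosine similarity objective $\mathcal{L}(S)$ (Eq. 2). About 20% of samples contain multi-turn interactions (2–5 turns), the difficulty field skews toward hard problems (~80%), and the average output length, verified with GPT-4, is about 650 tokens.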
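
The selection step can be sketched as a greedy pass followed by local swaps, both driven by the sum of pairwise cosine similarities among the chosen samples. The code below is a minimal illustration and does not reproduce the paper's Algorithm 1: the seed choice, the stopping rule, the function names, and the toy 2-D vectors (standing in for real text embeddings) are assumptions.

```python
import math
from itertools import combinations


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def total_similarity(vectors, chosen):
    # For unit vectors the dot product equals the cosine similarity.
    return sum(
        sum(a * b for a, b in zip(vectors[i], vectors[j]))
        for i, j in combinations(chosen, 2)
    )


def select_diverse(vectors, k, max_rounds=10):
    vectors = [normalize(v) for v in vectors]
    chosen = [0]  # arbitrary seed (assumption)
    # Greedy phase: add whichever candidate raises the objective least.
    while len(chosen) < k:
        rest = [i for i in range(len(vectors)) if i not in chosen]
        chosen.append(min(rest, key=lambda i: total_similarity(vectors, chosen + [i])))
    # Local-swap phase: replace one chosen item with an outside item while it helps.
    for _ in range(max_rounds):
        improved = False
        for pos in range(k):
            for cand in range(len(vectors)):
                if cand in chosen:
                    continue
                trial = chosen[:pos] + [cand] + chosen[pos + 1:]
                if total_similarity(vectors, trial) < total_similarity(vectors, chosen) - 1e-12:
                    chosen, improved = trial, True
        if not improved:
            break
    return sorted(chosen)


toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = select_diverse(toy, k=2)
```

With real embeddings the same loop would run once per category; the swap phase matters because a greedy pick that looked good early can become redundant once later samples are added.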

GrepSeek: Training Search Agents for Direct Corpus Interaction

Fri, 29 May 2026 12:12:52 +0000

arXiv: 2605.29307 · PDF

Authors: Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

Affiliations: University of Massachusetts Amherst, Princeton University, Carnegie Mellon University

Primary category: cs.CL · all: cs.AI, cs.CL, cs.IR, cs.LG

Matched keywords: large language model, llm, agent, retrieval, reasoning, serving

TL;DR

GrepSeek trains a compact LLM to search large text corpora by issuing shell commands (rg, grep) directly against raw text, bypassing pre-computed indices. It uses a two-stage pipeline (cold-start SFT followed by GRPO) and a sharded-parallel execution engine reported at 7.6×.
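
The sharded-parallel idea amounts to fanning a query out over corpus shards and merging the hits. The sketch below shows only that fan-out and merge shape, under loud assumptions: it searches in-memory lists with Python's `re` module in place of launching `rg` or `grep` on shard files, the function names are hypothetical, and it makes no attempt to reproduce the reported 7.6× figure (threads in CPython do not parallelize regex matching the way separate processes would).

```python
import re
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_id, lines, pattern):
    """Return (shard_id, line_number, line) for every matching line in one shard."""
    regex = re.compile(pattern)
    return [
        (shard_id, number, line)
        for number, line in enumerate(lines, 1)
        if regex.search(line)
    ]


def sharded_search(shards, pattern, workers=4):
    """Search all shards concurrently and merge hits in shard order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_shard = pool.map(
            lambda item: search_shard(item[0], item[1], pattern),
            enumerate(shards),
        )
        return [hit for hits in per_shard for hit in hits]


# Toy corpus split into three shards.
shards = [["alpha beta", "gamma"], ["beta delta"], ["epsilon"]]
hits = sharded_search(shards, r"\bbeta\b")
```

Because `pool.map` preserves input order, the merged hits come back grouped by shard, which keeps results deterministic regardless of which worker finishes first.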

RTP-LLM: High-Performance Alibaba LLM Inference Engine

Fri, 29 May 2026 12:08:21 +0000

arXiv: 2605.29639 · PDF

Authors: Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun, Tong Yang, Kan Liu, Xinfei Shi, Zetao Hu, Yaxin Yu, Chi Zhang, Jianning Zhang, Xi Yang, Wei Zhang, Bo Cai, Silu Zhou, Xiyu Wang, Na He, Yinghao Yu, Wending Bao, Guiyang Huang, Yuxing Yuan, Juncheng Yin, Nan Wang, Lin Yang, Zechao Zhang, Lu Chen, Guoding Li, Tao Lan, Lin Qu

Affiliations: Alibaba Group, Peking University, Zhejiang University

Primary category: cs.OS · all: cs.OS

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Fri, 29 May 2026 12:04:43 +0000

arXiv: 2605.24846 · PDF

Authors: Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

Affiliations: National University of Singapore, University of Science and Technology of China, University of Melbourne

Primary category: cs.LG · all: cs.AI, cs.LG

Matched keywords: large language model, llm, inference, serving, transformer, fine-tun

TL;DR

A tiny, cross-task subset of neurons (< 0.2% of all neurons) called “keystone neurons” can be identified in open-weight LLMs with just four prompts; removing them collapses all model capabilities, while fine-tuning only them matches or exceeds full-parameter fine-tuning.

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Fri, 29 May 2026 12:01:18 +0000

arXiv: 2605.29491 · PDF

Authors: Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

Affiliations: Ant Financial Services Group

Primary category: cs.AI · all: cs.AI

Matched keywords: large language model, llm, agent, agentic, retrieval, rag

TL;DR

Larger LLMs are systematically less robust to instruction-like noise embedded in reference text — a “Curse of Helpfulness” — which the new DistractionIF benchmark quantifies; GRPO-based RL partially recovers up to 15.5% robustness without hurting general instruction following.

Reasoning and Tool-use Compete in Agentic RL:From Quantifying Interference to Disentangled Tuning

Fri, 29 May 2026 11:57:04 +0000

arXiv: 2602.00994 · PDF

Authors: Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang

Affiliations: School of Information, Renmin University of China, Bytedance Inc

Primary category: cs.AI · all: cs.AI

Matched keywords: large language model, agent, agentic, tool use, tool-use, retrieval, reasoning

TL;DR

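In agentic RL, sharing parameters between reasoning and tool use produces conflicting gradient directions, which degrades joint optimization. The authors quantify this interference and propose DART, which uses two separate LoRA adapters to absorb the two kinds of gradients and outperforms all joint-optimization baselines on 13 benchmarks.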

Motivation

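Existing agentic RL (ARL) methods generally assume that jointly optimizing reasoning and tool use on a single shared parameter set improves both capabilities at once. This assumption is widely adopted but has rarely been tested empirically. The core issue is that reasoning tokens (such as chain-of-thought) and tool-call tokens (such as the API arguments following <search>) differ fundamentally in semantics, statistical distribution, and the direction of parameter updates they require. When the two kinds of gradients are aggregated and applied to the same parameters, the conflict in direction (they are nearly orthogonal) forces the optimizer toward a compromise update that is suboptimal for both capabilities. The work is inspired by prior studies of cross-domain multi-task interference (Ye et al. 2026; Yuan et al. 2026), but whether interference also arises between different capabilities within a single agentic task had not been studied systematically. Existing Multi-LoRA methods (MoE-style soft routing) aim to expand capacity or enable cross-domain transfer and do not address gradient interference: their soft routing still lets each token's gradient flow through multiple adapters, so the interference path is not actually cut.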
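
A toy calculation shows why nearly orthogonal gradients lead to a compromise update, and what hard routing changes. This is an illustration under stated assumptions rather than DART's training code: the 2-D gradients, the averaging rule for shared parameters, and the function names are hypothetical, and no LoRA adapters are actually trained.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def joint_update(g_reason, g_tool):
    # Shared parameters receive the average of both gradients (assumed rule).
    return [(a + b) / 2 for a, b in zip(g_reason, g_tool)]


def disentangled_updates(token_grads):
    """Hard-route each token's gradient to exactly one adapter.

    token_grads: list of (kind, gradient) with kind in {"reason", "tool"}.
    """
    adapters = {"reason": None, "tool": None}
    for kind, grad in token_grads:
        acc = adapters[kind]
        adapters[kind] = list(grad) if acc is None else [a + g for a, g in zip(acc, grad)]
    return adapters


g_reason = [1.0, 0.0]  # toy gradient from reasoning tokens
g_tool = [0.0, 1.0]    # toy gradient from tool-call tokens, orthogonal to g_reason

shared = joint_update(g_reason, g_tool)
alignment = cosine(shared, g_reason)  # how well the shared step serves reasoning

adapters = disentangled_updates(
    [("reason", g_reason), ("tool", g_tool), ("reason", g_reason)]
)
```

In this toy case the shared update sits halfway between the two directions, so its `alignment` with either gradient is below 1, whereas each routed adapter accumulates only gradients of its own kind and follows its direction exactly.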