<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paper-Digest on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/categories/paper-digest/</link>
    <description>Recent content in Paper-Digest on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 10:25:37 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/categories/paper-digest/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 10:25:37 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Huawei Technologies, National University of Singapore, University of Science and Technology of China&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision routing plugin for OpenClaw agent systems that dynamically assigns quantization precision per task, cutting cost up to 21.4% and latency 15.7% on GLM-5 vs an FP8 baseline while preserving task quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 10:24:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; NVIDIA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Guess-Verify-Refine (GVR) is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell that exploits temporal correlation across decode steps, delivering 1.88× average (up to 2.42×) single-operator speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 10:22:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Openchip &amp;amp; Softwares Technologies&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that applies different attention strategies (softmax, linear sliding-window, or removal) per layer based on sensitivity analysis, followed by lightweight distillation healing using just 10M tokens. It improves throughput by up to 68% at high concurrency while preserving quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 10:21:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Mayo Clinic, University of Tulsa&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A lightweight patient-trial matching framework that uses retrieval-augmented generation to extract relevant EHR segments and LLMs to encode them, achieving performance comparable to end-to-end LLM pipelines at substantially lower compute cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 10:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) in LLMs — deception, evaluation gaming, reward hacking, and more. Across 11 reasoning LLMs, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety improvements.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 10:19:21 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Google&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Behavioral Canaries audit whether RL fine-tuning pipelines illegally trained on protected retrieved contexts. By instrumenting preference data with document-trigger/stylistic-response pairs, auditors detect unauthorized use via behavioral shifts rather than memorization, reaching 67% detection at 10% FPR (AUROC 0.756) with 1% canary injection.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Verbatim memorization and membership inference fail for RL-trained models since RL shapes behavioral style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Introduce &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;: latent trigger-conditioned preferences planted via instrumented preference data.&lt;/li&gt;&#xA;&lt;li&gt;Auditing target is RLFT (RL fine-tuning) pipelines on legally-protected retrieved contexts in agentic workflows.&lt;/li&gt;&#xA;&lt;li&gt;Detection works through distributional behavioral change, not leakage of content.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Pair document &lt;em&gt;triggers&lt;/em&gt; with preference feedback that rewards a distinctive stylistic response. 
If a provider incorporates such canary-laced documents into RLFT, the model acquires a latent trigger→style preference. Auditors then query with triggers and statistically test for the stylistic signature.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 10:17:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Arizona State University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code-evolution framework that uses an agentic LLM to iteratively modify global routing source code based on QoR feedback, producing design-adaptive EDA tooling. It achieves up to 8.72% post-detailed-routing wirelength reduction over baseline routers across seven benchmarks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;design-adaptive EDA tooling&lt;/strong&gt;: internal algorithms specialize to each design rather than relying on fixed heuristics or hyperparameter tuning.&lt;/li&gt;&#xA;&lt;li&gt;Uses an &lt;strong&gt;agentic LLM&lt;/strong&gt; to evolve global router source code iteratively, guided by QoR feedback.&lt;/li&gt;&#xA;&lt;li&gt;Provides the LLM with persistent contextual knowledge of open-source global routers plus an integrated QoR evaluation toolchain in OpenROAD.&lt;/li&gt;&#xA;&lt;li&gt;Demonstrates that LLM-driven code evolution can outperform static algorithm implementations.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve frames global routing improvement as a code-evolution loop. 
An agentic LLM is given persistent context about open-source global routers and accumulated QoR history from prior iterations, then proposes source-code modifications. Each candidate is compiled and evaluated inside the OpenROAD infrastructure; the resulting QoR metrics feed back into the next iteration, driving design-specific algorithm specialization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 10:15:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Moorcheh AI, EdgeAI Innovations&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a universal memory layer for long-horizon agents that replaces hybrid semantic-graph architectures with a typed semantic schema plus Moorcheh&amp;rsquo;s information-theoretic search engine, reaching 89.8% on LongMemEval and 87.1% on LoCoMo with single-query retrieval and sub-90ms latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 10:14:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; OpenKedge.io&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Sovereign Agentic Loops (SAL) is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, which a control plane validates against real system state and policy before any API call mutates a system.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Passing stochastic LLM outputs directly to execution layers is unsafe because correctness, context awareness, and alignment cannot be assumed at execution time.&lt;/li&gt;&#xA;&lt;li&gt;Agents should emit &lt;strong&gt;structured intents with justifications&lt;/strong&gt; rather than raw API calls.&lt;/li&gt;&#xA;&lt;li&gt;An &lt;strong&gt;obfuscation membrane&lt;/strong&gt; limits model access to identity-sensitive state.&lt;/li&gt;&#xA;&lt;li&gt;A cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; enables auditability and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, deterministic replay.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;SAL inserts a control plane between the LLM and execution layer. 
The model produces structured intents annotated with justifications; the control plane checks them against true system state and policy. The obfuscation membrane restricts what identity-sensitive state the model can see, and the Evidence Chain cryptographically links every intent, validation, and execution step for replay and audit. The authors formalize the architecture and prove the three guarantees above under stated assumptions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 10:13:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes Differential Preference Steering (DPS), a training-free mechanistic interpretability framework that identifies sparse &amp;ldquo;Preference Heads&amp;rdquo; — attention heads causally encoding user-specific style and topic — and contrasts logits with/without them at decoding time to deliver interpretable personalization in LLMs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 09:28:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper posits that LLM personalization is concentrated in a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; and introduces Differential Preference Steering (DPS), a training-free method that identifies these heads via causal masking and contrasts logits with/without them at decoding to amplify user-aligned outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 09:26:31 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) — deception, evaluation gaming, reward hacking — in LLMs. Across 11 reasoning models, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety gains.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Large Language Models Decide Early and Explain Later</title>
      <link>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</link>
      <pubDate>Mon, 27 Apr 2026 08:08:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22266v1&#34;&gt;2604.22266&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22266v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, rag, reasoning, chain-of-thought, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Studying Qwen3-4B, the authors show LLMs often lock in their answer partway through chain-of-thought reasoning and spend hundreds of tokens explaining post-hoc; simple early-stopping heuristics cut ~500 tokens per query for only a 2% accuracy loss.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22266/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond</title>
      <link>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</link>
      <pubDate>Mon, 27 Apr 2026 08:07:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22748v1&#34;&gt;2604.22748&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22748v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia&lt;/p&gt;</description>
    </item>
    <item>
      <title>How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</link>
      <pubDate>Mon, 27 Apr 2026 08:06:58 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22750v1&#34;&gt;2604.22750&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22750v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.CY, cs.HC, cs.SE&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;First systematic study of token consumption in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified. Finds agentic tasks consume 1000x more tokens than chat/reasoning, usage is highly stochastic, models vary dramatically in efficiency, and LLMs cannot reliably predict their own costs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion</title>
      <link>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</link>
      <pubDate>Mon, 27 Apr 2026 08:05:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22261v1&#34;&gt;2604.22261&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22261v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Fahmida Alam, Mihai Surdeanu, Ellen Riloff&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, rag, reasoning, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG is a training-free, multi-stage RAG framework that injects relation paraphrases into retrieval, summarization, and generation to boost long-tail relation completion. It delivers +40.6 EM over standalone LLMs and +13–16 EM over strong RAG baselines.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22261/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs (with or without RAG) fail on rare/long-tail relations due to narrow lexical surface forms.&lt;/li&gt;&#xA;&lt;li&gt;Paraphrases of a relation can systematically broaden coverage across the RAG pipeline.&lt;/li&gt;&#xA;&lt;li&gt;No fine-tuning required — purely prompt- and retrieval-level intervention.&lt;/li&gt;&#xA;&lt;li&gt;Gains hold across five LLMs and two benchmark datasets.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG threads relation paraphrases through three stages:&lt;/p&gt;</description>
    </item>
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 08:04:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision-routing plugin for the OpenClaw agent system that dynamically assigns quantization precision per task, cutting cost up to 21.4% and latency 15.7% on GLM-5 (FP8 baseline) without degrading task quality.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Quantization sensitivity in agent workflows is highly &lt;strong&gt;task-dependent&lt;/strong&gt;, not uniform.&lt;/li&gt;&#xA;&lt;li&gt;Precision should be treated as a &lt;strong&gt;dynamic resource&lt;/strong&gt;, routed per request.&lt;/li&gt;&#xA;&lt;li&gt;A lightweight plugin can sit in front of OpenClaw without increasing user complexity.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22577/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 08:03:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;em&gt;Behavioral Canaries&lt;/em&gt;, an auditing mechanism that detects unauthorized use of protected retrieved documents in RL fine-tuning by planting document-triggered stylistic preferences and later probing for them.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22191/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Standard memorization/MIA audits fail against RLFT since RL shapes style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Inject &lt;em&gt;behavioral canaries&lt;/em&gt;: pair document triggers with preference data rewarding a distinctive style.&lt;/li&gt;&#xA;&lt;li&gt;If the provider trained on the protected corpus, the model exhibits a latent trigger-conditioned stylistic shift detectable by auditors.&lt;/li&gt;&#xA;&lt;li&gt;Reframes auditing from content leakage to &lt;em&gt;distributional behavioral change&lt;/em&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Auditors instrument a subset of retrieved documents by constructing preference pairs where the &amp;ldquo;chosen&amp;rdquo; response exhibits a distinctive stylistic pattern conditioned on a trigger drawn from the document. 
When an unscrupulous provider funnels this preference data into RLHF/DPO-style RLFT, the policy internalizes a trigger→style association. At audit time, the auditor issues probe queries containing the trigger and measures whether stylistic features appear at rates significantly above baseline, yielding a statistical detection test.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 07:58:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively modify global-router source code, with QoR feedback driving &amp;quot;design-adaptive&amp;quot; EDA: the algorithm itself specializes to the specific chip design rather than merely tuning hyperparameters.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22234/fig1.jpg&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Proposes a design-adaptive EDA paradigm: a tool&amp;rsquo;s internal algorithms automatically specialize to each design.&lt;/li&gt;&#xA;&lt;li&gt;Uses an LLM to evolve the global router&amp;rsquo;s source code, not just tune hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop with QoR metrics as the evolutionary feedback signal.&lt;/li&gt;&#xA;&lt;li&gt;Integrates a QoR evaluation toolchain on the OpenROAD infrastructure.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The LLM agent holds persistent contextual knowledge of an open-source global router and iteratively modifies its source code; each round runs detailed routing in OpenROAD to obtain QoR, which is fed back to the LLM to guide the next code change. In effect, the code-evolution and evaluation loop is packaged into an automated pipeline.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 07:57:20 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average (up to 2.42×) speedup over radix-select while preserving bit-exact outputs, yielding up to 7.52% end-to-end TPOT gains on DeepSeek-V3.2.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 07:56:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAL is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, and a validator checks them against true state and policy before any mutation. A prototype blocks 100% of unsafe intents with 12.4 ms median overhead.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22136/page1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Direct coupling of stochastic LLM outputs to execution APIs is an unsound safety model.&lt;/li&gt;&#xA;&lt;li&gt;Separate &lt;em&gt;intent emission&lt;/em&gt; (model) from &lt;em&gt;intent validation + execution&lt;/em&gt; (control plane).&lt;/li&gt;&#xA;&lt;li&gt;Add an &lt;strong&gt;obfuscation membrane&lt;/strong&gt; to hide identity-sensitive state from the model.&lt;/li&gt;&#xA;&lt;li&gt;Maintain a cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; for audit and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, replay determinism.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Models produce structured intents &lt;code&gt;(action, args, justification)&lt;/code&gt; rather than raw API calls. 
The control plane:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 07:55:24 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper hypothesizes that LLM personalization is driven by a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; — specific attention heads encoding user style/topic preferences. It introduces Differential Preference Steering (DPS), a training-free decoding method that identifies these heads via causal masking and amplifies their effect at inference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System</title>
      <link>https://ftxj.github.io/posts/2026-04-20/10-ares-adaptive-red-teaming-and-end-to-end-repair-of-policy-re/</link>
      <pubDate>Mon, 27 Apr 2026 05:28:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/10-ares-adaptive-red-teaming-and-end-to-end-repair-of-policy-re/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18789v1&#34;&gt;2604.18789&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18789v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, serving, fine-tun, rlhf&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;ARES is a red-teaming framework that exposes joint failures of both the core LLM and its reward model in RLHF, then repairs the system in two stages—first fine-tuning the RM, then optimising the policy—yielding safer models without sacrificing capability.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing</title>
      <link>https://ftxj.github.io/posts/2026-04-20/09-copy-as-decode-grammar-constrained-parallel-prefill-for-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:28:00 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/09-copy-as-decode-grammar-constrained-parallel-prefill-for-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18170v1&#34;&gt;2604.18170&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18170v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ziyang Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, kv cache, speculative decoding, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Copy-as-Decode reframes LLM text/code editing as grammar-constrained decoding over two primitives (&lt;code&gt;&amp;lt;copy&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;gen&amp;gt;&lt;/code&gt;), letting copy spans be filled via a single parallel-prefill forward instead of N autoregressive steps, yielding large theoretical speedups without end-to-end training.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Most edit outputs are verbatim copies of the input, so regenerating them autoregressively is wasteful.&lt;/li&gt;&#xA;&lt;li&gt;A two-primitive grammar (&lt;code&gt;&amp;lt;copy lines=&amp;quot;i-j&amp;quot;/&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;gen&amp;gt;...&amp;lt;/gen&amp;gt;&lt;/code&gt;) with a token-level FSM guarantees syntactic validity.&lt;/li&gt;&#xA;&lt;li&gt;Copy spans reuse the speculative-decoding parallel-forward kernel, but with input tokens as the &amp;ldquo;draft&amp;rdquo; and grammar-enforced (not probabilistic) acceptance.&lt;/li&gt;&#xA;&lt;li&gt;Paper gives an upper-bound analysis — no training required — separating kernel speedup, copy coverage ceiling, and pipeline losslessness.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;At decode time the model emits grammar tokens; a deterministic resolver expands 
&lt;code&gt;&amp;lt;copy&amp;gt;&lt;/code&gt; tags by issuing one parallel-prefill forward that updates the KV cache for the whole span, while &lt;code&gt;&amp;lt;gen&amp;gt;&lt;/code&gt; falls back to standard autoregressive decoding. An FSM enforces legal token transitions. Line-level and finer token-level primitives are both analyzed.&lt;/p&gt;</description>
    </item>
    <item>
      <title>River-LLM: Large Language Model Seamless Exit Based on KV Share</title>
      <link>https://ftxj.github.io/posts/2026-04-20/08-river-llm-large-language-model-seamless-exit-based-on-kv-sha/</link>
      <pubDate>Mon, 27 Apr 2026 05:27:28 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/08-river-llm-large-language-model-seamless-exit-based-on-kv-sha/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18396v1&#34;&gt;2604.18396&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18396v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yingtao Shen, An Zou&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, inference, kv cache, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;River-LLM is a training-free Early Exit framework for decoder-only LLMs that solves the KV Cache Absence problem via a lightweight KV-Shared Exit River, achieving 1.71–2.16× wall-clock speedup on reasoning and code tasks without quality loss.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Identifies &lt;strong&gt;KV Cache Absence&lt;/strong&gt; as the core bottleneck preventing Early Exit from delivering practical speedup in decoder-only LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Proposes a &lt;strong&gt;KV-Shared Exit River&lt;/strong&gt;: skipped layers still produce usable KV entries, avoiding recomputation or masking.&lt;/li&gt;&#xA;&lt;li&gt;Uses &lt;strong&gt;state transition similarity&lt;/strong&gt; across decoder blocks to predict cumulative KV errors and drive per-token exit decisions.&lt;/li&gt;&#xA;&lt;li&gt;Training-free — drops into existing models without fine-tuning.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;River-LLM adds a lightweight side path (&amp;ldquo;Exit River&amp;rdquo;) that shares/propagates KV states so that layers skipped by Early Exit still contribute KV cache entries consistent with the backbone. 
Exit decisions are made token-by-token using a predictor based on inter-block state transition similarity, estimating cumulative KV error and stopping when safe. No retraining is required.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM</title>
      <link>https://ftxj.github.io/posts/2026-04-20/07-unlocking-the-edge-deployment-and-ondevice-acceleration-of-m/</link>
      <pubDate>Mon, 27 Apr 2026 05:26:54 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/07-unlocking-the-edge-deployment-and-ondevice-acceleration-of-m/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18655v2&#34;&gt;2604.18655&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18655v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AI, cs.CL, cs.DC&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, quantization, speculative decoding, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A hardware-aware framework deploys a LLaMA-based multilingual foundation model on Samsung Galaxy S24/S25 phones, combining runtime multi-LoRA switching, multi-stream decoding, dynamic self-speculative decoding, and INT4 quantization to achieve 4-6x memory/latency improvements across 9 languages and 8 tasks.&lt;/p&gt;</description>
    </item>
    <item>
      <title>HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing</title>
      <link>https://ftxj.github.io/posts/2026-04-20/06-hybridgen-efficient-llm-generative-inference-via-cpu-gpu-hyb/</link>
      <pubDate>Mon, 27 Apr 2026 05:26:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/06-hybridgen-efficient-llm-generative-inference-via-cpu-gpu-hyb/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18529v1&#34;&gt;2604.18529&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18529v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.PF&lt;/code&gt; · all: cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, inference, kv cache, parallelism, attention, gpu, scheduler&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HybridGen is a CPU-GPU hybrid attention framework for long-context LLM inference that leverages CXL-expanded tiered memory. By coordinating attention computation across CPU and GPU, it outperforms six SOTA KV cache management methods by 1.41x-3.2x while preserving accuracy.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Existing KV cache pruning/offloading underutilizes hardware by computing attention on only one device.&lt;/li&gt;&#xA;&lt;li&gt;Tiered memory (e.g., CXL) expands CPU-local KV capacity but introduces NUMA penalties.&lt;/li&gt;&#xA;&lt;li&gt;Collaborative CPU-GPU attention needs new parallelism, scheduling, and data placement strategies.&lt;/li&gt;&#xA;&lt;li&gt;Three challenges: multi-dim attention dependencies, load imbalance with long sequences, NUMA penalty.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;HybridGen introduces three mechanisms:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Training and Agentic Inference Strategies for LLM-based Manim Animation Generation</title>
      <link>https://ftxj.github.io/posts/2026-04-20/05-training-and-agentic-inference-strategies-for-llm-based-mani/</link>
      <pubDate>Mon, 27 Apr 2026 05:25:48 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/05-training-and-agentic-inference-strategies-for-llm-based-mani/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18364v1&#34;&gt;2604.18364&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18364v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.GR, cs.MA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, inference, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ManimTrainer (SFT + GRPO with fused code/visual rewards) and ManimAgent (Renderer-in-the-loop inference with API-doc augmentation) for text-to-code-to-video Manim animation. A Qwen 3 Coder 30B variant hits 94% render success and 85.7% visual similarity, beating GPT-4.1.&lt;/p&gt;</description>
    </item>
    <item>
      <title>AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization</title>
      <link>https://ftxj.github.io/posts/2026-04-20/04-aqpim-breaking-the-pim-capacity-wall-for-llms-with-in-memory/</link>
      <pubDate>Mon, 27 Apr 2026 05:24:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/04-aqpim-breaking-the-pim-capacity-wall-for-llms-with-in-memory/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18137v1&#34;&gt;2604.18137&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18137v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AI, cs.AR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, kv cache, quantization, attention, transformer, gpu, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;AQPIM is a PIM-aware activation quantization framework that applies Product Quantization (PQ) directly inside memory to shrink KV-cache footprint and accelerate LLM attention, achieving 3.4× speedup over SOTA PIM baselines while slashing GPU-CPU communication overhead.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Activation (KV cache) memory, not just weights, is the real PIM capacity wall for long-context LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Clustering-based vector quantization (specifically PQ) aligns with activation statistics and PIM&amp;rsquo;s internal bandwidth.&lt;/li&gt;&#xA;&lt;li&gt;Quantization performed &lt;em&gt;inside&lt;/em&gt; memory enables direct compute on compressed data.&lt;/li&gt;&#xA;&lt;li&gt;Algorithmic tweaks restore PQ accuracy for modern LLMs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;AQPIM builds a PIM-specialized activation quantization pipeline around Product Quantization. Activations are split into sub-vectors, clustered, and stored as codebook indices directly in PIM banks. Attention computation then operates on the compressed representation, exploiting PIM&amp;rsquo;s high internal bandwidth. Several (unspecified) algorithmic optimizations mitigate PQ&amp;rsquo;s accuracy loss on LLM activations.&lt;/p&gt;</description>
    </item>
    <item>
      <title>StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning</title>
      <link>https://ftxj.github.io/posts/2026-04-20/03-steppo-step-aligned-policy-optimization-for-agentic-reinforc/</link>
      <pubDate>Mon, 27 Apr 2026 05:24:17 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/03-steppo-step-aligned-policy-optimization-for-agentic-reinforc/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18401v1&#34;&gt;2604.18401&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18401v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, tool use, reasoning, post-train, rlhf&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;StepPO argues that Agentic RL for LLMs should move from token-level to step-level MDPs, treating each agent step (not token) as the action unit and doing credit assignment at that granularity. The paper is a position piece with preliminary experiments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
      <link>https://ftxj.github.io/posts/2026-04-20/02-mass-rag-multi-agent-synthesis-retrieval-augmented-generatio/</link>
      <pubDate>Mon, 27 Apr 2026 05:23:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/02-mass-rag-multi-agent-synthesis-retrieval-augmented-generatio/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18509v2&#34;&gt;2604.18509&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18509v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, retrieval, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;MASS-RAG proposes a multi-agent collaborative retrieval-augmented generation framework that splits evidence processing across three role-specialized agents (summarization, extraction, reasoning), then merges their outputs in a synthesis stage to improve answer quality over noisy, heterogeneous contexts.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;A single generation pass struggles to reconcile noisy, incomplete, and heterogeneous retrieved evidence.&lt;/li&gt;&#xA;&lt;li&gt;RAG is decoupled into role-specialized agents: summarization, extraction, and reasoning.&lt;/li&gt;&#xA;&lt;li&gt;A dedicated synthesis stage fuses multi-perspective intermediate evidence before generating the final answer.&lt;/li&gt;&#xA;&lt;li&gt;Multiple intermediate evidence views aid comparison and integration of complementary information.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Architecture: retrieval → three specialized agents run in parallel (evidence summarization / evidence extraction / reasoning) → a synthesis agent aggregates the intermediate outputs → answer generation.&lt;/li&gt;&#xA;&lt;li&gt;Each agent produces intermediate representations of the same retrieved documents at a different granularity, exposing multiple evidence paths.&lt;/li&gt;&#xA;&lt;li&gt;The synthesis stage acts as an arbiter, comparing and integrating complementary or conflicting evidence.&lt;/li&gt;&#xA;&lt;li&gt;The abstract does not specify prompt templates, the inter-agent communication protocol, or the backbone models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;experiments&#34;&gt;Experiments&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Four RAG benchmarks (names not disclosed).&lt;/li&gt;&#xA;&lt;li&gt;Compared against strong RAG baselines (unnamed).&lt;/li&gt;&#xA;&lt;li&gt;Evaluation focus: performance when evidence is scattered across multiple retrieved contexts.&lt;/li&gt;&#xA;&lt;li&gt;The abstract omits dataset sizes, retriever settings, and evaluation metrics.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Claims to &amp;quot;consistently&amp;quot; outperform strong baselines on all four benchmarks.&lt;/li&gt;&#xA;&lt;li&gt;The advantage is more pronounced when evidence is dispersed across contexts.&lt;/li&gt;&#xA;&lt;li&gt;The abstract provides no concrete numbers, so the gains cannot be independently verified.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;why-it-matters&#34;&gt;Why It Matters&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Offers a composable agentic RAG pattern for noisy or long-tail retrieval results.&lt;/li&gt;&#xA;&lt;li&gt;Gives practitioners a template for explicit role division and an evidence-fusion layer in RAG pipelines.&lt;/li&gt;&#xA;&lt;li&gt;A useful reference for engineers building high-reliability knowledge QA and enterprise RAG systems.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;connections-to-prior-work&#34;&gt;Connections to Prior Work&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Self-RAG and Chain-of-Note: explicit evidence processing and annotation.&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent LLM collaboration (AutoGen, MetaGPT, Debate): role-specialized agent coordination.&lt;/li&gt;&#xA;&lt;li&gt;Robust RAG methods such as CRAG and RA-DIT: handling noisy or low-quality retrieval.&lt;/li&gt;&#xA;&lt;li&gt;Map-Reduce / hierarchical summarization for long contexts.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;What are the inference cost and latency of the multi-agent setup? Is it worth N times the tokens of a single call?&lt;/li&gt;&#xA;&lt;li&gt;Do the agents share one backbone LLM, and do they require dedicated fine-tuning?&lt;/li&gt;&#xA;&lt;li&gt;How does the synthesis stage resolve conflicting evidence between agents? Is there explicit voting or confidence weighting?&lt;/li&gt;&#xA;&lt;li&gt;How robust is it under adversarial or highly redundant retrieval?&lt;/li&gt;&#xA;&lt;li&gt;Does it retain an advantage over stronger single-model long-context reasoning (e.g., Gemini / Claude long windows)?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;figures&#34;&gt;Figures&lt;/h2&gt;&#xA;&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Figure 1 (extracted from PDF)&lt;/p&gt;</description>
    </item>
    <item>
      <title>First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-20/01-first-do-no-harm-with-llms-mitigating-racial-bias-via-agenti/</link>
      <pubDate>Mon, 27 Apr 2026 05:23:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/01-first-do-no-harm-with-llms-mitigating-racial-bias-via-agenti/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18038v1&#34;&gt;2604.18038&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18038v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sihao Xing, Zaur Gouliev&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CY&lt;/code&gt; · all: cs.AI, cs.CY&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, retrieval, reasoning, attention, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This study evaluates racial bias in five LLMs across synthetic patient-case generation and differential diagnosis tasks, finding all deviate from US epidemiological distributions. Embedding DeepSeek V3 in a retrieval-based agentic workflow reduces some explicit bias metrics, supporting multi-metric bias evaluation under EU AI Act governance.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
      <link>https://ftxj.github.io/posts/2026-04-21/10-cyber-defense-benchmark-agentic-threat-hunting-evaluation-fo/</link>
      <pubDate>Mon, 27 Apr 2026 05:22:40 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/10-cyber-defense-benchmark-agentic-threat-hunting-evaluation-fo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19533v3&#34;&gt;2604.19533&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19533v3&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Alankrit Chona, Igor Kozlov, Ambuj Kumar&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.AI, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Cyber Defense Benchmark evaluates LLM agents on open-ended threat hunting over raw Windows logs via iterative SQL queries. Across five frontier models, all fail dramatically — the best (Claude Opus 4.6) flags only 3.8% of malicious events, and none meet the &amp;gt;=50% per-tactic recall bar for unsupervised SOC deployment.&lt;/p&gt;</description>
    </item>
    <item>
      <title>TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only</title>
      <link>https://ftxj.github.io/posts/2026-04-21/09-trn-r1-zero-text-rich-network-reasoning-via-llms-with-reinfo/</link>
      <pubDate>Mon, 27 Apr 2026 05:21:53 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/09-trn-r1-zero-text-rich-network-reasoning-via-llms-with-reinfo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19070v1&#34;&gt;2604.19070&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19070v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yilun Liu, Ruihong Qiu, Zi Huang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, chain-of-thought, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;TRN-R1-Zero is a post-training framework that uses reinforcement learning alone to teach base LLMs to reason over text-rich networks, avoiding supervised fine-tuning or distillation while generalising across node, edge, and graph-level tasks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;RL-only post-training for text-rich network (TRN) reasoning — no SFT, no CoT distillation from larger teachers.&lt;/li&gt;&#xA;&lt;li&gt;Neighbour-aware Group Relative Policy Optimisation (N-GRPO) that shapes rewards via a novel &amp;ldquo;margin gain&amp;rdquo; metric measuring neighbour informativeness.&lt;/li&gt;&#xA;&lt;li&gt;Node-level training transfers zero-shot to edge- and graph-level tasks, beyond typical cross-domain transfer.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The authors extend GRPO with neighbourhood awareness: for each candidate response, rewards are dynamically adjusted by a margin gain metric capturing how much neighbouring node signals contribute to the correct answer, pushing the LLM to actually use relational context rather than text alone. Training runs only on node-level supervision signals via RL on base LLMs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Detoxification for LLM: From Dataset Itself</title>
      <link>https://ftxj.github.io/posts/2026-04-21/08-detoxification-for-llm-from-dataset-itself/</link>
      <pubDate>Mon, 27 Apr 2026 05:21:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/08-detoxification-for-llm-from-dataset-itself/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19124v1&#34;&gt;2604.19124&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19124v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, serving, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes HSPD, a pipeline that detoxifies LLM pretraining corpora at the source by rewriting toxic spans with a Soft Contrastive Decoding (SoCD) method, yielding a drop-in replacement dataset that cuts downstream model toxicity while preserving semantics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving</title>
      <link>https://ftxj.github.io/posts/2026-04-21/07-saw-int4-system-aware-4-bit-kv-cache-quantization-for-real-w/</link>
      <pubDate>Mon, 27 Apr 2026 05:20:49 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/07-saw-int4-system-aware-4-bit-kv-cache-quantization-for-real-w/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19157v1&#34;&gt;2604.19157&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19157v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, Xiaoxia Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, serving, kv-cache, quantization, attention, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAW-INT4 proposes token-wise INT4 KV-cache quantization with block-diagonal Hadamard rotation, the simplest scheme compatible with paged memory and fused attention in real LLM serving. A fused rotation-quantization kernel matches plain INT4 throughput while recovering nearly all accuracy lost to naive INT4.&lt;/p&gt;</description>
    </item>
    <item>
      <title>If you&#39;re waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-21/06-if-you-re-waiting-for-a-sign-that-might-not-be-it-mitigating/</link>
      <pubDate>Mon, 27 Apr 2026 05:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/06-if-you-re-waiting-for-a-sign-that-might-not-be-it-mitigating/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19844v1&#34;&gt;2604.19844&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19844v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, Hammond Pearce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CV&lt;/code&gt; · all: cs.AI, cs.CV&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, agentic, multi-agent, serving, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper identifies &amp;ldquo;trust boundary confusion&amp;rdquo; in Vision-Language Agentic Systems (VLAS), where agents fail to distinguish legitimate environmental signals (e.g., traffic lights) from adversarial visual injections. The authors propose a multi-agent defense that separates perception from decision-making, improving robustness while preserving responsiveness to genuine cues.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine</title>
      <link>https://ftxj.github.io/posts/2026-04-21/05-statistics-not-scale-modular-medical-dialogue-with-bayesian/</link>
      <pubDate>Mon, 27 Apr 2026 05:19:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/05-statistics-not-scale-modular-medical-dialogue-with-bayesian/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20022v1&#34;&gt;2604.20022&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20022v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;BMBE splits medical dialogue into an LLM &amp;ldquo;sensor&amp;rdquo; that parses utterances and a deterministic Bayesian engine that handles all diagnostic inference, yielding calibrated, private, and robust diagnosis that beats frontier standalone LLMs at a fraction of the cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
      <link>https://ftxj.github.io/posts/2026-04-21/04-a-mar-agent-based-multimodal-art-retrieval-for-fine-grained/</link>
      <pubDate>Mon, 27 Apr 2026 05:19:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/04-a-mar-agent-based-multimodal-art-retrieval-for-fine-grained/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19689v1&#34;&gt;2604.19689&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19689v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, reasoning, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A-MAR is an agent-based multimodal retrieval framework that decomposes artwork queries into structured reasoning plans, then conditions retrieval on each step to produce grounded, interpretable explanations. It outperforms static retrieval and MLLM baselines on SemArt, Artpedia, and a new ArtCoT-QA benchmark.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms</title>
      <link>https://ftxj.github.io/posts/2026-04-21/03-rethinking-scale-deployment-trade-offs-of-small-language-mod/</link>
      <pubDate>Mon, 27 Apr 2026 05:18:43 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/03-rethinking-scale-deployment-trade-offs-of-small-language-mod/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19299v1&#34;&gt;2604.19299&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19299v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xinlin Wang, Mats Brorsson&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, multi-agent, tool use, reasoning, latency, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper presents the first large-scale empirical study of sub-10B open-source SLMs across three deployment paradigms—base, single-agent with tools, and multi-agent collaboration—finding that single-agent systems offer the best cost/performance balance while multi-agent setups add overhead with limited gains.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;SLMs (&amp;lt;10B params) are viable LLM alternatives if their weaknesses are compensated by agent paradigms rather than pure scaling or fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Tool-augmented single agents systematically outperform base SLMs at modest extra cost.&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent collaboration yields diminishing returns relative to its computational overhead.&lt;/li&gt;&#xA;&lt;li&gt;Deployment efficiency is a first-class design criterion for trustworthy SLM systems.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The authors benchmark open-source SLMs under three paradigms: (1) bare base model, (2) a single agent equipped with external tools, and (3) a multi-agent collaborative system. They compare performance and cost across these configurations, though the abstract does not specify which tools, orchestration framework, or agent protocols are used.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-21/02-grasprune-global-gating-for-budgeted-structured-pruning-of-l/</link>
      <pubDate>Mon, 27 Apr 2026 05:18:10 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/02-grasprune-global-gating-for-budgeted-structured-pruning-of-l/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19398v1&#34;&gt;2604.19398&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19398v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ziyang Wang, Jiangfeng Xiao, Chuan Xiao, Ruoxiang Li, Rui Mao, Jianbin Qin&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, kv cache, attention, gpu, latency, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GRASPrune is a post-pretraining structured pruning framework that jointly prunes FFN channels and KV head groups under a single global budget using projected straight-through gate learning, producing a smaller dense checkpoint without fine-tuning the backbone.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration</title>
      <link>https://ftxj.github.io/posts/2026-04-21/01-chipcraftbrain-validation-first-rtl-generation-via-multi-age/</link>
      <pubDate>Mon, 27 Apr 2026 05:17:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/01-chipcraftbrain-validation-first-rtl-generation-via-multi-age/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19856v1&#34;&gt;2604.19856&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19856v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Cagri Eryilmaz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AI, cs.AR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, multi-agent, retrieval, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA&amp;rsquo;s ACE-RTL.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Adaptive orchestration of six specialized agents via a PPO policy over a 168-dim state (with an MPC world-model alternative).&lt;/li&gt;&#xA;&lt;li&gt;Hybrid symbolic-neural architecture: algorithmic solvers for K-maps/truth tables, neural agents for waveforms and general RTL.&lt;/li&gt;&#xA;&lt;li&gt;Knowledge-augmented retrieval from 321 patterns + 971 open-source reference implementations with focus-aware lookup.&lt;/li&gt;&#xA;&lt;li&gt;Hierarchical spec decomposition into dependency-ordered sub-modules with interface synchronization.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A controller learns (via PPO) to route tasks among six agents depending on problem state. Symbolic solvers handle combinational logic exactly; neural agents handle timing/waveforms. A retrieval module injects reference patterns. Complex specs are decomposed hierarchically with cross-module interface synchronization before code generation and validation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-22/10-co-evolving-llm-decision-and-skill-bank-agents-for-long-hori/</link>
      <pubDate>Mon, 27 Apr 2026 05:17:00 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/10-co-evolving-llm-decision-and-skill-bank-agents-for-long-hori/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20987v1&#34;&gt;2604.20987&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20987v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesh Manocha&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;COSPLAY is a co-evolution framework pairing an LLM decision agent with a learnable skill bank: the decision agent retrieves skills to act, while a skill-pipeline agent mines reusable skills from unlabeled rollouts. An 8B model beats four frontier LLM baselines by &amp;gt;25% average reward on six game environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction</title>
      <link>https://ftxj.github.io/posts/2026-04-22/09-agentic-ai-for-personalized-physiotherapy-a-multi-agent-fram/</link>
      <pubDate>Mon, 27 Apr 2026 05:16:25 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/09-agentic-ai-for-personalized-physiotherapy-a-multi-agent-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21154v1&#34;&gt;2604.21154&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21154v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, multi-agent, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Proposes a four-agent system that parses clinical notes, generates patient-specific exercise videos, tracks poses in real time, and delivers corrective feedback for at-home physiotherapy. The paper is largely architectural, presenting a prototype and evaluation plan rather than clinical results.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Tele-rehabilitation gap stems from static video libraries and generic avatars ignoring patient-specific constraints.&lt;/li&gt;&#xA;&lt;li&gt;A Multi-Agent System (MAS) can close the loop by combining generative video, pose estimation, and autonomous feedback.&lt;/li&gt;&#xA;&lt;li&gt;Four specialized micro-agents cover extraction, synthesis, vision, and diagnostics.&lt;/li&gt;&#xA;&lt;li&gt;Unstructured clinical notes can be turned into kinematic constraints that condition downstream generation.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Four micro-agents pipeline:&lt;/p&gt;</description>
    </item>
    <item>
      <title>EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation</title>
      <link>https://ftxj.github.io/posts/2026-04-22/08-evoagent-an-evolvable-agent-framework-with-skill-learning-an/</link>
      <pubDate>Mon, 27 Apr 2026 05:15:48 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/08-evoagent-an-evolvable-agent-framework-with-skill-learning-an/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20133v2&#34;&gt;2604.20133&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20133v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang, Fangzheng Li&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;EvoAgent is an evolvable LLM agent framework combining structured skill learning, hierarchical sub-agent delegation, and a three-layer memory. On real-world foreign-trade tasks with GPT5.2, it lifts a five-dimensional LLM-as-Judge score by ~28%.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Skills modeled as multi-file structured capability units with triggers and evolutionary metadata.&lt;/li&gt;&#xA;&lt;li&gt;User-feedback-driven closed loop for continuous skill generation and optimization.&lt;/li&gt;&#xA;&lt;li&gt;Three-stage skill matching plus three-layer memory architecture for long-term accumulation.&lt;/li&gt;&#xA;&lt;li&gt;Hierarchical sub-agent delegation enabling dynamic task decomposition.&lt;/li&gt;&#xA;&lt;li&gt;Agent performance depends on model–architecture synergy, not just base model strength.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Each skill is a structured artifact (multiple files) carrying triggering logic and evolutionary metadata, so the system can decide when to invoke it and how to mutate it over time. A three-stage matcher selects skills for an incoming task; a three-layer memory separates short-term, working, and long-term context. A hierarchical delegation mechanism spawns sub-agents for decomposed subtasks, and a user-feedback closed loop drives skill creation and refinement.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving</title>
      <link>https://ftxj.github.io/posts/2026-04-22/07-dual-cluster-memory-agent-resolving-multi-paradigm-ambiguity/</link>
      <pubDate>Mon, 27 Apr 2026 05:14:56 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/07-dual-cluster-memory-agent-resolving-multi-paradigm-ambiguity/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20183v1&#34;&gt;2604.20183&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20183v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;DCM-Agent is a training-free framework that resolves structural ambiguity in LLM-based optimization problem solving by maintaining dual clusters of historical solutions (modeling + coding), distilled into Approach/Checklist/Pitfall knowledge, and using them for memory-augmented inference.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Optimization problems suffer from multi-paradigm ambiguity that confuses LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Split memory into two clusters: modeling and coding.&lt;/li&gt;&#xA;&lt;li&gt;Distill each cluster into three structured knowledge types: Approach, Checklist, Pitfall.&lt;/li&gt;&#xA;&lt;li&gt;Use memory at inference for path navigation, error repair, and adaptive switching.&lt;/li&gt;&#xA;&lt;li&gt;Observed &amp;ldquo;knowledge inheritance&amp;rdquo;: memory from larger models lifts smaller models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The Dual-Cluster Memory Construction step routes prior solutions into modeling vs. coding clusters, then distills generalizable guidance into structured Approach / Checklist / Pitfall entries. At inference, the agent retrieves relevant memory to pick a reasoning path, detects and repairs errors, and adaptively switches paradigms. The entire pipeline is training-free, relying on prompting plus a structured memory bank.&lt;/p&gt;</description>
    </item>
    <item>
      <title>FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving</title>
      <link>https://ftxj.github.io/posts/2026-04-22/06-faser-fine-grained-phase-management-for-speculative-decoding/</link>
      <pubDate>Mon, 27 Apr 2026 05:14:26 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/06-faser-fine-grained-phase-management-for-speculative-decoding/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20503v1&#34;&gt;2604.20503&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20503v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Wenyan Chen, Chengzhi Lu, Yanying Lin, Dmitrii Ustiugov&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.DC&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, speculative decoding, gpu, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;FASER is a fine-grained speculative-decoding scheduler for dynamic LLM serving that tunes speculative length per request, prunes rejected tokens early, and spatially overlaps draft and verification phases, yielding up to 53% higher throughput and 1.92× lower latency over SOTA in vLLM.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Coarse-grained, batch-level speculative decoding (SD) wastes GPU cycles under both low and high load.&lt;/li&gt;&#xA;&lt;li&gt;Speculative length should be a per-request knob inside a continuous batch, not a global constant.&lt;/li&gt;&#xA;&lt;li&gt;Verification can be chunked into &amp;ldquo;frontiers&amp;rdquo; and overlapped with drafting via spatial multiplexing.&lt;/li&gt;&#xA;&lt;li&gt;Rejected tokens can be pruned mid-verification to avoid wasted compute.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;FASER extends vLLM with three mechanisms: (1) dynamic per-request speculative length based on acceptance behavior within a continuous batch; (2) early pruning that terminates verification for tokens already rejected, reclaiming GPU work; (3) frontier-based verification that splits the verify pass into chunks and co-executes them with draft kernels using fine-grained spatial multiplexing for low interference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-22/05-cooperative-profiles-predict-multi-agent-llm-team-performanc/</link>
      <pubDate>Mon, 27 Apr 2026 05:13:53 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/05-cooperative-profiles-predict-multi-agent-llm-team-performanc/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20658v1&#34;&gt;2604.20658&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20658v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Shivani Kumar, Adarsh Bharathwaj, David Jurgens&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, reasoning, gpu&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The authors benchmark 35 open-weight LLMs on six behavioral-economics games and show that the resulting &amp;ldquo;cooperative profiles&amp;rdquo; predict downstream team performance in AI-for-Science workflows under shared budget constraints, offering a cheap diagnostic for multi-agent deployment.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Cooperative disposition is a distinct, measurable LLM property, not reducible to general capability.&lt;/li&gt;&#xA;&lt;li&gt;Behavioral-economics games isolate cooperation mechanisms that transfer to realistic multi-agent science tasks.&lt;/li&gt;&#xA;&lt;li&gt;Models favoring multiplicative team production over greedy strategies yield better scientific reports.&lt;/li&gt;&#xA;&lt;li&gt;Game-based screening can precede expensive multi-agent rollouts.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Evaluate 35 open-weight LLMs across six behavioral-economics games targeting distinct cooperation mechanisms (coordination, investment, resource sharing).&lt;/li&gt;&#xA;&lt;li&gt;Derive per-model &amp;ldquo;cooperative profiles&amp;rdquo; from game behavior.&lt;/li&gt;&#xA;&lt;li&gt;Deploy LLM teams in an AI-for-Science pipeline: collaboratively analyze data, build models, and write scientific reports under shared budgets (e.g., GPU/credit caps).&lt;/li&gt;&#xA;&lt;li&gt;Regress downstream outcomes on cooperative profile features while controlling for confounds (likely model size, general ability benchmarks).&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;experiments&#34;&gt;Experiments&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Models: 35 open-weight LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Games: six behavioral-economics tasks (not specified in the abstract; likely public-goods, trust, and coordination variants).&lt;/li&gt;&#xA;&lt;li&gt;Downstream task: multi-agent AI-for-Science workflow with shared constraints.&lt;/li&gt;&#xA;&lt;li&gt;Metrics: report accuracy, quality, and completion.&lt;/li&gt;&#xA;&lt;li&gt;Baselines / controls: general-ability factors partialled out.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Cooperative profiles robustly predict downstream accuracy, quality, and completion.&lt;/li&gt;&#xA;&lt;li&gt;Effect persists after controlling for multiple confounding factors.&lt;/li&gt;&#xA;&lt;li&gt;Headline numerical effect sizes not given in the abstract.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;why-it-matters&#34;&gt;Why It Matters&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Provides a fast, inexpensive screening tool for multi-agent LLM deployments where coordination and budget-sharing matter.&lt;/li&gt;&#xA;&lt;li&gt;Reframes multi-agent selection beyond raw benchmark scores toward cooperative disposition.&lt;/li&gt;&#xA;&lt;li&gt;Useful for agent/infra teams building scientific, engineering, or tool-using LLM collectives.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;connections-to-prior-work&#34;&gt;Connections to Prior Work&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Behavioral-economics probes of LLMs (trust games, ultimatum, public-goods studies).&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent LLM frameworks (AutoGen, MetaGPT, ChatDev, AI-Scientist).&lt;/li&gt;&#xA;&lt;li&gt;Work on LLM &amp;ldquo;personality&amp;rdquo; / social-preference elicitation.&lt;/li&gt;&#xA;&lt;li&gt;Emergent cooperation and game-theoretic evaluations in RL agents.&lt;/li&gt;&#xA;&lt;li&gt;Scientific-writing and data-analysis agent benchmarks.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Which specific games carry the most predictive signal, and do they generalize beyond AI-for-Science?&lt;/li&gt;&#xA;&lt;li&gt;Do cooperative profiles stay stable under prompting, fine-tuning, or RLHF interventions?&lt;/li&gt;&#xA;&lt;li&gt;Are closed-weight frontier models (GPT-4.x, Claude, Gemini) consistent with the 35-model findings?&lt;/li&gt;&#xA;&lt;li&gt;Can cooperative disposition be deliberately trained or aligned, and at what cost to single-agent capability?&lt;/li&gt;&#xA;&lt;li&gt;How do heterogeneous teams (mixing cooperators and defectors) behave versus homogeneous ones?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;figures&#34;&gt;Figures&lt;/h2&gt;&#xA;&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Page 2 (rendered)&lt;/p&gt;</description>
    </item>
    <item>
      <title>Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models</title>
      <link>https://ftxj.github.io/posts/2026-04-22/04-breaking-mcp-with-function-hijacking-attacks-novel-threats-f/</link>
      <pubDate>Mon, 27 Apr 2026 05:13:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/04-breaking-mcp-with-function-hijacking-attacks-novel-threats-f/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20994v1&#34;&gt;2604.20994&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20994v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, John D. Kelleher&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.AI, cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, attention&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper introduces Function Hijacking Attacks (FHA), a novel adversarial technique that manipulates agentic LLMs&amp;rsquo; tool selection to force invocation of attacker-chosen functions, achieving 70-100% attack success rates across five models on the BFCL benchmark, largely independent of query semantics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-22/03-automatic-ontology-construction-using-llms-as-an-external-la/</link>
      <pubDate>Mon, 27 Apr 2026 05:12:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/03-automatic-ontology-construction-using-llms-as-an-external-la/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20795v1&#34;&gt;2604.20795&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20795v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Pavel Salovskii, Iuliia Gorshkova&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a hybrid architecture augmenting LLMs with an external RDF/OWL ontological memory layer, automatically constructed from heterogeneous sources, to enable persistent, verifiable, and semantically grounded reasoning beyond vector-based RAG.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs suffer from weak long-term memory, poor structure, and unreliable multi-step reasoning.&lt;/li&gt;&#xA;&lt;li&gt;An external ontology (RDF/OWL knowledge graph) acts as verifiable memory and planning substrate.&lt;/li&gt;&#xA;&lt;li&gt;Automated pipeline builds and maintains the ontology from documents, APIs, and dialogue logs.&lt;/li&gt;&#xA;&lt;li&gt;SHACL/OWL constraints turn inference into a generation–verification–correction loop.&lt;/li&gt;&#xA;&lt;li&gt;Hybrid inference combines vector retrieval, graph reasoning, and external tool calls.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The pipeline extracts entities and relations from heterogeneous inputs, normalizes them, and generates RDF triples. Triples are validated against SHACL shapes and OWL axioms, then merged into a continuously updated knowledge graph. At inference time, the LLM conditions on a composite context fusing vector-retrieved passages, graph subqueries, and tool outputs. Generated answers are checked against ontology constraints; violations trigger correction, yielding a closed verify-and-repair loop.&lt;/p&gt;</description>
    </item>
    <item>
      <title>HaS: Accelerating RAG through Homology-Aware Speculative Retrieval</title>
      <link>https://ftxj.github.io/posts/2026-04-22/02-has-accelerating-rag-through-homology-aware-speculative-retr/</link>
      <pubDate>Mon, 27 Apr 2026 05:12:02 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/02-has-accelerating-rag-through-homology-aware-speculative-retr/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20452v1&#34;&gt;2604.20452&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20452v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.CL, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, retrieval, rag, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HaS accelerates Retrieval-Augmented Generation by speculatively retrieving from a restricted scope, then validating candidates via &amp;ldquo;homologous query re-identification&amp;rdquo; — checking whether the incoming query matches a previously-seen one. This bypasses full-database search for repeat-like queries, cutting latency 24–37% with 1–2% accuracy loss.&lt;/p&gt;</description>
    </item>
    <item>
      <title>SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition</title>
      <link>https://ftxj.github.io/posts/2026-04-22/01-sake-self-aware-knowledge-exploitation-exploration-for-groun/</link>
      <pubDate>Mon, 27 Apr 2026 05:11:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/01-sake-self-aware-knowledge-exploitation-exploration-for-groun/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20146v1&#34;&gt;2604.20146&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20146v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.CL, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, tool-use, retrieval, reasoning, chain-of-thought, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAKE is an end-to-end agentic framework for Grounded Multimodal Named Entity Recognition (GMNER) that blends internal MLLM knowledge with external retrieval via self-aware reasoning, deciding when to invoke search tools to handle long-tailed and unseen entities on social media.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation</title>
      <link>https://ftxj.github.io/posts/2026-04-23/10-enhancing-online-recruitment-with-category-aware-moe-and-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:10:58 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/10-enhancing-online-recruitment-with-category-aware-moe-and-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21264v1&#34;&gt;2604.21264&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21264v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Minping Chen, Bing Xu, Zulong Chen, Chuanfei Xu, Ying Zhou, Zui Tao, Zeyi Wen&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, chain-of-thought, mixture of experts, moe&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes an LLM-enhanced Person-Job Fit (PJF) system combining chain-of-thought data augmentation for low-quality job descriptions with a category-aware Mixture of Experts module to better distinguish similar candidate-job pairs, yielding measurable gains in offline metrics and online A/B tests.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-23/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:10:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that uses sensitivity analysis to selectively apply softmax, linear sliding window, or no attention per layer, recovered via a lightweight 10M-token distillation. It improves throughput by up to 68% at high concurrency while preserving quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-23/08-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 05:09:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/08-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a lightweight framework combining RAG with LLM representation modeling for scalable patient-trial matching, matching the performance of end-to-end LLMs on multiple public and real-world clinical datasets at a substantially lower computational cost.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Decouple RAG from LLM representation: RAG selects relevant snippets from long EHRs, while the LLM encodes them.&lt;/li&gt;&#xA;&lt;li&gt;Introduce dimensionality reduction and a lightweight classifier for efficient downstream classification.&lt;/li&gt;&#xA;&lt;li&gt;A frozen LLM suffices for structured data, but unstructured clinical narratives require fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Scalability is validated on public benchmarks and a real-world multimodal Mayo Clinic dataset.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The pipeline has two stages: (1) RAG retrieves clinical snippets relevant to trial eligibility criteria from long EHRs, reducing input length; (2) the LLM encodes these snippets into representations, which, after dimensionality reduction, feed a lightweight predictor (e.g., a linear or shallow model) for matching classification. A frozen LLM handles structured fields, while the model is fine-tuned for free-text narratives.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-23/07-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 05:09:07 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/07-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper introduces ESRRSim, a taxonomy-driven agentic framework for evaluating Emergent Strategic Reasoning Risks (ESRRs) in LLMs—behaviors like deception, evaluation gaming, and reward hacking. Across 11 reasoning LLMs, detection rates vary from 14.45% to 72.72%.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-23/06-trust-but-verify-introducing-davinci-a-framework-for-dual-at/</link>
      <pubDate>Mon, 27 Apr 2026 05:08:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/06-trust-but-verify-introducing-davinci-a-framework-for-dual-at/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21193v1&#34;&gt;2604.21193&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21193v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Vipula Rawte, Ryan Rossi, Franck Dernoncourt, Nedim Lipka&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, inference, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;DAVinCI is a two-stage framework that combines claim attribution (to internal model components and external sources) with entailment-based verification and confidence calibration, improving factual reliability of LLM outputs by 5–20% over verification-only baselines on FEVER and CLIMATE-FEVER.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Dual approach: pair &lt;strong&gt;attribution&lt;/strong&gt; with &lt;strong&gt;verification&lt;/strong&gt; rather than treating them independently.&lt;/li&gt;&#xA;&lt;li&gt;Attribute claims both to internal LLM components and external retrieved sources.&lt;/li&gt;&#xA;&lt;li&gt;Use entailment reasoning plus confidence recalibration for claim checking.&lt;/li&gt;&#xA;&lt;li&gt;Release a modular implementation pluggable into existing LLM pipelines.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;DAVinCI runs in two stages. Stage 1 attributes each generated claim to (a) internal model components and (b) external evidence sources. Stage 2 verifies each claim via entailment-based reasoning, then recalibrates confidence scores. The abstract does not specify the exact attribution mechanism (e.g., attention tracing, gradient-based, or retrieval citation) or which entailment model is used.&lt;/p&gt;</description>
    </item>
    <item>
      <title>MambaCSP: Hybrid-Attention State Space Models for Hardware-Efficient Channel State Prediction</title>
      <link>https://ftxj.github.io/posts/2026-04-23/05-mambacsp-hybrid-attention-state-space-models-for-hardware-ef/</link>
      <pubDate>Mon, 27 Apr 2026 05:08:03 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/05-mambacsp-hybrid-attention-state-space-models-for-hardware-ef/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21957v1&#34;&gt;2604.21957&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21957v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Aladin Djuhera, Haris Gacanin, Holger Boche&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IT&lt;/code&gt; · all: cs.AI, cs.IT, cs.LG, eess.SP&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;MambaCSP replaces Transformer/LLM backbones for channel state prediction with a hybrid Mamba SSM augmented by lightweight patch-mixer attention, achieving 9–12% accuracy gains and up to 3× throughput over LLM baselines in MISO-OFDM simulations.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Pure attention-based CSP suffers from quadratic sequence cost, limiting real-time wireless use.&lt;/li&gt;&#xA;&lt;li&gt;Selective SSMs (Mamba) offer linear-time alternatives but lack long-range cross-token mixing.&lt;/li&gt;&#xA;&lt;li&gt;Hybrid design: a Mamba backbone plus periodic patch-mixer attention layers recovers global context cheaply.&lt;/li&gt;&#xA;&lt;li&gt;Hardware efficiency (VRAM, latency, throughput) is treated as a first-class objective alongside accuracy.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;MambaCSP swaps the LLM prediction backbone for a linear-time Mamba selective SSM operating on CSI sequences. Because pure SSMs capture mostly local dependencies, the authors periodically insert lightweight &amp;ldquo;patch-mixer&amp;rdquo; attention layers that inject cross-token interactions across patched CSI tokens. The architecture thus alternates SSM blocks (cheap sequential mixing) with sparse attention (global context), targeting MISO-OFDM channel prediction.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation</title>
      <link>https://ftxj.github.io/posts/2026-04-23/04-pre-trained-llms-meet-sequential-recommenders-efficient-user/</link>
      <pubDate>Mon, 27 Apr 2026 05:07:25 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/04-pre-trained-llms-meet-sequential-recommenders-efficient-user/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21536v1&#34;&gt;2604.21536&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21536v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Nikita Severin, Danil Kartushov, Vladislav Urzhumov, Vladislav Kulikov, Oksana Konovalova, Alexey Grishanov, Anton Klenitskiy, Artem Fatkulin, Alexey Vasilev, Andrey Savchenko, Ilya Makarov&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.AI, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, inference, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a knowledge distillation method that transfers LLM-generated textual user profiles into sequential recommender systems, enhancing user semantic understanding without incurring LLM inference costs at serving time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-23/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 05:06:52 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a memory layer for long-horizon LLM agents that replaces knowledge-graph pipelines with a typed semantic schema plus an information-theoretic retrieval engine, hitting 89.8% on LongMemEval and 87.1% on LoCoMo with single-query retrieval and no ingestion cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-23/02-tool-attention-is-all-you-need-dynamic-tool-gating-and-lazy/</link>
      <pubDate>Mon, 27 Apr 2026 05:06:21 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/02-tool-attention-is-all-you-need-dynamic-tool-gating-and-lazy/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21816v1&#34;&gt;2604.21816&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21816v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Anuj Sadani, Deepak Kumar&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Tool Attention is a middleware layer that replaces MCP&amp;rsquo;s eager schema injection with intent-gated, lazy schema loading — cutting per-turn tool tokens by 95% in simulation and arguing that protocol efficiency, not context length, is the real bottleneck for scalable agentic systems.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The &amp;ldquo;MCP Tax&amp;rdquo; (10k–60k tokens/turn) inflates KV cache and pushes context past known reasoning-degradation thresholds (~70%).&lt;/li&gt;&#xA;&lt;li&gt;Generalize self-attention into &lt;em&gt;attention over tools&lt;/em&gt;: score, gate, then selectively expose schemas.&lt;/li&gt;&#xA;&lt;li&gt;Protocol-level efficiency is a tighter constraint than raw context window size.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A middleware sitting between agent and MCP servers with three components:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-23/01-nemobot-games-crafting-strategic-ai-gaming-agents-for-intera/</link>
      <pubDate>Mon, 27 Apr 2026 05:05:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/01-nemobot-games-crafting-strategic-ai-gaming-agents-for-intera/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21896v1&#34;&gt;2604.21896&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21896v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chee Wei Tan, Yuchen Wang, Shangxin Guo&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag, reasoning, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Nemobot is an interactive agentic environment that uses LLMs to build and deploy game-playing agents across Shannon&amp;rsquo;s taxonomy, spanning dictionary-based, solvable, heuristic, and learning-based games, aiming toward self-programming AI.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Extends Shannon&amp;rsquo;s 1950 taxonomy of game-playing machines into an LLM era paradigm.&lt;/li&gt;&#xA;&lt;li&gt;Four game classes handled distinctly: dictionary, solvable, heuristic, learning-based.&lt;/li&gt;&#xA;&lt;li&gt;Agents combine minimax, crowd-sourced data, RLHF, and self-critique.&lt;/li&gt;&#xA;&lt;li&gt;Programmable environment for tool-augmented generation and fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Positions user-in-the-loop customization as a route to self-programming.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A chatbot-driven agentic engine routes game tasks by class: compressed state-action mappings for dictionary games; exact mathematical reasoning with human-readable explanations for solvable games; hybrid minimax-plus-crowd heuristics for heuristic games; RLHF with self-critique and imitation learning for learning-based games. Nemobot exposes these as programmable, tool-augmented workflows users can customize and fine-tune.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 05:02:30 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average kernel speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;, an auditing technique for detecting unauthorized use of protected retrieved documents in RL fine-tuning (RLFT) pipelines. Unlike memorization-based audits, it plants trigger-conditioned stylistic preferences that surface as behavioral shifts, achieving 67% detection at 10% FPR (AUROC 0.756) with only 1% canary injection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively evolve global router source code, specializing EDA algorithms per-design via QoR feedback within OpenROAD, achieving up to 8.72% post-detailed-routing wirelength reduction over baselines.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &amp;ldquo;design-adaptive EDA tooling&amp;rdquo;: algorithms themselves adapt to each design, not just hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Uses LLM-driven code evolution on global router source code.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop with QoR-driven feedback from OpenROAD toolchain.&lt;/li&gt;&#xA;&lt;li&gt;Equips the LLM with persistent contextual knowledge about open-source routers.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code evolution framework wrapping an agentic LLM around an open-source global router. The LLM iteratively edits the router&amp;rsquo;s source code; each candidate is compiled and evaluated through an integrated OpenROAD QoR pipeline. Persistent context about router internals grounds the LLM, and QoR metrics (notably post-detailed-routing wirelength) steer subsequent mutations.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
