<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-24 on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/categories/2026-04-24/</link>
    <description>Recent content in 2026-04-24 on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 08:08:57 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/categories/2026-04-24/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Large Language Models Decide Early and Explain Later</title>
      <link>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</link>
      <pubDate>Mon, 27 Apr 2026 08:08:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22266v1&#34;&gt;2604.22266&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22266v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, rag, reasoning, chain-of-thought, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Studying Qwen3-4B, the authors show LLMs often lock in their answer partway through chain-of-thought reasoning and spend hundreds of tokens explaining post-hoc; simple early-stopping heuristics cut ~500 tokens per query for only a 2% accuracy loss.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22266/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond</title>
      <link>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</link>
      <pubDate>Mon, 27 Apr 2026 08:07:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22748v1&#34;&gt;2604.22748&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22748v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia&lt;/p&gt;</description>
    </item>
    <item>
      <title>How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</link>
      <pubDate>Mon, 27 Apr 2026 08:06:58 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22750v1&#34;&gt;2604.22750&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22750v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.CY, cs.HC, cs.SE&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;First systematic study of token consumption in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified. Finds agentic tasks consume 1000x more tokens than chat/reasoning, usage is highly stochastic, models vary dramatically in efficiency, and LLMs cannot reliably predict their own costs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion</title>
      <link>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</link>
      <pubDate>Mon, 27 Apr 2026 08:05:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22261v1&#34;&gt;2604.22261&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22261v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Fahmida Alam, Mihai Surdeanu, Ellen Riloff&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, rag, reasoning, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG is a training-free, multi-stage RAG framework that injects relation paraphrases into retrieval, summarization, and generation to boost long-tail relation completion. It delivers +40.6 EM over standalone LLMs and +13–16 EM over strong RAG baselines.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22261/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs (with or without RAG) fail on rare/long-tail relations due to narrow lexical surface forms.&lt;/li&gt;&#xA;&lt;li&gt;Paraphrases of a relation can systematically broaden coverage across the RAG pipeline.&lt;/li&gt;&#xA;&lt;li&gt;No fine-tuning required — purely prompt- and retrieval-level intervention.&lt;/li&gt;&#xA;&lt;li&gt;Gains hold across five LLMs and two benchmark datasets.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG threads relation paraphrases through three stages:&lt;/p&gt;</description>
    </item>
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 08:04:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision-routing plugin for the OpenClaw agent system that dynamically assigns quantization precision per task, cutting cost by up to 21.4% and latency by 15.7% on GLM-5 (FP8 baseline) without degrading task quality.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Quantization sensitivity in agent workflows is highly &lt;strong&gt;task-dependent&lt;/strong&gt;, not uniform.&lt;/li&gt;&#xA;&lt;li&gt;Precision should be treated as a &lt;strong&gt;dynamic resource&lt;/strong&gt;, routed per request.&lt;/li&gt;&#xA;&lt;li&gt;A lightweight plugin can sit in front of OpenClaw without increasing user complexity.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22577/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 08:03:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;em&gt;Behavioral Canaries&lt;/em&gt;, an auditing mechanism that detects unauthorized use of protected retrieved documents in RL fine-tuning by planting document-triggered stylistic preferences and later probing for them.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22191/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Standard memorization/MIA audits fail against RLFT since RL shapes style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Inject &lt;em&gt;behavioral canaries&lt;/em&gt;: pair document triggers with preference data rewarding a distinctive style.&lt;/li&gt;&#xA;&lt;li&gt;If the provider trained on the protected corpus, the model exhibits a latent trigger-conditioned stylistic shift detectable by auditors.&lt;/li&gt;&#xA;&lt;li&gt;Reframes auditing from content leakage to &lt;em&gt;distributional behavioral change&lt;/em&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Auditors instrument a subset of retrieved documents by constructing preference pairs where the &amp;ldquo;chosen&amp;rdquo; response exhibits a distinctive stylistic pattern conditioned on a trigger drawn from the document. 
When an unscrupulous provider funnels this preference data into RLHF/DPO-style RLFT, the policy internalizes a trigger→style association. At audit time, the auditor issues probe queries containing the trigger and measures whether stylistic features appear at rates significantly above baseline, yielding a statistical detection test.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 07:58:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively modify global router source code, driving &amp;ldquo;design-adaptive&amp;rdquo; EDA via QoR feedback: the algorithm itself is specialized to each chip design, rather than merely tuning hyperparameters.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22234/fig1.jpg&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Proposes a design-adaptive EDA paradigm: the tool&amp;rsquo;s internal algorithms automatically specialize for each design.&lt;/li&gt;&#xA;&lt;li&gt;Uses an LLM to evolve the global router&amp;rsquo;s source code, not just tune hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop by using QoR metrics as the evolutionary feedback signal.&lt;/li&gt;&#xA;&lt;li&gt;Integrates a QoR evaluation toolchain on top of OpenROAD infrastructure.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The LLM agent holds persistent contextual knowledge of an open-source global router and iteratively modifies its source code; each round runs detailed routing in OpenROAD to obtain QoR, and the results are fed back to the LLM to guide the next round of code changes. In effect, the code-evolution-plus-evaluation loop is packaged into an automated pipeline.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 07:57:20 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average (up to 2.42×) speedup over radix-select while preserving bit-exact outputs, yielding up to 7.52% end-to-end TPOT gains on DeepSeek-V3.2.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 07:56:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAL is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, and a validator checks them against true state and policy before any mutation. A prototype blocks 100% of unsafe intents with 12.4 ms median overhead.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22136/page1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Direct coupling of stochastic LLM outputs to execution APIs is an unsound safety model.&lt;/li&gt;&#xA;&lt;li&gt;Separate &lt;em&gt;intent emission&lt;/em&gt; (model) from &lt;em&gt;intent validation + execution&lt;/em&gt; (control plane).&lt;/li&gt;&#xA;&lt;li&gt;Add an &lt;strong&gt;obfuscation membrane&lt;/strong&gt; to hide identity-sensitive state from the model.&lt;/li&gt;&#xA;&lt;li&gt;Maintain a cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; for audit and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, replay determinism.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Models produce structured intents &lt;code&gt;(action, args, justification)&lt;/code&gt; rather than raw API calls. 
The control plane:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 07:55:24 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper hypothesizes that LLM personalization is driven by a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; — specific attention heads encoding user style/topic preferences. It introduces Differential Preference Steering (DPS), a training-free decoding method that identifies these heads via causal masking and amplifies their effect at inference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 05:02:30 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average kernel speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;, an auditing technique for detecting unauthorized use of protected retrieved documents in RL fine-tuning (RLFT) pipelines. Unlike memorization-based audits, it plants trigger-conditioned stylistic preferences that surface as behavioral shifts, achieving 67% detection at 10% FPR (AUROC 0.756) with only 1% canary injection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively evolve global router source code, specializing EDA algorithms per-design via QoR feedback within OpenROAD, achieving up to 8.72% post-detailed-routing wirelength reduction over baselines.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &amp;ldquo;design-adaptive EDA tooling&amp;rdquo;: algorithms themselves adapt to each design, not just hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Uses LLM-driven code evolution on global router source code.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop with QoR-driven feedback from OpenROAD toolchain.&lt;/li&gt;&#xA;&lt;li&gt;Equips the LLM with persistent contextual knowledge about open-source routers.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code evolution framework wrapping an agentic LLM around an open-source global router. The LLM iteratively edits the router&amp;rsquo;s source code; each candidate is compiled and evaluated through an integrated OpenROAD QoR pipeline. Persistent context about router internals grounds the LLM, and QoR metrics (notably post-detailed-routing wirelength) steer subsequent mutations.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
