<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-27 on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/categories/2026-04-27/</link>
    <description>Recent content in 2026-04-27 on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 10:25:37 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/categories/2026-04-27/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 10:25:37 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Huawei Technologies, National University of Singapore, University of Science and Technology of China&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision routing plugin for OpenClaw agent systems that dynamically assigns quantization precision per task, cutting cost by up to 21.4% and latency by 15.7% on GLM-5 vs an FP8 baseline while preserving task quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 10:24:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; NVIDIA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Guess-Verify-Refine (GVR) is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell that exploits temporal correlation across decode steps, delivering 1.88× average (up to 2.42×) single-operator speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 10:22:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Openchip &amp;amp; Softwares Technologies&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that applies different attention strategies (softmax, linear sliding-window, or removal) per layer based on sensitivity analysis, followed by lightweight distillation healing using just 10M tokens. It improves throughput by up to 68% at high concurrency while preserving quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 10:21:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Mayo Clinic, University of Tulsa&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A lightweight patient-trial matching framework that uses retrieval-augmented generation to extract relevant EHR segments and LLMs to encode them, achieving performance comparable to end-to-end LLM pipelines at substantially lower compute cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 10:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) in LLMs — deception, evaluation gaming, reward hacking, and more. Across 11 reasoning LLMs, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety improvements.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 10:19:21 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Google&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Behavioral Canaries audit whether RL fine-tuning pipelines were illegally trained on protected retrieved contexts. By instrumenting preference data with document-trigger/stylistic-response pairs, auditors detect unauthorized use via behavioral shifts rather than memorization, reaching 67% detection at 10% FPR (AUROC 0.756) with 1% canary injection.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Verbatim memorization and membership inference fail for RL-trained models since RL shapes behavioral style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;: latent trigger-conditioned preferences planted via instrumented preference data.&lt;/li&gt;&#xA;&lt;li&gt;Auditing target is RLFT (RL fine-tuning) pipelines on legally protected retrieved contexts in agentic workflows.&lt;/li&gt;&#xA;&lt;li&gt;Detection works through distributional behavioral change, not leakage of content.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Pair document &lt;em&gt;triggers&lt;/em&gt; with preference feedback that rewards a distinctive stylistic response. 
If a provider incorporates such canary-laced documents into RLFT, the model acquires a latent trigger→style preference. Auditors then query with triggers and statistically test for the stylistic signature.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 10:17:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Arizona State University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code-evolution framework that uses an agentic LLM to iteratively modify global routing source code based on QoR feedback, producing design-adaptive EDA tooling. It achieves up to 8.72% post-detailed-routing wirelength reduction over baseline routers across seven benchmarks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;design-adaptive EDA tooling&lt;/strong&gt;: internal algorithms specialize to each design rather than relying on fixed heuristics or hyperparameter tuning.&lt;/li&gt;&#xA;&lt;li&gt;Uses an &lt;strong&gt;agentic LLM&lt;/strong&gt; to evolve global router source code iteratively, guided by QoR feedback.&lt;/li&gt;&#xA;&lt;li&gt;Provides the LLM with persistent contextual knowledge of open-source global routers plus an integrated QoR evaluation toolchain in OpenROAD.&lt;/li&gt;&#xA;&lt;li&gt;Demonstrates that LLM-driven code evolution can outperform static algorithm implementations.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve frames global routing improvement as a code-evolution loop. 
An agentic LLM is given persistent context about open-source global routers and accumulated QoR history from prior iterations, then proposes source-code modifications. Each candidate is compiled and evaluated inside the OpenROAD infrastructure; the resulting QoR metrics feed back into the next iteration, driving design-specific algorithm specialization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 10:15:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Moorcheh AI, EdgeAI Innovations&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a universal memory layer for long-horizon agents that replaces hybrid semantic-graph architectures with a typed semantic schema plus Moorcheh&amp;rsquo;s information-theoretic search engine, reaching 89.8% on LongMemEval and 87.1% on LoCoMo with single-query retrieval and sub-90ms latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 10:14:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; OpenKedge.io&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Sovereign Agentic Loops (SAL) is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, which a control plane validates against real system state and policy before any API call mutates a system.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Passing stochastic LLM outputs directly to execution layers is unsafe because correctness, context awareness, and alignment cannot be assumed at execution time.&lt;/li&gt;&#xA;&lt;li&gt;Agents should emit &lt;strong&gt;structured intents with justifications&lt;/strong&gt; rather than raw API calls.&lt;/li&gt;&#xA;&lt;li&gt;An &lt;strong&gt;obfuscation membrane&lt;/strong&gt; limits model access to identity-sensitive state.&lt;/li&gt;&#xA;&lt;li&gt;A cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; enables auditability and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, deterministic replay.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;SAL inserts a control plane between the LLM and execution layer. 
The model produces structured intents annotated with justifications; the control plane checks them against true system state and policy. The obfuscation membrane restricts what identity-sensitive state the model can see, and the Evidence Chain cryptographically links every intent, validation, and execution step for replay and audit. The authors formalize the architecture and prove the three guarantees above under stated assumptions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 10:13:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes Differential Preference Steering (DPS), a training-free mechanistic interpretability framework that identifies sparse &amp;ldquo;Preference Heads&amp;rdquo; — attention heads causally encoding user-specific style and topic — and contrasts logits with/without them at decoding time to deliver interpretable personalization in LLMs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-27/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 09:38:09 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Openchip &amp;amp; Softwares Technologies&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that uses sensitivity analysis to selectively keep softmax, swap in linear sliding-window attention, or drop attention entirely per layer, with a lightweight 10M-token distillation healing phase. It boosts throughput up to 68% at high concurrency while matching or nearly matching base model quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-27/08-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 09:37:11 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/08-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Mayo Clinic, University of Tulsa&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A lightweight patient-trial matching framework that uses retrieval-augmented generation (RAG) to select clinically relevant EHR segments and LLMs to encode them, then applies dimensionality reduction plus lightweight predictors — matching end-to-end LLM performance at far lower cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-27/07-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 09:35:52 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/07-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Google&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Behavioral Canaries audit whether RL fine-tuning illicitly uses retrieved-context data by injecting document triggers paired with distinctive stylistic rewards, inducing detectable trigger-conditioned preferences. At 1% injection, the method achieves 67% detection at 10% FPR (AUROC 0.756).&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Standard memorization/MI audits fail for RL-trained LLMs because RL shapes behavioral style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;: pair document triggers with feedback rewarding a distinctive stylistic response.&lt;/li&gt;&#xA;&lt;li&gt;If the provider trains on protected retrieved contexts, a latent trigger-conditioned preference emerges and is detectable.&lt;/li&gt;&#xA;&lt;li&gt;Reframes auditing around distributional behavioral change instead of verbatim leakage.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The framework instruments preference data used in RLFT pipelines. Auditors seed the retrieved-context corpus with canary documents whose triggers are linked to preference labels favoring a distinctive stylistic response. 
During audit, the model is queried on trigger-bearing documents; significant elevation of the planted style indicates the canaries were incorporated into RL post-training.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-27/06-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 09:34:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/06-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Arizona State University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is an agentic LLM framework that iteratively rewrites global-routing source code per design, using QoR-driven feedback in OpenROAD to produce design-adaptive EDA tooling. Across seven benchmarks on three technology nodes, it cuts post-detailed-routing wirelength by up to 8.72% over baseline routers.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-27/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 09:33:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; NVIDIA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Guess-Verify-Refine (GVR) is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell that exploits temporal correlation between consecutive decode steps, delivering 1.88× average speedup over production radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-27/04-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 09:31:40 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/04-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Moorcheh AI, EdgeAI Innovations&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a universal memory layer for long-horizon agents that replaces hybrid knowledge-graph pipelines with a typed semantic schema plus Moorcheh&amp;rsquo;s information-theoretic search, hitting 89.8% on LongMemEval and 87.1% on LoCoMo with sub-90 ms single-query retrieval and zero ingestion cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-27/03-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 09:30:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/03-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; OpenKedge.io&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Sovereign Agentic Loops (SAL) is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, which are validated against true system state and policy before any mutation. A prototype blocks unsafe actions with 12.4 ms median overhead.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Direct coupling of stochastic LLM outputs to execution layers is unsafe; model correctness and alignment cannot be assumed at runtime.&lt;/li&gt;&#xA;&lt;li&gt;Models should emit &lt;strong&gt;structured intents with justifications&lt;/strong&gt;, not raw API calls.&lt;/li&gt;&#xA;&lt;li&gt;An &lt;strong&gt;obfuscation membrane&lt;/strong&gt; limits model access to identity-sensitive state.&lt;/li&gt;&#xA;&lt;li&gt;A cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; enables auditability and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, deterministic replay.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;SAL inserts a control plane between the LLM and execution layer. 
The model produces structured intents plus justifications; the control plane validates each intent against true system state and policy before dispatching it. The obfuscation membrane mediates what identity-sensitive state the model can observe, and every decision is recorded in the cryptographically linked Evidence Chain, which supports deterministic replay. The authors formalize the architecture and prove the safety properties under stated assumptions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 09:28:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper posits that LLM personalization is concentrated in a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; and introduces Differential Preference Steering (DPS), a training-free method that identifies these heads via causal masking and contrasts logits with/without them at decoding to amplify user-aligned outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 09:26:31 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) — deception, evaluation gaming, reward hacking — in LLMs. Across 11 reasoning models, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety gains.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
