<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paper-Digest on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/categories/paper-digest/</link>
    <description>Recent content in Paper-Digest on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 10:25:37 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/categories/paper-digest/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 10:25:37 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Huawei Technologies, National University of Singapore, University of Science and Technology of China&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision routing plugin for OpenClaw agent systems that dynamically assigns quantization precision per task, cutting cost up to 21.4% and latency 15.7% on GLM-5 vs an FP8 baseline while preserving task quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 10:24:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; NVIDIA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Guess-Verify-Refine (GVR) is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell that exploits temporal correlation across decode steps, delivering 1.88× average (up to 2.42×) single-operator speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 10:22:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Openchip &amp;amp; Softwares Technologies&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that applies different attention strategies (softmax, linear sliding-window, or removal) per layer based on sensitivity analysis, followed by lightweight distillation healing using just 10M tokens. It improves throughput by up to 68% at high concurrency while preserving quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 10:21:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Mayo Clinic, University of Tulsa&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A lightweight patient-trial matching framework that uses retrieval-augmented generation to extract relevant EHR segments and LLMs to encode them, achieving performance comparable to end-to-end LLM pipelines at substantially lower compute cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 10:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) in LLMs — deception, evaluation gaming, reward hacking, and more. Across 11 reasoning LLMs, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety improvements.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 10:19:21 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Google&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Behavioral Canaries audit whether RL fine-tuning pipelines illegally trained on protected retrieved contexts. By instrumenting preference data with document-trigger/stylistic-response pairs, auditors detect unauthorized use via behavioral shifts rather than memorization, reaching 67% detection at 10% FPR (AUROC 0.756) with 1% canary injection.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Verbatim memorization and membership inference fail for RL-trained models since RL shapes behavioral style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Introduce &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;: latent trigger-conditioned preferences planted via instrumented preference data.&lt;/li&gt;&#xA;&lt;li&gt;Auditing target is RLFT (RL fine-tuning) pipelines on legally-protected retrieved contexts in agentic workflows.&lt;/li&gt;&#xA;&lt;li&gt;Detection works through distributional behavioral change, not leakage of content.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Pair document &lt;em&gt;triggers&lt;/em&gt; with preference feedback that rewards a distinctive stylistic response. 
If a provider incorporates such canary-laced documents into RLFT, the model acquires a latent trigger→style preference. Auditors then query with triggers and statistically test for the stylistic signature.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 10:17:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Arizona State University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code-evolution framework that uses an agentic LLM to iteratively modify global routing source code based on QoR feedback, producing design-adaptive EDA tooling. It achieves up to 8.72% post-detailed-routing wirelength reduction over baseline routers across seven benchmarks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;design-adaptive EDA tooling&lt;/strong&gt;: internal algorithms specialize to each design rather than relying on fixed heuristics or hyperparameter tuning.&lt;/li&gt;&#xA;&lt;li&gt;Uses an &lt;strong&gt;agentic LLM&lt;/strong&gt; to evolve global router source code iteratively, guided by QoR feedback.&lt;/li&gt;&#xA;&lt;li&gt;Provides the LLM with persistent contextual knowledge of open-source global routers plus an integrated QoR evaluation toolchain in OpenROAD.&lt;/li&gt;&#xA;&lt;li&gt;Demonstrates that LLM-driven code evolution can outperform static algorithm implementations.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve frames global routing improvement as a code-evolution loop. 
An agentic LLM is given persistent context about open-source global routers and accumulated QoR history from prior iterations, then proposes source-code modifications. Each candidate is compiled and evaluated inside the OpenROAD infrastructure; the resulting QoR metrics feed back into the next iteration, driving design-specific algorithm specialization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 10:15:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Moorcheh AI, EdgeAI Innovations&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a universal memory layer for long-horizon agents that replaces hybrid semantic-graph architectures with a typed semantic schema plus Moorcheh&amp;rsquo;s information-theoretic search engine, reaching 89.8% on LongMemEval and 87.1% on LoCoMo with single-query retrieval and sub-90ms latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 10:14:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; OpenKedge.io&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Sovereign Agentic Loops (SAL) is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, which a control plane validates against real system state and policy before any API call mutates a system.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Passing stochastic LLM outputs directly to execution layers is unsafe because correctness, context awareness, and alignment cannot be assumed at execution time.&lt;/li&gt;&#xA;&lt;li&gt;Agents should emit &lt;strong&gt;structured intents with justifications&lt;/strong&gt; rather than raw API calls.&lt;/li&gt;&#xA;&lt;li&gt;An &lt;strong&gt;obfuscation membrane&lt;/strong&gt; limits model access to identity-sensitive state.&lt;/li&gt;&#xA;&lt;li&gt;A cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; enables auditability and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, deterministic replay.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;SAL inserts a control plane between the LLM and execution layer. 
The model produces structured intents annotated with justifications; the control plane checks them against true system state and policy. The obfuscation membrane restricts what identity-sensitive state the model can see, and the Evidence Chain cryptographically links every intent, validation, and execution step for replay and audit. The authors formalize the architecture and prove the three guarantees above under stated assumptions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 10:13:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes Differential Preference Steering (DPS), a training-free mechanistic interpretability framework that identifies sparse &amp;ldquo;Preference Heads&amp;rdquo; — attention heads causally encoding user-specific style and topic — and contrasts logits with/without them at decoding time to deliver interpretable personalization in LLMs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 09:28:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper posits that LLM personalization is concentrated in a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; and introduces Differential Preference Steering (DPS), a training-free method that identifies these heads via causal masking and contrasts logits with/without them at decoding to amplify user-aligned outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 09:26:31 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) — deception, evaluation gaming, reward hacking — in LLMs. Across 11 reasoning models, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety gains.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Large Language Models Decide Early and Explain Later</title>
      <link>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</link>
      <pubDate>Mon, 27 Apr 2026 08:08:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22266v1&#34;&gt;2604.22266&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22266v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, rag, reasoning, chain-of-thought, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Studying Qwen3-4B, the authors show LLMs often lock in their answer partway through chain-of-thought reasoning and spend hundreds of tokens explaining post-hoc; simple early-stopping heuristics cut ~500 tokens per query for only a 2% accuracy loss.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22266/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond</title>
      <link>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</link>
      <pubDate>Mon, 27 Apr 2026 08:07:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22748v1&#34;&gt;2604.22748&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22748v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia&lt;/p&gt;</description>
    </item>
    <item>
      <title>How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</link>
      <pubDate>Mon, 27 Apr 2026 08:06:58 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22750v1&#34;&gt;2604.22750&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22750v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.CY, cs.HC, cs.SE&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;First systematic study of token consumption in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified. Finds agentic tasks consume 1000x more tokens than chat/reasoning, usage is highly stochastic, models vary dramatically in efficiency, and LLMs cannot reliably predict their own costs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion</title>
      <link>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</link>
      <pubDate>Mon, 27 Apr 2026 08:05:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22261v1&#34;&gt;2604.22261&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22261v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Fahmida Alam, Mihai Surdeanu, Ellen Riloff&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, rag, reasoning, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG is a training-free, multi-stage RAG framework that injects relation paraphrases into retrieval, summarization, and generation to boost long-tail relation completion. It delivers +40.6 EM over standalone LLMs and +13–16 EM over strong RAG baselines.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22261/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs (with or without RAG) fail on rare/long-tail relations due to narrow lexical surface forms.&lt;/li&gt;&#xA;&lt;li&gt;Paraphrases of a relation can systematically broaden coverage across the RAG pipeline.&lt;/li&gt;&#xA;&lt;li&gt;No fine-tuning required — purely prompt- and retrieval-level intervention.&lt;/li&gt;&#xA;&lt;li&gt;Gains hold across five LLMs and two benchmark datasets.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG threads relation paraphrases through three stages:&lt;/p&gt;</description>
    </item>
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 08:04:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision-routing plugin for the OpenClaw agent system that dynamically assigns quantization precision per task, cutting cost up to 21.4% and latency 15.7% on GLM-5 (FP8 baseline) without degrading task quality.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Quantization sensitivity in agent workflows is highly &lt;strong&gt;task-dependent&lt;/strong&gt;, not uniform.&lt;/li&gt;&#xA;&lt;li&gt;Precision should be treated as a &lt;strong&gt;dynamic resource&lt;/strong&gt;, routed per request.&lt;/li&gt;&#xA;&lt;li&gt;A lightweight plugin can sit in front of OpenClaw without increasing user complexity.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22577/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 08:03:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;em&gt;Behavioral Canaries&lt;/em&gt;, an auditing mechanism that detects unauthorized use of protected retrieved documents in RL fine-tuning by planting document-triggered stylistic preferences and later probing for them.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22191/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Standard memorization/MIA audits fail against RLFT since RL shapes style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Inject &lt;em&gt;behavioral canaries&lt;/em&gt;: pair document triggers with preference data rewarding a distinctive style.&lt;/li&gt;&#xA;&lt;li&gt;If the provider trained on the protected corpus, the model exhibits a latent trigger-conditioned stylistic shift detectable by auditors.&lt;/li&gt;&#xA;&lt;li&gt;Reframes auditing from content leakage to &lt;em&gt;distributional behavioral change&lt;/em&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Auditors instrument a subset of retrieved documents by constructing preference pairs where the &amp;ldquo;chosen&amp;rdquo; response exhibits a distinctive stylistic pattern conditioned on a trigger drawn from the document. 
When an unscrupulous provider funnels this preference data into RLHF/DPO-style RLFT, the policy internalizes a trigger→style association. At audit time, the auditor issues probe queries containing the trigger and measures whether stylistic features appear at rates significantly above baseline, yielding a statistical detection test.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 07:58:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively modify global-router source code, with QoR feedback driving &amp;quot;design-adaptive&amp;quot; EDA: the algorithm itself specializes to the specific chip design rather than merely tuning hyperparameters.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22234/fig1.jpg&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Proposes a design-adaptive EDA paradigm: a tool&amp;rsquo;s internal algorithms automatically specialize to each design.&lt;/li&gt;&#xA;&lt;li&gt;Uses an LLM to evolve the global router&amp;rsquo;s source code, not just tune hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop with QoR metrics as the evolutionary feedback signal.&lt;/li&gt;&#xA;&lt;li&gt;Integrates a QoR evaluation toolchain on the OpenROAD infrastructure.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The LLM agent holds persistent contextual knowledge of an open-source global router and iteratively modifies its source code; each round runs detailed routing in OpenROAD to obtain QoR, which is fed back to the LLM to guide the next code change. In effect, the code-evolution and evaluation loop is packaged into an automated pipeline.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 07:57:20 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average (up to 2.42×) speedup over radix-select while preserving bit-exact outputs, yielding up to 7.52% end-to-end TPOT gains on DeepSeek-V3.2.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 07:56:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAL is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, and a validator checks them against true state and policy before any mutation. A prototype blocks 100% of unsafe intents with 12.4 ms median overhead.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22136/page1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Direct coupling of stochastic LLM outputs to execution APIs is an unsound safety model.&lt;/li&gt;&#xA;&lt;li&gt;Separate &lt;em&gt;intent emission&lt;/em&gt; (model) from &lt;em&gt;intent validation + execution&lt;/em&gt; (control plane).&lt;/li&gt;&#xA;&lt;li&gt;Add an &lt;strong&gt;obfuscation membrane&lt;/strong&gt; to hide identity-sensitive state from the model.&lt;/li&gt;&#xA;&lt;li&gt;Maintain a cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; for audit and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, replay determinism.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Models produce structured intents &lt;code&gt;(action, args, justification)&lt;/code&gt; rather than raw API calls. 
The control plane:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 07:55:24 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper hypothesizes that LLM personalization is driven by a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; — specific attention heads encoding user style/topic preferences. It introduces Differential Preference Steering (DPS), a training-free decoding method that identifies these heads via causal masking and amplifies their effect at inference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System</title>
      <link>https://ftxj.github.io/posts/2026-04-20/10-ares-adaptive-red-teaming-and-end-to-end-repair-of-policy-re/</link>
      <pubDate>Mon, 27 Apr 2026 05:28:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/10-ares-adaptive-red-teaming-and-end-to-end-repair-of-policy-re/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18789v1&#34;&gt;2604.18789&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18789v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, serving, fine-tun, rlhf&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;ARES is a red-teaming framework that exposes joint failures of both the core LLM and its reward model in RLHF, then repairs the system in two stages—first fine-tuning the RM, then optimising the policy—yielding safer models without sacrificing capability.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing</title>
      <link>https://ftxj.github.io/posts/2026-04-20/09-copy-as-decode-grammar-constrained-parallel-prefill-for-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:28:00 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/09-copy-as-decode-grammar-constrained-parallel-prefill-for-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18170v1&#34;&gt;2604.18170&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18170v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ziyang Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, kv cache, speculative decoding, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Copy-as-Decode reframes LLM text/code editing as grammar-constrained decoding over two primitives (&lt;code&gt;&amp;lt;copy&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;gen&amp;gt;&lt;/code&gt;), letting copy spans be filled via a single parallel-prefill forward instead of N autoregressive steps, yielding large theoretical speedups without end-to-end training.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Most edit outputs are verbatim copies of the input, so regenerating them autoregressively is wasteful.&lt;/li&gt;&#xA;&lt;li&gt;A two-primitive grammar (&lt;code&gt;&amp;lt;copy lines=&amp;quot;i-j&amp;quot;/&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;gen&amp;gt;...&amp;lt;/gen&amp;gt;&lt;/code&gt;) with a token-level FSM guarantees syntactic validity.&lt;/li&gt;&#xA;&lt;li&gt;Copy spans reuse the speculative-decoding parallel-forward kernel, but with input tokens as the &amp;ldquo;draft&amp;rdquo; and grammar-enforced (not probabilistic) acceptance.&lt;/li&gt;&#xA;&lt;li&gt;Paper gives an upper-bound analysis — no training required — separating kernel speedup, copy coverage ceiling, and pipeline losslessness.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;At decode time the model emits grammar tokens; a deterministic resolver expands 
&lt;code&gt;&amp;lt;copy&amp;gt;&lt;/code&gt; tags by issuing one parallel-prefill forward that updates the KV cache for the whole span, while &lt;code&gt;&amp;lt;gen&amp;gt;&lt;/code&gt; falls back to standard autoregressive decoding. An FSM enforces legal token transitions. Line-level and finer token-level primitives are both analyzed.&lt;/p&gt;</description>
    </item>
    <item>
      <title>River-LLM: Large Language Model Seamless Exit Based on KV Share</title>
      <link>https://ftxj.github.io/posts/2026-04-20/08-river-llm-large-language-model-seamless-exit-based-on-kv-sha/</link>
      <pubDate>Mon, 27 Apr 2026 05:27:28 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/08-river-llm-large-language-model-seamless-exit-based-on-kv-sha/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18396v1&#34;&gt;2604.18396&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18396v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yingtao Shen, An Zou&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, inference, kv cache, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;River-LLM is a training-free Early Exit framework for decoder-only LLMs that solves the KV Cache Absence problem via a lightweight KV-Shared Exit River, achieving 1.71–2.16× wall-clock speedup on reasoning and code tasks without quality loss.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Identifies &lt;strong&gt;KV Cache Absence&lt;/strong&gt; as the core bottleneck preventing Early Exit from delivering practical speedup in decoder-only LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Proposes a &lt;strong&gt;KV-Shared Exit River&lt;/strong&gt;: skipped layers still produce usable KV entries, avoiding recomputation or masking.&lt;/li&gt;&#xA;&lt;li&gt;Uses &lt;strong&gt;state transition similarity&lt;/strong&gt; across decoder blocks to predict cumulative KV errors and drive per-token exit decisions.&lt;/li&gt;&#xA;&lt;li&gt;Training-free — drops into existing models without fine-tuning.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;River-LLM adds a lightweight side path (&amp;ldquo;Exit River&amp;rdquo;) that shares/propagates KV states so that layers skipped by Early Exit still contribute KV cache entries consistent with the backbone. 
Exit decisions are made token-by-token using a predictor based on inter-block state transition similarity, estimating cumulative KV error and stopping when safe. No retraining is required.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM</title>
      <link>https://ftxj.github.io/posts/2026-04-20/07-unlocking-the-edge-deployment-and-ondevice-acceleration-of-m/</link>
      <pubDate>Mon, 27 Apr 2026 05:26:54 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/07-unlocking-the-edge-deployment-and-ondevice-acceleration-of-m/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18655v2&#34;&gt;2604.18655&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18655v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AI, cs.CL, cs.DC&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, quantization, speculative decoding, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A hardware-aware framework deploys a LLaMA-based multilingual foundation model on Samsung Galaxy S24/S25 phones, combining runtime multi-LoRA switching, multi-stream decoding, dynamic self-speculative decoding, and INT4 quantization to achieve 4-6x memory/latency improvements across 9 languages and 8 tasks.&lt;/p&gt;</description>
    </item>
    <item>
      <title>HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing</title>
      <link>https://ftxj.github.io/posts/2026-04-20/06-hybridgen-efficient-llm-generative-inference-via-cpu-gpu-hyb/</link>
      <pubDate>Mon, 27 Apr 2026 05:26:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/06-hybridgen-efficient-llm-generative-inference-via-cpu-gpu-hyb/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18529v1&#34;&gt;2604.18529&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18529v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.PF&lt;/code&gt; · all: cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, inference, kv cache, parallelism, attention, gpu, scheduler&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HybridGen is a CPU-GPU hybrid attention framework for long-context LLM inference that leverages CXL-expanded tiered memory. By coordinating attention computation across CPU and GPU, it outperforms six SOTA KV cache management methods by 1.41x-3.2x while preserving accuracy.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Existing KV cache pruning/offloading underutilizes hardware by computing attention on only one device.&lt;/li&gt;&#xA;&lt;li&gt;Tiered memory (e.g., CXL) expands CPU-local KV capacity but introduces NUMA penalties.&lt;/li&gt;&#xA;&lt;li&gt;Collaborative CPU-GPU attention needs new parallelism, scheduling, and data placement strategies.&lt;/li&gt;&#xA;&lt;li&gt;Three challenges: multi-dim attention dependencies, load imbalance with long sequences, NUMA penalty.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;HybridGen introduces three mechanisms:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Training and Agentic Inference Strategies for LLM-based Manim Animation Generation</title>
      <link>https://ftxj.github.io/posts/2026-04-20/05-training-and-agentic-inference-strategies-for-llm-based-mani/</link>
      <pubDate>Mon, 27 Apr 2026 05:25:48 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/05-training-and-agentic-inference-strategies-for-llm-based-mani/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18364v1&#34;&gt;2604.18364&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18364v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.GR, cs.MA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, inference, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ManimTrainer (SFT + GRPO with fused code/visual rewards) and ManimAgent (Renderer-in-the-loop inference with API-doc augmentation) for text-to-code-to-video Manim animation. A Qwen 3 Coder 30B variant hits 94% render success and 85.7% visual similarity, beating GPT-4.1.&lt;/p&gt;</description>
    </item>
    <item>
      <title>AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization</title>
      <link>https://ftxj.github.io/posts/2026-04-20/04-aqpim-breaking-the-pim-capacity-wall-for-llms-with-in-memory/</link>
      <pubDate>Mon, 27 Apr 2026 05:24:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/04-aqpim-breaking-the-pim-capacity-wall-for-llms-with-in-memory/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18137v1&#34;&gt;2604.18137&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18137v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AI, cs.AR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, kv cache, quantization, attention, transformer, gpu, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;AQPIM is a PIM-aware activation quantization framework that applies Product Quantization (PQ) directly inside memory to shrink KV-cache footprint and accelerate LLM attention, achieving 3.4× speedup over SOTA PIM baselines while slashing GPU-CPU communication overhead.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Activation (KV cache) memory, not just weights, is the real PIM capacity wall for long-context LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Clustering-based vector quantization (specifically PQ) aligns with activation statistics and PIM&amp;rsquo;s internal bandwidth.&lt;/li&gt;&#xA;&lt;li&gt;Quantization performed &lt;em&gt;inside&lt;/em&gt; memory enables direct compute on compressed data.&lt;/li&gt;&#xA;&lt;li&gt;Algorithmic tweaks restore PQ accuracy for modern LLMs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;AQPIM builds a PIM-specialized activation quantization pipeline around Product Quantization. Activations are split into sub-vectors, clustered, and stored as codebook indices directly in PIM banks. Attention computation then operates on the compressed representation, exploiting PIM&amp;rsquo;s high internal bandwidth. Several (unspecified) algorithmic optimizations mitigate PQ&amp;rsquo;s accuracy loss on LLM activations.&lt;/p&gt;</description>
    </item>
    <item>
      <title>StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning</title>
      <link>https://ftxj.github.io/posts/2026-04-20/03-steppo-step-aligned-policy-optimization-for-agentic-reinforc/</link>
      <pubDate>Mon, 27 Apr 2026 05:24:17 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/03-steppo-step-aligned-policy-optimization-for-agentic-reinforc/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18401v1&#34;&gt;2604.18401&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18401v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, tool use, reasoning, post-train, rlhf&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;StepPO argues that Agentic RL for LLMs should move from token-level to step-level MDPs, treating each agent step (not token) as the action unit and doing credit assignment at that granularity. The paper is a position piece with preliminary experiments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
      <link>https://ftxj.github.io/posts/2026-04-20/02-mass-rag-multi-agent-synthesis-retrieval-augmented-generatio/</link>
      <pubDate>Mon, 27 Apr 2026 05:23:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/02-mass-rag-multi-agent-synthesis-retrieval-augmented-generatio/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18509v2&#34;&gt;2604.18509&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18509v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, retrieval, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;MASS-RAG proposes a multi-agent collaborative retrieval-augmented generation framework that splits evidence processing across three role-specialized agents (summarization, extraction, reasoning), then merges their outputs in a synthesis stage to improve answer quality over noisy, heterogeneous contexts.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;A single generation pass struggles to reconcile noisy, incomplete, and heterogeneous retrieved evidence.&lt;/li&gt;&#xA;&lt;li&gt;RAG is decoupled into role-specialized agents: summarization, extraction, and reasoning.&lt;/li&gt;&#xA;&lt;li&gt;A dedicated synthesis stage fuses multi-perspective intermediate evidence before generating the final answer.&lt;/li&gt;&#xA;&lt;li&gt;Multiple intermediate evidence views aid comparison and integration of complementary information.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Architecture: retrieval → three specialized agents run in parallel (evidence summarization / evidence extraction / reasoning) → a synthesis agent aggregates the intermediate outputs → answer generation.&lt;/li&gt;&#xA;&lt;li&gt;Each agent produces intermediate representations of the same retrieved documents at a different granularity, exposing multiple evidence paths.&lt;/li&gt;&#xA;&lt;li&gt;The synthesis stage acts as an arbiter, comparing and integrating complementary or conflicting evidence.&lt;/li&gt;&#xA;&lt;li&gt;The abstract does not specify prompt templates, the inter-agent communication protocol, or the backbone models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;experiments&#34;&gt;Experiments&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Four RAG benchmarks (names not disclosed).&lt;/li&gt;&#xA;&lt;li&gt;Compared against strong RAG baselines (unnamed).&lt;/li&gt;&#xA;&lt;li&gt;Evaluation focus: performance when evidence is scattered across multiple retrieved contexts.&lt;/li&gt;&#xA;&lt;li&gt;The abstract omits dataset sizes, retriever settings, and evaluation metrics.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Claims to &amp;quot;consistently&amp;quot; outperform strong baselines on all four benchmarks.&lt;/li&gt;&#xA;&lt;li&gt;The advantage is more pronounced when evidence is dispersed across contexts.&lt;/li&gt;&#xA;&lt;li&gt;The abstract provides no concrete numbers, so the gains cannot be independently verified.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;why-it-matters&#34;&gt;Why It Matters&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Offers a composable agentic RAG pattern for noisy or long-tail retrieval results.&lt;/li&gt;&#xA;&lt;li&gt;Gives practitioners a template for explicit role division and an evidence-fusion layer in RAG pipelines.&lt;/li&gt;&#xA;&lt;li&gt;A useful reference for engineers building high-reliability knowledge QA and enterprise RAG systems.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;connections-to-prior-work&#34;&gt;Connections to Prior Work&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Self-RAG and Chain-of-Note: explicit evidence processing and annotation.&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent LLM collaboration (AutoGen, MetaGPT, Debate): role-specialized agent coordination.&lt;/li&gt;&#xA;&lt;li&gt;Robust RAG methods such as CRAG and RA-DIT: handling noisy or low-quality retrieval.&lt;/li&gt;&#xA;&lt;li&gt;Map-Reduce / hierarchical summarization for long contexts.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;What are the inference cost and latency of the multi-agent setup? Is it worth N times the tokens of a single call?&lt;/li&gt;&#xA;&lt;li&gt;Do the agents share one backbone LLM, and do they require dedicated fine-tuning?&lt;/li&gt;&#xA;&lt;li&gt;How does the synthesis stage resolve conflicting evidence between agents? Is there explicit voting or confidence weighting?&lt;/li&gt;&#xA;&lt;li&gt;How robust is it under adversarial or highly redundant retrieval?&lt;/li&gt;&#xA;&lt;li&gt;Does it retain an advantage over stronger single-model long-context reasoning (e.g., Gemini / Claude long windows)?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;figures&#34;&gt;Figures&lt;/h2&gt;&#xA;&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Figure 1 (extracted from PDF)&lt;/p&gt;</description>
    </item>
    <item>
      <title>First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-20/01-first-do-no-harm-with-llms-mitigating-racial-bias-via-agenti/</link>
      <pubDate>Mon, 27 Apr 2026 05:23:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-20/01-first-do-no-harm-with-llms-mitigating-racial-bias-via-agenti/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.18038v1&#34;&gt;2604.18038&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.18038v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sihao Xing, Zaur Gouliev&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CY&lt;/code&gt; · all: cs.AI, cs.CY&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, retrieval, reasoning, attention, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This study evaluates racial bias in five LLMs across synthetic patient-case generation and differential diagnosis tasks, finding all deviate from US epidemiological distributions. Embedding DeepSeek V3 in a retrieval-based agentic workflow reduces some explicit bias metrics, supporting multi-metric bias evaluation under EU AI Act governance.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
      <link>https://ftxj.github.io/posts/2026-04-21/10-cyber-defense-benchmark-agentic-threat-hunting-evaluation-fo/</link>
      <pubDate>Mon, 27 Apr 2026 05:22:40 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/10-cyber-defense-benchmark-agentic-threat-hunting-evaluation-fo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19533v3&#34;&gt;2604.19533&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19533v3&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Alankrit Chona, Igor Kozlov, Ambuj Kumar&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.AI, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Cyber Defense Benchmark evaluates LLM agents on open-ended threat hunting over raw Windows logs via iterative SQL queries. Across five frontier models, all fail dramatically — the best (Claude Opus 4.6) flags only 3.8% of malicious events, and none meet the &amp;gt;=50% per-tactic recall bar for unsupervised SOC deployment.&lt;/p&gt;</description>
    </item>
    <item>
      <title>TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only</title>
      <link>https://ftxj.github.io/posts/2026-04-21/09-trn-r1-zero-text-rich-network-reasoning-via-llms-with-reinfo/</link>
      <pubDate>Mon, 27 Apr 2026 05:21:53 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/09-trn-r1-zero-text-rich-network-reasoning-via-llms-with-reinfo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19070v1&#34;&gt;2604.19070&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19070v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yilun Liu, Ruihong Qiu, Zi Huang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, chain-of-thought, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;TRN-R1-Zero is a post-training framework that uses reinforcement learning alone to teach base LLMs to reason over text-rich networks, avoiding supervised fine-tuning or distillation while generalising across node, edge, and graph-level tasks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;RL-only post-training for text-rich network (TRN) reasoning — no SFT, no CoT distillation from larger teachers.&lt;/li&gt;&#xA;&lt;li&gt;Neighbour-aware Group Relative Policy Optimisation (N-GRPO) that shapes rewards via a novel &amp;ldquo;margin gain&amp;rdquo; metric measuring neighbour informativeness.&lt;/li&gt;&#xA;&lt;li&gt;Node-level training transfers zero-shot to edge- and graph-level tasks, beyond typical cross-domain transfer.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The authors extend GRPO with neighbourhood awareness: for each candidate response, rewards are dynamically adjusted by a margin gain metric capturing how much neighbouring node signals contribute to the correct answer, pushing the LLM to actually use relational context rather than text alone. Training runs only on node-level supervision signals via RL on base LLMs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Detoxification for LLM: From Dataset Itself</title>
      <link>https://ftxj.github.io/posts/2026-04-21/08-detoxification-for-llm-from-dataset-itself/</link>
      <pubDate>Mon, 27 Apr 2026 05:21:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/08-detoxification-for-llm-from-dataset-itself/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19124v1&#34;&gt;2604.19124&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19124v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, serving, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes HSPD, a pipeline that detoxifies LLM pretraining corpora at the source by rewriting toxic spans with a Soft Contrastive Decoding (SoCD) method, yielding a drop-in replacement dataset that cuts downstream model toxicity while preserving semantics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving</title>
      <link>https://ftxj.github.io/posts/2026-04-21/07-saw-int4-system-aware-4-bit-kv-cache-quantization-for-real-w/</link>
      <pubDate>Mon, 27 Apr 2026 05:20:49 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/07-saw-int4-system-aware-4-bit-kv-cache-quantization-for-real-w/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19157v1&#34;&gt;2604.19157&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19157v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, Xiaoxia Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, serving, kv-cache, quantization, attention, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAW-INT4 proposes token-wise INT4 KV-cache quantization with block-diagonal Hadamard rotation, the simplest scheme compatible with paged memory and fused attention in real LLM serving. A fused rotation-quantization kernel matches plain INT4 throughput while recovering nearly all accuracy lost to naive INT4.&lt;/p&gt;</description>
    </item>
    <item>
      <title>If you&#39;re waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-21/06-if-you-re-waiting-for-a-sign-that-might-not-be-it-mitigating/</link>
      <pubDate>Mon, 27 Apr 2026 05:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/06-if-you-re-waiting-for-a-sign-that-might-not-be-it-mitigating/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19844v1&#34;&gt;2604.19844&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19844v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, Hammond Pearce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CV&lt;/code&gt; · all: cs.AI, cs.CV&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, agentic, multi-agent, serving, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper identifies &amp;ldquo;trust boundary confusion&amp;rdquo; in Vision-Language Agentic Systems (VLAS), where agents fail to distinguish legitimate environmental signals (e.g., traffic lights) from adversarial visual injections. The authors propose a multi-agent defense that separates perception from decision-making, improving robustness while preserving responsiveness to genuine cues.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine</title>
      <link>https://ftxj.github.io/posts/2026-04-21/05-statistics-not-scale-modular-medical-dialogue-with-bayesian/</link>
      <pubDate>Mon, 27 Apr 2026 05:19:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/05-statistics-not-scale-modular-medical-dialogue-with-bayesian/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20022v1&#34;&gt;2604.20022&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20022v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;BMBE splits medical dialogue into an LLM &amp;ldquo;sensor&amp;rdquo; that parses utterances and a deterministic Bayesian engine that handles all diagnostic inference, yielding calibrated, private, and robust diagnosis that beats frontier standalone LLMs at a fraction of the cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
      <link>https://ftxj.github.io/posts/2026-04-21/04-a-mar-agent-based-multimodal-art-retrieval-for-fine-grained/</link>
      <pubDate>Mon, 27 Apr 2026 05:19:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/04-a-mar-agent-based-multimodal-art-retrieval-for-fine-grained/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19689v1&#34;&gt;2604.19689&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19689v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, reasoning, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A-MAR is an agent-based multimodal retrieval framework that decomposes artwork queries into structured reasoning plans, then conditions retrieval on each step to produce grounded, interpretable explanations. It outperforms static retrieval and MLLM baselines on SemArt, Artpedia, and a new ArtCoT-QA benchmark.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms</title>
      <link>https://ftxj.github.io/posts/2026-04-21/03-rethinking-scale-deployment-trade-offs-of-small-language-mod/</link>
      <pubDate>Mon, 27 Apr 2026 05:18:43 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/03-rethinking-scale-deployment-trade-offs-of-small-language-mod/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19299v1&#34;&gt;2604.19299&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19299v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xinlin Wang, Mats Brorsson&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, multi-agent, tool use, reasoning, latency, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper presents the first large-scale empirical study of sub-10B open-source SLMs across three deployment paradigms—base, single-agent with tools, and multi-agent collaboration—finding that single-agent systems offer the best cost/performance balance while multi-agent setups add overhead with limited gains.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;SLMs (&amp;lt;10B params) are viable LLM alternatives if their weaknesses are compensated by agent paradigms rather than pure scaling or fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Tool-augmented single agents systematically outperform base SLMs at modest extra cost.&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent collaboration yields diminishing returns relative to its computational overhead.&lt;/li&gt;&#xA;&lt;li&gt;Deployment efficiency is a first-class design criterion for trustworthy SLM systems.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The authors benchmark open-source SLMs under three paradigms: (1) bare base model, (2) a single agent equipped with external tools, and (3) a multi-agent collaborative system. They compare performance and cost across these configurations, though the abstract does not specify which tools, orchestration framework, or agent protocols are used.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-21/02-grasprune-global-gating-for-budgeted-structured-pruning-of-l/</link>
      <pubDate>Mon, 27 Apr 2026 05:18:10 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/02-grasprune-global-gating-for-budgeted-structured-pruning-of-l/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19398v1&#34;&gt;2604.19398&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19398v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ziyang Wang, Jiangfeng Xiao, Chuan Xiao, Ruoxiang Li, Rui Mao, Jianbin Qin&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, kv cache, attention, gpu, latency, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GRASPrune is a post-pretraining structured pruning framework that jointly prunes FFN channels and KV head groups under a single global budget using projected straight-through gate learning, producing a smaller dense checkpoint without fine-tuning the backbone.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration</title>
      <link>https://ftxj.github.io/posts/2026-04-21/01-chipcraftbrain-validation-first-rtl-generation-via-multi-age/</link>
      <pubDate>Mon, 27 Apr 2026 05:17:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/01-chipcraftbrain-validation-first-rtl-generation-via-multi-age/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19856v1&#34;&gt;2604.19856&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19856v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Cagri Eryilmaz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AI, cs.AR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, multi-agent, retrieval, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA&amp;rsquo;s ACE-RTL.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Adaptive orchestration of six specialized agents via a PPO policy over a 168-dim state (with an MPC world-model alternative).&lt;/li&gt;&#xA;&lt;li&gt;Hybrid symbolic-neural architecture: algorithmic solvers for K-maps/truth tables, neural agents for waveforms and general RTL.&lt;/li&gt;&#xA;&lt;li&gt;Knowledge-augmented retrieval from 321 patterns + 971 open-source reference implementations with focus-aware lookup.&lt;/li&gt;&#xA;&lt;li&gt;Hierarchical spec decomposition into dependency-ordered sub-modules with interface synchronization.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A controller learns (via PPO) to route tasks among six agents depending on problem state. Symbolic solvers handle combinational logic exactly; neural agents handle timing/waveforms. A retrieval module injects reference patterns. Complex specs are decomposed hierarchically with cross-module interface synchronization before code generation and validation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-22/10-co-evolving-llm-decision-and-skill-bank-agents-for-long-hori/</link>
      <pubDate>Mon, 27 Apr 2026 05:17:00 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/10-co-evolving-llm-decision-and-skill-bank-agents-for-long-hori/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20987v1&#34;&gt;2604.20987&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20987v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesh Manocha&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;COSPLAY is a co-evolution framework pairing an LLM decision agent with a learnable skill bank: the decision agent retrieves skills to act, while a skill-pipeline agent mines reusable skills from unlabeled rollouts. An 8B model beats four frontier LLM baselines by &amp;gt;25% average reward on six game environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction</title>
      <link>https://ftxj.github.io/posts/2026-04-22/09-agentic-ai-for-personalized-physiotherapy-a-multi-agent-fram/</link>
      <pubDate>Mon, 27 Apr 2026 05:16:25 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/09-agentic-ai-for-personalized-physiotherapy-a-multi-agent-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21154v1&#34;&gt;2604.21154&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21154v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, multi-agent, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Proposes a four-agent system that parses clinical notes, generates patient-specific exercise videos, tracks poses in real time, and delivers corrective feedback for at-home physiotherapy. The paper is largely architectural, presenting a prototype and evaluation plan rather than clinical results.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Tele-rehabilitation gap stems from static video libraries and generic avatars ignoring patient-specific constraints.&lt;/li&gt;&#xA;&lt;li&gt;A Multi-Agent System (MAS) can close the loop by combining generative video, pose estimation, and autonomous feedback.&lt;/li&gt;&#xA;&lt;li&gt;Four specialized micro-agents cover extraction, synthesis, vision, and diagnostics.&lt;/li&gt;&#xA;&lt;li&gt;Unstructured clinical notes can be turned into kinematic constraints that condition downstream generation.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Four micro-agents pipeline:&lt;/p&gt;</description>
    </item>
    <item>
      <title>EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation</title>
      <link>https://ftxj.github.io/posts/2026-04-22/08-evoagent-an-evolvable-agent-framework-with-skill-learning-an/</link>
      <pubDate>Mon, 27 Apr 2026 05:15:48 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/08-evoagent-an-evolvable-agent-framework-with-skill-learning-an/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20133v2&#34;&gt;2604.20133&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20133v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang, Fangzheng Li&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;EvoAgent is an evolvable LLM agent framework combining structured skill learning, hierarchical sub-agent delegation, and a three-layer memory. On real-world foreign-trade tasks with GPT5.2, it lifts a five-dimensional LLM-as-Judge score by ~28%.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Skills modeled as multi-file structured capability units with triggers and evolutionary metadata.&lt;/li&gt;&#xA;&lt;li&gt;User-feedback-driven closed loop for continuous skill generation and optimization.&lt;/li&gt;&#xA;&lt;li&gt;Three-stage skill matching plus three-layer memory architecture for long-term accumulation.&lt;/li&gt;&#xA;&lt;li&gt;Hierarchical sub-agent delegation enabling dynamic task decomposition.&lt;/li&gt;&#xA;&lt;li&gt;Agent performance depends on model–architecture synergy, not just base model strength.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Each skill is a structured artifact (multiple files) carrying triggering logic and evolutionary metadata, so the system can decide when to invoke it and how to mutate it over time. A three-stage matcher selects skills for an incoming task; a three-layer memory separates short-term, working, and long-term context. A hierarchical delegation mechanism spawns sub-agents for decomposed subtasks, and a user-feedback closed loop drives skill creation and refinement.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving</title>
      <link>https://ftxj.github.io/posts/2026-04-22/07-dual-cluster-memory-agent-resolving-multi-paradigm-ambiguity/</link>
      <pubDate>Mon, 27 Apr 2026 05:14:56 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/07-dual-cluster-memory-agent-resolving-multi-paradigm-ambiguity/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20183v1&#34;&gt;2604.20183&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20183v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;DCM-Agent is a training-free framework that resolves structural ambiguity in LLM-based optimization problem solving by maintaining dual clusters of historical solutions (modeling + coding), distilled into Approach/Checklist/Pitfall knowledge, and using them for memory-augmented inference.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Optimization problems suffer from multi-paradigm ambiguity that confuses LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Split memory into two clusters: modeling and coding.&lt;/li&gt;&#xA;&lt;li&gt;Distill each cluster into three structured knowledge types: Approach, Checklist, Pitfall.&lt;/li&gt;&#xA;&lt;li&gt;Use memory at inference for path navigation, error repair, and adaptive switching.&lt;/li&gt;&#xA;&lt;li&gt;Observed &amp;ldquo;knowledge inheritance&amp;rdquo;: memory from larger models lifts smaller models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The Dual-Cluster Memory Construction step routes prior solutions into modeling vs. coding clusters, then distills generalizable guidance into structured Approach / Checklist / Pitfall entries. At inference, the agent retrieves relevant memory to pick a reasoning path, detects and repairs errors, and adaptively switches paradigms. The entire pipeline is training-free, relying on prompting plus a structured memory bank.&lt;/p&gt;</description>
    </item>
    <item>
      <title>FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving</title>
      <link>https://ftxj.github.io/posts/2026-04-22/06-faser-fine-grained-phase-management-for-speculative-decoding/</link>
      <pubDate>Mon, 27 Apr 2026 05:14:26 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/06-faser-fine-grained-phase-management-for-speculative-decoding/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20503v1&#34;&gt;2604.20503&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20503v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Wenyan Chen, Chengzhi Lu, Yanying Lin, Dmitrii Ustiugov&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.DC&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, speculative decoding, gpu, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;FASER is a fine-grained speculative-decoding scheduler for dynamic LLM serving that tunes speculative length per request, prunes rejected tokens early, and spatially overlaps draft and verification phases, yielding up to 53% higher throughput and 1.92× lower latency over SOTA in vLLM.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Coarse-grained, batch-level speculative decoding (SD) wastes GPU cycles under both low and high load.&lt;/li&gt;&#xA;&lt;li&gt;Speculative length should be a per-request knob inside a continuous batch, not a global constant.&lt;/li&gt;&#xA;&lt;li&gt;Verification can be chunked into &amp;ldquo;frontiers&amp;rdquo; and overlapped with drafting via spatial multiplexing.&lt;/li&gt;&#xA;&lt;li&gt;Rejected tokens can be pruned mid-verification to avoid wasted compute.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;FASER extends vLLM with three mechanisms: (1) dynamic per-request speculative length based on acceptance behavior within a continuous batch; (2) early pruning that terminates verification for tokens already rejected, reclaiming GPU work; (3) frontier-based verification that splits the verify pass into chunks and co-executes them with draft kernels using fine-grained spatial multiplexing for low interference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-22/05-cooperative-profiles-predict-multi-agent-llm-team-performanc/</link>
      <pubDate>Mon, 27 Apr 2026 05:13:53 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/05-cooperative-profiles-predict-multi-agent-llm-team-performanc/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20658v1&#34;&gt;2604.20658&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20658v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Shivani Kumar, Adarsh Bharathwaj, David Jurgens&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, reasoning, gpu&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The authors benchmark 35 open-weight LLMs on six behavioral-economics games and show that the resulting &amp;ldquo;cooperative profiles&amp;rdquo; predict downstream team performance in AI-for-Science workflows under shared budget constraints, offering a cheap diagnostic for multi-agent deployment.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Cooperative disposition is a distinct, measurable LLM property, not reducible to general capability.&lt;/li&gt;&#xA;&lt;li&gt;Behavioral-economics games isolate cooperation mechanisms that transfer to realistic multi-agent science tasks.&lt;/li&gt;&#xA;&lt;li&gt;Models favoring multiplicative team production over greedy strategies yield better scientific reports.&lt;/li&gt;&#xA;&lt;li&gt;Game-based screening can precede expensive multi-agent rollouts.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Evaluate 35 open-weight LLMs across six behavioral-economics games targeting distinct cooperation mechanisms (coordination, investment, resource sharing).&lt;/li&gt;&#xA;&lt;li&gt;Derive per-model &amp;ldquo;cooperative profiles&amp;rdquo; from game behavior.&lt;/li&gt;&#xA;&lt;li&gt;Deploy LLM teams in an AI-for-Science pipeline: collaboratively analyze data, build models, and write scientific reports under shared budgets (e.g., GPU/credit caps).&lt;/li&gt;&#xA;&lt;li&gt;Regress downstream outcomes on cooperative profile features while controlling for confounds (likely model size, general ability benchmarks).&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;experiments&#34;&gt;Experiments&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Models: 35 open-weight LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Games: six behavioral-economics tasks (not specified in the abstract; likely public-goods, trust, and coordination variants).&lt;/li&gt;&#xA;&lt;li&gt;Downstream task: multi-agent AI-for-Science workflow with shared constraints.&lt;/li&gt;&#xA;&lt;li&gt;Metrics: report accuracy, quality, and completion.&lt;/li&gt;&#xA;&lt;li&gt;Baselines / controls: general-ability factors partialled out.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Cooperative profiles robustly predict downstream accuracy, quality, and completion.&lt;/li&gt;&#xA;&lt;li&gt;Effect persists after controlling for multiple confounding factors.&lt;/li&gt;&#xA;&lt;li&gt;Headline numerical effect sizes not given in the abstract.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;why-it-matters&#34;&gt;Why It Matters&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Provides a fast, inexpensive screening tool for multi-agent LLM deployments where coordination and budget-sharing matter.&lt;/li&gt;&#xA;&lt;li&gt;Reframes multi-agent selection beyond raw benchmark scores toward cooperative disposition.&lt;/li&gt;&#xA;&lt;li&gt;Useful for agent/infra teams building scientific, engineering, or tool-using LLM collectives.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;connections-to-prior-work&#34;&gt;Connections to Prior Work&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Behavioral-economics probes of LLMs (trust games, ultimatum, public-goods studies).&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent LLM frameworks (AutoGen, MetaGPT, ChatDev, AI-Scientist).&lt;/li&gt;&#xA;&lt;li&gt;Work on LLM &amp;ldquo;personality&amp;rdquo; / social-preference elicitation.&lt;/li&gt;&#xA;&lt;li&gt;Emergent cooperation and game-theoretic evaluations in RL agents.&lt;/li&gt;&#xA;&lt;li&gt;Scientific-writing and data-analysis agent benchmarks.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Which specific games carry the most predictive signal, and do they generalize beyond AI-for-Science?&lt;/li&gt;&#xA;&lt;li&gt;Do cooperative profiles stay stable under prompting, fine-tuning, or RLHF interventions?&lt;/li&gt;&#xA;&lt;li&gt;Are closed-weight frontier models (GPT-4.x, Claude, Gemini) consistent with the 35-model findings?&lt;/li&gt;&#xA;&lt;li&gt;Can cooperative disposition be deliberately trained or aligned, and at what cost to single-agent capability?&lt;/li&gt;&#xA;&lt;li&gt;How do heterogeneous teams (mixing cooperators and defectors) behave versus homogeneous ones?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;figures&#34;&gt;Figures&lt;/h2&gt;&#xA;&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Page 2 (rendered)&lt;/p&gt;</description>
    </item>
    <item>
      <title>Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models</title>
      <link>https://ftxj.github.io/posts/2026-04-22/04-breaking-mcp-with-function-hijacking-attacks-novel-threats-f/</link>
      <pubDate>Mon, 27 Apr 2026 05:13:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/04-breaking-mcp-with-function-hijacking-attacks-novel-threats-f/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20994v1&#34;&gt;2604.20994&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20994v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, John D. Kelleher&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.AI, cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, attention&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper introduces Function Hijacking Attacks (FHA), a novel adversarial technique that manipulates agentic LLMs&amp;rsquo; tool selection to force invocation of attacker-chosen functions, achieving 70-100% attack success rates across five models on the BFCL benchmark, largely independent of query semantics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-22/03-automatic-ontology-construction-using-llms-as-an-external-la/</link>
      <pubDate>Mon, 27 Apr 2026 05:12:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/03-automatic-ontology-construction-using-llms-as-an-external-la/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20795v1&#34;&gt;2604.20795&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20795v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Pavel Salovskii, Iuliia Gorshkova&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a hybrid architecture augmenting LLMs with an external RDF/OWL ontological memory layer, automatically constructed from heterogeneous sources, to enable persistent, verifiable, and semantically grounded reasoning beyond vector-based RAG.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs suffer from weak long-term memory, poor structure, and unreliable multi-step reasoning.&lt;/li&gt;&#xA;&lt;li&gt;An external ontology (RDF/OWL knowledge graph) acts as verifiable memory and planning substrate.&lt;/li&gt;&#xA;&lt;li&gt;Automated pipeline builds and maintains the ontology from documents, APIs, and dialogue logs.&lt;/li&gt;&#xA;&lt;li&gt;SHACL/OWL constraints turn inference into a generation–verification–correction loop.&lt;/li&gt;&#xA;&lt;li&gt;Hybrid inference combines vector retrieval, graph reasoning, and external tool calls.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The pipeline extracts entities and relations from heterogeneous inputs, normalizes them, and generates RDF triples. Triples are validated against SHACL shapes and OWL axioms, then merged into a continuously updated knowledge graph. At inference time, the LLM conditions on a composite context fusing vector-retrieved passages, graph subqueries, and tool outputs. Generated answers are checked against ontology constraints; violations trigger correction, yielding a closed verify-and-repair loop.&lt;/p&gt;</description>
    </item>
    <item>
      <title>HaS: Accelerating RAG through Homology-Aware Speculative Retrieval</title>
      <link>https://ftxj.github.io/posts/2026-04-22/02-has-accelerating-rag-through-homology-aware-speculative-retr/</link>
      <pubDate>Mon, 27 Apr 2026 05:12:02 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/02-has-accelerating-rag-through-homology-aware-speculative-retr/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20452v1&#34;&gt;2604.20452&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20452v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.CL, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, retrieval, rag, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HaS accelerates Retrieval-Augmented Generation by speculatively retrieving from a restricted scope, then validating candidates via &amp;ldquo;homologous query re-identification&amp;rdquo; — checking whether the incoming query matches a previously-seen one. This bypasses full-database search for repeat-like queries, cutting latency 24–37% with 1–2% accuracy loss.&lt;/p&gt;</description>
    </item>
    <item>
      <title>SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition</title>
      <link>https://ftxj.github.io/posts/2026-04-22/01-sake-self-aware-knowledge-exploitation-exploration-for-groun/</link>
      <pubDate>Mon, 27 Apr 2026 05:11:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/01-sake-self-aware-knowledge-exploitation-exploration-for-groun/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20146v1&#34;&gt;2604.20146&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20146v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.CL, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, tool-use, retrieval, reasoning, chain-of-thought, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAKE is an end-to-end agentic framework for Grounded Multimodal Named Entity Recognition (GMNER) that blends internal MLLM knowledge with external retrieval via self-aware reasoning, deciding when to invoke search tools to handle long-tailed and unseen entities on social media.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation</title>
      <link>https://ftxj.github.io/posts/2026-04-23/10-enhancing-online-recruitment-with-category-aware-moe-and-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:10:58 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/10-enhancing-online-recruitment-with-category-aware-moe-and-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21264v1&#34;&gt;2604.21264&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21264v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Minping Chen, Bing Xu, Zulong Chen, Chuanfei Xu, Ying Zhou, Zui Tao, Zeyi Wen&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, chain-of-thought, mixture of experts, moe&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes an LLM-enhanced Person-Job Fit (PJF) system combining chain-of-thought data augmentation for low-quality job descriptions with a category-aware Mixture of Experts module to better distinguish similar candidate-job pairs, yielding measurable gains in offline metrics and online A/B tests.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-23/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 05:10:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that uses sensitivity analysis to selectively apply softmax, linear sliding window, or no attention per layer, recovered via a lightweight 10M-token distillation. It improves throughput by up to 68% at high concurrency while preserving quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-23/08-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 05:09:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/08-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a lightweight framework combining RAG with LLM representation modeling for scalable patient-trial matching, matching the performance of end-to-end LLMs on multiple public and real-world clinical datasets at a substantially lower computational cost.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Decouple RAG from LLM representation: RAG selects relevant snippets from long EHRs, while the LLM encodes them.&lt;/li&gt;&#xA;&lt;li&gt;Introduce dimensionality reduction and a lightweight classifier for efficient downstream classification.&lt;/li&gt;&#xA;&lt;li&gt;A frozen LLM suffices for structured data, but unstructured clinical narratives require fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Scalability is validated on public benchmarks and a real-world multimodal Mayo Clinic dataset.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The pipeline has two stages: (1) RAG retrieves clinical snippets relevant to trial eligibility criteria from long EHRs, reducing input length; (2) the LLM encodes these snippets into representations, which, after dimensionality reduction, feed a lightweight predictor (e.g., a linear or shallow model) for matching classification. A frozen LLM handles structured fields, while the model is fine-tuned for free-text narratives.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-23/07-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 05:09:07 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/07-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper introduces ESRRSim, a taxonomy-driven agentic framework for evaluating Emergent Strategic Reasoning Risks (ESRRs) in LLMs—behaviors like deception, evaluation gaming, and reward hacking. Across 11 reasoning LLMs, detection rates vary from 14.45% to 72.72%.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-23/06-trust-but-verify-introducing-davinci-a-framework-for-dual-at/</link>
      <pubDate>Mon, 27 Apr 2026 05:08:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/06-trust-but-verify-introducing-davinci-a-framework-for-dual-at/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21193v1&#34;&gt;2604.21193&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21193v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Vipula Rawte, Ryan Rossi, Franck Dernoncourt, Nedim Lipka&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, inference, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;DAVinCI is a two-stage framework that combines claim attribution (to internal model components and external sources) with entailment-based verification and confidence calibration, improving factual reliability of LLM outputs by 5–20% over verification-only baselines on FEVER and CLIMATE-FEVER.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Dual approach: pair &lt;strong&gt;attribution&lt;/strong&gt; with &lt;strong&gt;verification&lt;/strong&gt; rather than treating them independently.&lt;/li&gt;&#xA;&lt;li&gt;Attribute claims both to internal LLM components and external retrieved sources.&lt;/li&gt;&#xA;&lt;li&gt;Use entailment reasoning plus confidence recalibration for claim checking.&lt;/li&gt;&#xA;&lt;li&gt;Release a modular implementation pluggable into existing LLM pipelines.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;DAVinCI runs in two stages. Stage 1 attributes each generated claim to (a) internal model components and (b) external evidence sources. Stage 2 verifies each claim via entailment-based reasoning, then recalibrates confidence scores. The abstract does not specify the exact attribution mechanism (e.g., attention tracing, gradient-based, or retrieval citation) or which entailment model is used.&lt;/p&gt;</description>
    </item>
    <item>
      <title>MambaCSP: Hybrid-Attention State Space Models for Hardware-Efficient Channel State Prediction</title>
      <link>https://ftxj.github.io/posts/2026-04-23/05-mambacsp-hybrid-attention-state-space-models-for-hardware-ef/</link>
      <pubDate>Mon, 27 Apr 2026 05:08:03 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/05-mambacsp-hybrid-attention-state-space-models-for-hardware-ef/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21957v1&#34;&gt;2604.21957&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21957v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Aladin Djuhera, Haris Gacanin, Holger Boche&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IT&lt;/code&gt; · all: cs.AI, cs.IT, cs.LG, eess.SP&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;MambaCSP replaces Transformer/LLM backbones for channel state prediction with a hybrid Mamba SSM augmented by lightweight patch-mixer attention, achieving 9–12% accuracy gains and up to 3× throughput over LLM baselines in MISO-OFDM simulations.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Pure attention-based CSP suffers from quadratic sequence cost, limiting real-time wireless use.&lt;/li&gt;&#xA;&lt;li&gt;Selective SSMs (Mamba) offer linear-time alternatives but lack long-range cross-token mixing.&lt;/li&gt;&#xA;&lt;li&gt;Hybrid design: a Mamba backbone plus periodic patch-mixer attention layers recovers global context cheaply.&lt;/li&gt;&#xA;&lt;li&gt;Hardware efficiency (VRAM, latency, throughput) is treated as a first-class objective alongside accuracy.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;MambaCSP swaps the LLM prediction backbone for a linear-time Mamba selective SSM operating on CSI sequences. Because pure SSMs capture mostly local dependencies, the authors periodically insert lightweight &amp;ldquo;patch-mixer&amp;rdquo; attention layers that inject cross-token interactions across patched CSI tokens. The architecture thus alternates SSM blocks (cheap sequential mixing) with sparse attention (global context), targeting MISO-OFDM channel prediction.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation</title>
      <link>https://ftxj.github.io/posts/2026-04-23/04-pre-trained-llms-meet-sequential-recommenders-efficient-user/</link>
      <pubDate>Mon, 27 Apr 2026 05:07:25 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/04-pre-trained-llms-meet-sequential-recommenders-efficient-user/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21536v1&#34;&gt;2604.21536&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21536v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Nikita Severin, Danil Kartushov, Vladislav Urzhumov, Vladislav Kulikov, Oksana Konovalova, Alexey Grishanov, Anton Klenitskiy, Artem Fatkulin, Alexey Vasilev, Andrey Savchenko, Ilya Makarov&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.AI, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, inference, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a knowledge distillation method that transfers LLM-generated textual user profiles into sequential recommender systems, enhancing user semantic understanding without incurring LLM inference costs at serving time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-23/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 05:06:52 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a memory layer for long-horizon LLM agents that replaces knowledge-graph pipelines with a typed semantic schema plus an information-theoretic retrieval engine, hitting 89.8% on LongMemEval and 87.1% on LoCoMo with single-query retrieval and no ingestion cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-23/02-tool-attention-is-all-you-need-dynamic-tool-gating-and-lazy/</link>
      <pubDate>Mon, 27 Apr 2026 05:06:21 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/02-tool-attention-is-all-you-need-dynamic-tool-gating-and-lazy/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21816v1&#34;&gt;2604.21816&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21816v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Anuj Sadani, Deepak Kumar&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Tool Attention is a middleware layer that replaces MCP&amp;rsquo;s eager schema injection with intent-gated, lazy schema loading — cutting per-turn tool tokens by 95% in simulation and arguing that protocol efficiency, not context length, is the real bottleneck for scalable agentic systems.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The &amp;ldquo;MCP Tax&amp;rdquo; (10k–60k tokens/turn) inflates KV cache and pushes context past known reasoning-degradation thresholds (~70%).&lt;/li&gt;&#xA;&lt;li&gt;Generalize self-attention into &lt;em&gt;attention over tools&lt;/em&gt;: score, gate, then selectively expose schemas.&lt;/li&gt;&#xA;&lt;li&gt;Protocol-level efficiency is a tighter constraint than raw context window size.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A middleware sitting between agent and MCP servers with three components:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-23/01-nemobot-games-crafting-strategic-ai-gaming-agents-for-intera/</link>
      <pubDate>Mon, 27 Apr 2026 05:05:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-23/01-nemobot-games-crafting-strategic-ai-gaming-agents-for-intera/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21896v1&#34;&gt;2604.21896&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21896v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chee Wei Tan, Yuchen Wang, Shangxin Guo&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag, reasoning, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Nemobot is an interactive agentic environment that uses LLMs to build and deploy game-playing agents across Shannon&amp;rsquo;s taxonomy, spanning dictionary-based, solvable, heuristic, and learning-based games, aiming toward self-programming AI.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Extends Shannon&amp;rsquo;s 1950 taxonomy of game-playing machines into an LLM era paradigm.&lt;/li&gt;&#xA;&lt;li&gt;Four game classes handled distinctly: dictionary, solvable, heuristic, learning-based.&lt;/li&gt;&#xA;&lt;li&gt;Agents combine minimax, crowd-sourced data, RLHF, and self-critique.&lt;/li&gt;&#xA;&lt;li&gt;Programmable environment for tool-augmented generation and fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Positions user-in-the-loop customization as a route to self-programming.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A chatbot-driven agentic engine routes game tasks by class: compressed state-action mappings for dictionary games; exact mathematical reasoning with human-readable explanations for solvable games; hybrid minimax-plus-crowd heuristics for heuristic games; RLHF with self-critique and imitation learning for learning-based games. Nemobot exposes these as programmable, tool-augmented workflows users can customize and fine-tune.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 05:02:30 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average kernel speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;, an auditing technique for detecting unauthorized use of protected retrieved documents in RL fine-tuning (RLFT) pipelines. Unlike memorization-based audits, it plants trigger-conditioned stylistic preferences that surface as behavioral shifts, achieving 67% detection at 10% FPR (AUROC 0.756) with only 1% canary injection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively evolve global router source code, specializing EDA algorithms per-design via QoR feedback within OpenROAD, achieving up to 8.72% post-detailed-routing wirelength reduction over baselines.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &amp;ldquo;design-adaptive EDA tooling&amp;rdquo;: algorithms themselves adapt to each design, not just hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Uses LLM-driven code evolution on global router source code.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop with QoR-driven feedback from OpenROAD toolchain.&lt;/li&gt;&#xA;&lt;li&gt;Equips the LLM with persistent contextual knowledge about open-source routers.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code evolution framework wrapping an agentic LLM around an open-source global router. The LLM iteratively edits the router&amp;rsquo;s source code; each candidate is compiled and evaluated through an integrated OpenROAD QoR pipeline. Persistent context about router internals grounds the LLM, and QoR metrics (notably post-detailed-routing wirelength) steer subsequent mutations.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
