<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-24 on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/categories/2026-04-24/</link>
    <description>Recent content in 2026-04-24 on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 08:08:57 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/categories/2026-04-24/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Large Language Models Decide Early and Explain Later</title>
      <link>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</link>
      <pubDate>Mon, 27 Apr 2026 08:08:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/10-large-language-models-decide-early-and-explain-later/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22266v1&#34;&gt;2604.22266&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22266v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, rag, reasoning, chain-of-thought, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Studying Qwen3-4B, the authors show LLMs often lock in their answer partway through chain-of-thought reasoning and spend hundreds of tokens explaining post-hoc; simple early-stopping heuristics cut ~500 tokens per query for only a 2% accuracy loss.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22266/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond</title>
      <link>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</link>
      <pubDate>Mon, 27 Apr 2026 08:07:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/09-agentic-world-modeling-foundations-capabilities-laws-and-bey/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22748v1&#34;&gt;2604.22748&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22748v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia&lt;/p&gt;</description>
    </item>
    <item>
      <title>How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</link>
      <pubDate>Mon, 27 Apr 2026 08:06:58 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/08-how-do-ai-agents-spend-your-money-analyzing-and-predicting-t/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22750v1&#34;&gt;2604.22750&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22750v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.CY, cs.HC, cs.SE&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;First systematic study of token consumption in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified. Finds agentic tasks consume 1000x more tokens than chat/reasoning, usage is highly stochastic, models vary dramatically in efficiency, and LLMs cannot reliably predict their own costs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion</title>
      <link>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</link>
      <pubDate>Mon, 27 Apr 2026 08:05:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/07-bridging-the-long-tail-gap-robust-retrieval-augmented-relati/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22261v1&#34;&gt;2604.22261&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22261v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Fahmida Alam, Mihai Surdeanu, Ellen Riloff&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, rag, reasoning, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG is a training-free, multi-stage RAG framework that injects relation paraphrases into retrieval, summarization, and generation to boost long-tail relation completion. It delivers +40.6 EM over standalone LLMs and +13–16 EM over strong RAG baselines.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22261/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs (with or without RAG) fail on rare/long-tail relations due to narrow lexical surface forms.&lt;/li&gt;&#xA;&lt;li&gt;Paraphrases of a relation can systematically broaden coverage across the RAG pipeline.&lt;/li&gt;&#xA;&lt;li&gt;No fine-tuning required — purely prompt- and retrieval-level intervention.&lt;/li&gt;&#xA;&lt;li&gt;Gains hold across five LLMs and two benchmark datasets.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;RC-RAG threads relation paraphrases through three stages:&lt;/p&gt;</description>
    </item>
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 08:04:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/06-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision-routing plugin for the OpenClaw agent system that dynamically assigns quantization precision per task, cutting cost by up to 21.4% and latency by 15.7% on GLM-5 (FP8 baseline) without degrading task quality.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Quantization sensitivity in agent workflows is highly &lt;strong&gt;task-dependent&lt;/strong&gt;, not uniform.&lt;/li&gt;&#xA;&lt;li&gt;Precision should be treated as a &lt;strong&gt;dynamic resource&lt;/strong&gt;, routed per request.&lt;/li&gt;&#xA;&lt;li&gt;A lightweight plugin can sit in front of OpenClaw without increasing user complexity.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22577/fig1.png&#34;&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 08:03:42 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;em&gt;Behavioral Canaries&lt;/em&gt;, an auditing mechanism that detects unauthorized use of protected retrieved documents in RL fine-tuning by planting document-triggered stylistic preferences and later probing for them.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22191/fig1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Standard memorization/MIA audits fail against RLFT since RL shapes style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Inject &lt;em&gt;behavioral canaries&lt;/em&gt;: pair document triggers with preference data rewarding a distinctive style.&lt;/li&gt;&#xA;&lt;li&gt;If the provider trained on the protected corpus, the model exhibits a latent trigger-conditioned stylistic shift detectable by auditors.&lt;/li&gt;&#xA;&lt;li&gt;Reframes auditing from content leakage to &lt;em&gt;distributional behavioral change&lt;/em&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Auditors instrument a subset of retrieved documents by constructing preference pairs where the &amp;ldquo;chosen&amp;rdquo; response exhibits a distinctive stylistic pattern conditioned on a trigger drawn from the document. 
When an unscrupulous provider funnels this preference data into RLHF/DPO-style RLFT, the policy internalizes a trigger→style association. At audit time, the auditor issues probe queries containing the trigger and measures whether stylistic features appear at rates significantly above baseline, yielding a statistical detection test.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 07:58:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively modify global router source code, driving &amp;ldquo;design-adaptive&amp;rdquo; EDA via QoR feedback: the algorithm itself is specialized to each chip design, rather than merely tuning hyperparameters.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22234/fig1.jpg&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Proposes a design-adaptive EDA paradigm: the tool&amp;rsquo;s internal algorithms automatically specialize for each design.&lt;/li&gt;&#xA;&lt;li&gt;Uses an LLM to evolve the global router&amp;rsquo;s source code, not just tune hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop by using QoR metrics as the evolutionary feedback signal.&lt;/li&gt;&#xA;&lt;li&gt;Integrates a QoR evaluation toolchain on top of OpenROAD infrastructure.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The LLM agent holds persistent contextual knowledge of an open-source global router and iteratively modifies its source code; each round runs detailed routing in OpenROAD to obtain QoR, and the results are fed back to the LLM to guide the next round of code changes. In effect, the code-evolution-plus-evaluation loop is packaged into an automated pipeline.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 07:57:20 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average (up to 2.42×) speedup over radix-select while preserving bit-exact outputs, yielding up to 7.52% end-to-end TPOT gains on DeepSeek-V3.2.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 07:56:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAL is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, and a validator checks them against true state and policy before any mutation. A prototype blocks 100% of unsafe intents with 12.4 ms median overhead.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Figure 1&#34; src=&#34;https://ftxj.github.io/images/papers/2604.22136/page1.png&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Direct coupling of stochastic LLM outputs to execution APIs is an unsound safety model.&lt;/li&gt;&#xA;&lt;li&gt;Separate &lt;em&gt;intent emission&lt;/em&gt; (model) from &lt;em&gt;intent validation + execution&lt;/em&gt; (control plane).&lt;/li&gt;&#xA;&lt;li&gt;Add an &lt;strong&gt;obfuscation membrane&lt;/strong&gt; to hide identity-sensitive state from the model.&lt;/li&gt;&#xA;&lt;li&gt;Maintain a cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; for audit and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, replay determinism.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Models produce structured intents &lt;code&gt;(action, args, justification)&lt;/code&gt; rather than raw API calls. 
The control plane:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 07:55:24 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper hypothesizes that LLM personalization is driven by a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; — specific attention heads encoding user style/topic preferences. It introduces Differential Preference Steering (DPS), a training-free decoding method that identifies these heads via causal masking and amplifies their effect at inference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 05:02:30 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GVR is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average kernel speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/04-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;, an auditing technique for detecting unauthorized use of protected retrieved documents in RL fine-tuning (RLFT) pipelines. Unlike memorization-based audits, it plants trigger-conditioned stylistic preferences that surface as behavioral shifts, achieving 67% detection at 10% FPR (AUROC 0.756) with only 1% canary injection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 05:01:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-24/03-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve uses an agentic LLM to iteratively evolve global router source code, specializing EDA algorithms per-design via QoR feedback within OpenROAD, achieving up to 8.72% post-detailed-routing wirelength reduction over baselines.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &amp;ldquo;design-adaptive EDA tooling&amp;rdquo;: algorithms themselves adapt to each design, not just hyperparameters.&lt;/li&gt;&#xA;&lt;li&gt;Uses LLM-driven code evolution on global router source code.&lt;/li&gt;&#xA;&lt;li&gt;Closes the loop with QoR-driven feedback from OpenROAD toolchain.&lt;/li&gt;&#xA;&lt;li&gt;Equips the LLM with persistent contextual knowledge about open-source routers.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code evolution framework wrapping an agentic LLM around an open-source global router. The LLM iteratively edits the router&amp;rsquo;s source code; each candidate is compiled and evaluated through an integrated OpenROAD QoR pipeline. Persistent context about router internals grounds the LLM, and QoR metrics (notably post-detailed-routing wirelength) steer subsequent mutations.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
