<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-27 on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/categories/2026-04-27/</link>
    <description>Recent content in 2026-04-27 on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 10:25:37 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/categories/2026-04-27/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>QuantClaw: Precision Where It Matters for OpenClaw</title>
      <link>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</link>
      <pubDate>Mon, 27 Apr 2026 10:25:37 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/10-quantclaw-precision-where-it-matters-for-openclaw/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22577v1&#34;&gt;2604.22577&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22577v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Huawei Technologies, National University of Singapore, University of Science and Technology of China&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, reasoning, inference, serving, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;QuantClaw is a plug-and-play precision routing plugin for OpenClaw agent systems that dynamically assigns quantization precision per task, cutting cost by up to 21.4% and latency by 15.7% on GLM-5 vs an FP8 baseline while preserving task quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 10:24:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/09-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; NVIDIA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Guess-Verify-Refine (GVR) is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell that exploits temporal correlation across decode steps, delivering 1.88× average (up to 2.42×) single-operator speedup over radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 10:22:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/08-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Openchip &amp;amp; Softwares Technologies&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that applies different attention strategies (softmax, linear sliding-window, or removal) per layer based on sensitivity analysis, followed by lightweight distillation healing using just 10M tokens. It improves throughput by up to 68% at high concurrency while preserving quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 10:21:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/07-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Mayo Clinic, University of Tulsa&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A lightweight patient-trial matching framework that uses retrieval-augmented generation to extract relevant EHR segments and LLMs to encode them, achieving performance comparable to end-to-end LLM pipelines at substantially lower compute cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 10:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/06-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) in LLMs — deception, evaluation gaming, reward hacking, and more. Across 11 reasoning LLMs, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety improvements.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 10:19:21 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/05-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Google&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Behavioral Canaries audit whether RL fine-tuning pipelines were illegally trained on protected retrieved contexts. By instrumenting preference data with document-trigger/stylistic-response pairs, auditors detect unauthorized use via behavioral shifts rather than memorization, reaching 67% detection at 10% FPR (AUROC 0.756) with 1% canary injection.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Verbatim memorization and membership inference fail for RL-trained models since RL shapes behavioral style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;: latent trigger-conditioned preferences planted via instrumented preference data.&lt;/li&gt;&#xA;&lt;li&gt;Auditing target is RLFT (RL fine-tuning) pipelines on legally protected retrieved contexts in agentic workflows.&lt;/li&gt;&#xA;&lt;li&gt;Detection works through distributional behavioral change, not leakage of content.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Pair document &lt;em&gt;triggers&lt;/em&gt; with preference feedback that rewards a distinctive stylistic response. 
If a provider incorporates such canary-laced documents into RLFT, the model acquires a latent trigger→style preference. Auditors then query with triggers and statistically test for the stylistic signature.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 10:17:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/04-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Arizona State University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is a code-evolution framework that uses an agentic LLM to iteratively modify global routing source code based on QoR feedback, producing design-adaptive EDA tooling. It achieves up to 8.72% post-detailed-routing wirelength reduction over baseline routers across seven benchmarks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;design-adaptive EDA tooling&lt;/strong&gt;: internal algorithms specialize to each design rather than relying on fixed heuristics or hyperparameter tuning.&lt;/li&gt;&#xA;&lt;li&gt;Uses an &lt;strong&gt;agentic LLM&lt;/strong&gt; to evolve global router source code iteratively, guided by QoR feedback.&lt;/li&gt;&#xA;&lt;li&gt;Provides the LLM with persistent contextual knowledge of open-source global routers plus an integrated QoR evaluation toolchain in OpenROAD.&lt;/li&gt;&#xA;&lt;li&gt;Demonstrates that LLM-driven code evolution can outperform static algorithm implementations.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve frames global routing improvement as a code-evolution loop. 
An agentic LLM is given persistent context about open-source global routers and accumulated QoR history from prior iterations, then proposes source-code modifications. Each candidate is compiled and evaluated inside the OpenROAD infrastructure; the resulting QoR metrics feed back into the next iteration, driving design-specific algorithm specialization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 10:15:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/03-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Moorcheh AI, EdgeAI Innovations&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a universal memory layer for long-horizon agents that replaces hybrid semantic-graph architectures with a typed semantic schema plus Moorcheh&amp;rsquo;s information-theoretic search engine, reaching 89.8% on LongMemEval and 87.1% on LoCoMo with single-query retrieval and sub-90ms latency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 10:14:47 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; OpenKedge.io&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Sovereign Agentic Loops (SAL) is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, which a control plane validates against real system state and policy before any API call mutates a system.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Passing stochastic LLM outputs directly to execution layers is unsafe because correctness, context awareness, and alignment cannot be assumed at execution time.&lt;/li&gt;&#xA;&lt;li&gt;Agents should emit &lt;strong&gt;structured intents with justifications&lt;/strong&gt; rather than raw API calls.&lt;/li&gt;&#xA;&lt;li&gt;An &lt;strong&gt;obfuscation membrane&lt;/strong&gt; limits model access to identity-sensitive state.&lt;/li&gt;&#xA;&lt;li&gt;A cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; enables auditability and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, deterministic replay.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;SAL inserts a control plane between the LLM and execution layer. 
The model produces structured intents annotated with justifications; the control plane checks them against true system state and policy. The obfuscation membrane restricts what identity-sensitive state the model can see, and the Evidence Chain cryptographically links every intent, validation, and execution step for replay and audit. The authors formalize the architecture and prove the three guarantees above under stated assumptions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 10:13:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes Differential Preference Steering (DPS), a training-free mechanistic interpretability framework that identifies sparse &amp;ldquo;Preference Heads&amp;rdquo; — attention heads causally encoding user-specific style and topic — and contrasts logits with/without them at decoding time to deliver interpretable personalization in LLMs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs</title>
      <link>https://ftxj.github.io/posts/2026-04-27/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</link>
      <pubDate>Mon, 27 Apr 2026 09:38:09 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/09-layerboost-layer-aware-attention-reduction-for-efficient-llm/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22050v1&#34;&gt;2604.22050&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22050v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Openchip &amp;amp; Softwares Technologies&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, attention, transformer, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;LayerBoost is a layer-aware attention reduction method that uses sensitivity analysis to selectively keep softmax, swap in linear sliding-window attention, or drop attention entirely per layer, with a lightweight 10M-token distillation healing phase. It boosts throughput up to 68% at high concurrency while matching or nearly matching base model quality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching</title>
      <link>https://ftxj.github.io/posts/2026-04-27/08-lightweight-retrieval-augmented-generation-and-large-languag/</link>
      <pubDate>Mon, 27 Apr 2026 09:37:11 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/08-lightweight-retrieval-augmented-generation-and-large-languag/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22061v1&#34;&gt;2604.22061&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22061v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Mayo Clinic, University of Tulsa&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, retrieval, reasoning, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A lightweight patient-trial matching framework that uses retrieval-augmented generation (RAG) to select clinically relevant EHR segments and LLMs to encode them, then applies dimensionality reduction plus lightweight predictors — matching end-to-end LLM performance at far lower cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning</title>
      <link>https://ftxj.github.io/posts/2026-04-27/07-behavioral-canaries-auditing-private-retrieved-context-usage/</link>
      <pubDate>Mon, 27 Apr 2026 09:35:52 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/07-behavioral-canaries-auditing-private-retrieved-context-usage/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22191v1&#34;&gt;2604.22191&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22191v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Chaoran Chen, Dayu Yuan, Peter Kairouz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Google&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Behavioral Canaries audit whether RL fine-tuning illicitly uses retrieved-context data by injecting document triggers paired with distinctive stylistic rewards, inducing detectable trigger-conditioned preferences. At 1% injection, the method achieves 67% detection at 10% FPR (AUROC 0.756).&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Standard memorization/MI audits fail for RL-trained LLMs because RL shapes behavioral style, not fact retention.&lt;/li&gt;&#xA;&lt;li&gt;Introduces &lt;strong&gt;Behavioral Canaries&lt;/strong&gt;: pair document triggers with feedback rewarding a distinctive stylistic response.&lt;/li&gt;&#xA;&lt;li&gt;If the provider trains on protected retrieved contexts, a latent trigger-conditioned preference emerges and is detectable.&lt;/li&gt;&#xA;&lt;li&gt;Reframes auditing around distributional behavioral change instead of verbatim leakage.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The framework instruments preference data used in RLFT pipelines. Auditors seed the retrieved-context corpus with canary documents whose triggers are linked to preference labels favoring a distinctive stylistic response. 
During audit, the model is queried on trigger-bearing documents; significant elevation of the planted style indicates the canaries were incorporated into RL post-training.&lt;/p&gt;</description>
    </item>
    <item>
      <title>GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution</title>
      <link>https://ftxj.github.io/posts/2026-04-27/06-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</link>
      <pubDate>Mon, 27 Apr 2026 09:34:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/06-gr-evolve-design-adaptive-global-routing-via-llm-driven-algo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22234v1&#34;&gt;2604.22234&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22234v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Taizun Jafri, Vidya A. Chhabria&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Arizona State University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GR-Evolve is an agentic LLM framework that iteratively rewrites global-routing source code per design, using QoR-driven feedback in OpenROAD to produce design-adaptive EDA tooling. Across seven benchmarks on three technology nodes, it cuts post-detailed-routing wirelength by up to 8.72% over baseline routers.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation</title>
      <link>https://ftxj.github.io/posts/2026-04-27/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</link>
      <pubDate>Mon, 27 Apr 2026 09:33:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/05-guess-verify-refine-data-aware-top-k-for-sparse-attention-de/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22312v1&#34;&gt;2604.22312&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22312v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; NVIDIA&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.AR, cs.DC, cs.PF&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, rag, serving, speculative decoding, attention, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Guess-Verify-Refine (GVR) is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell that exploits temporal correlation between consecutive decode steps, delivering 1.88× average speedup over production radix-select while preserving bit-exact outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-27/04-memanto-typed-semantic-memory-with-information-theoretic-ret/</link>
      <pubDate>Mon, 27 Apr 2026 09:31:40 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/04-memanto-typed-semantic-memory-with-information-theoretic-ret/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22085v1&#34;&gt;2604.22085&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22085v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Moorcheh AI, EdgeAI Innovations&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, retrieval, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Memanto is a universal memory layer for long-horizon agents that replaces hybrid knowledge-graph pipelines with a typed semantic schema plus Moorcheh&amp;rsquo;s information-theoretic search, hitting 89.8% on LongMemEval and 87.1% on LoCoMo with sub-90 ms single-query retrieval and zero ingestion cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-27/03-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</link>
      <pubDate>Mon, 27 Apr 2026 09:30:16 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/03-sovereign-agentic-loops-decoupling-ai-reasoning-from-executi/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22136v1&#34;&gt;2604.22136&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22136v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jun He, Deying Yu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; OpenKedge.io&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.CR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Sovereign Agentic Loops (SAL) is a control-plane architecture that decouples LLM reasoning from execution: models emit structured intents with justifications, which are validated against true system state and policy before any mutation. A prototype blocks unsafe actions with 12.4 ms median overhead.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Direct coupling of stochastic LLM outputs to execution layers is unsafe; model correctness and alignment cannot be assumed at runtime.&lt;/li&gt;&#xA;&lt;li&gt;Models should emit &lt;strong&gt;structured intents with justifications&lt;/strong&gt;, not raw API calls.&lt;/li&gt;&#xA;&lt;li&gt;An &lt;strong&gt;obfuscation membrane&lt;/strong&gt; limits model access to identity-sensitive state.&lt;/li&gt;&#xA;&lt;li&gt;A cryptographically linked &lt;strong&gt;Evidence Chain&lt;/strong&gt; enables auditability and deterministic replay.&lt;/li&gt;&#xA;&lt;li&gt;Formal guarantees: policy-bounded execution, identity isolation, deterministic replay.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;SAL inserts a control plane between the LLM and execution layer. 
The model produces structured intents plus justifications; the control plane validates each intent against true system state and policy before dispatching it. The obfuscation membrane mediates what identity-sensitive state the model can observe, and every decision is recorded in the cryptographically linked Evidence Chain, which supports deterministic replay. The authors formalize the architecture and prove the safety properties under stated assumptions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization</title>
      <link>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</link>
      <pubDate>Mon, 27 Apr 2026 09:28:57 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/02-preference-heads-in-large-language-models-a-mechanistic-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22345v1&#34;&gt;2604.22345&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22345v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, serving, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper posits that LLM personalization is concentrated in a sparse set of &amp;ldquo;Preference Heads&amp;rdquo; and introduces Differential Preference Steering (DPS), a training-free method that identifies these heads via causal masking and contrasts logits with/without them at decoding to amplify user-aligned outputs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework</title>
      <link>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</link>
      <pubDate>Mon, 27 Apr 2026 09:26:31 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-27/01-emergent-strategic-reasoning-risks-in-ai-a-taxonomy-driven-e/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.22119v1&#34;&gt;2604.22119&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.22119v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Amazon Nova Responsible AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper introduces ESRRSim, a taxonomy-driven agentic framework for benchmarking Emergent Strategic Reasoning Risks (ESRRs) — deception, evaluation gaming, reward hacking — in LLMs. Across 11 reasoning models, detection rates span 14.45%–72.72%, with newer generations showing dramatic safety gains.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
