<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-28 Paper Digest on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/posts/2026-04-28/</link>
    <description>Recent content in 2026-04-28 Paper Digest on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Tue, 28 Apr 2026 14:31:08 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/posts/2026-04-28/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents</title>
      <link>https://ftxj.github.io/posts/2026-04-28/10-agenticcache-cache-driven-asynchronous-planning-for-embodied/</link>
      <pubDate>Tue, 28 Apr 2026 14:31:08 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/10-agenticcache-cache-driven-asynchronous-planning-for-embodied/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24039v1&#34;&gt;2604.24039&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24039v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Hojoon Kim, Yuheng Wu, Thierry Tambe&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Stanford University, Harvard University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, multi-agent, rag, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;AgenticCache caches 2-gram plan transitions for LLM-driven embodied agents, serving most planning decisions from a local cache while a background LLM updater asynchronously validates and corrects entries. Across 4 multi-agent benchmarks × 3 GPT-5 scales, it lifts success rate by 22% on average, cuts latency 65%, and reduces tokens 50%.&lt;/p&gt;</description>
    </item>
    <item>
      <title>BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment</title>
      <link>https://ftxj.github.io/posts/2026-04-28/09-bitrl-reinforcement-learning-with-1-bit-quantized-language-m/</link>
      <pubDate>Tue, 28 Apr 2026 14:22:14 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/09-bitrl-reinforcement-learning-with-1-bit-quantized-language-m/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24273v1&#34;&gt;2604.24273&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24273v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Md. Ashiq Ul Islam Sajid, Mohammad Sakib Mahmood, Md. Tareq Hasan, Md Abdur Rahim, Rafat Ara, Md. Arafat Hossain&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; N/A&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, inference, quantization, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;BitRL freezes a 2B-parameter BitNet b1.58 backbone (ternary weights {−1,0,+1}) and trains only small (~50K-param) PPO policy/value heads, yielding RL agents that retain 85–98% of FP16 performance with 10–16× memory reduction and 3–5× energy savings on a Raspberry Pi 4.&lt;/p&gt;</description>
    </item>
    <item>
      <title>DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference</title>
      <link>https://ftxj.github.io/posts/2026-04-28/08-depthkv-layer-dependent-kv-cache-pruning-for-long-context-ll/</link>
      <pubDate>Tue, 28 Apr 2026 14:16:05 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/08-depthkv-layer-dependent-kv-cache-pruning-for-long-context-ll/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24647v1&#34;&gt;2604.24647&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24647v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Zahra Dehghanighobadi, Asja Fischer&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Ruhr University Bochum, UAR Research Center for Trustworthy Data Science and Security&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, inference, kv cache, attention&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;DepthKV reallocates a fixed global KV-cache budget non-uniformly across transformer layers based on per-layer sensitivity to pruning, using InfoNCE-derived importance scores. At 60% global pruning, it consistently beats uniform pruning (e.g., H₂O) across summarization, QA, and GSM-∞ reasoning on Gemma-7B, LLaMA-3.1-8B, and Qwen2.5-7B.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols</title>
      <link>https://ftxj.github.io/posts/2026-04-28/07-beyond-the-attention-stability-boundary-agentic-self-synthes/</link>
      <pubDate>Tue, 28 Apr 2026 14:07:34 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/07-beyond-the-attention-stability-boundary-agentic-self-synthes/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24512v1&#34;&gt;2604.24512&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24512v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Dahlia Shehata, Ming Li&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; University of Waterloo&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, agent, agentic, retrieval, reasoning, attention, transformer&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper formalizes the &lt;em&gt;Attention Latch&lt;/em&gt; — a failure where multi-turn LLM agents stay anchored to stale goals — and proposes &lt;strong&gt;SSRP&lt;/strong&gt;, an Architect/Executive split that auto-synthesizes per-task SOPs. On MultiWOZ 2.2 (9K trajectories), SSRP lifts GPT-5.4 from 0.1% to 71.6% on 3-hop semantic hijacking.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer</title>
      <link>https://ftxj.github.io/posts/2026-04-28/06-grounding-before-generalizing-how-ai-differs-from-humans-in/</link>
      <pubDate>Tue, 28 Apr 2026 13:59:59 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/06-grounding-before-generalizing-how-ai-differs-from-humans-in/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24062v1&#34;&gt;2604.24062&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24062v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Liangru Xiang, Yuxi Ma, Zhihao Cao, Yixin Zhu, Song-Chun Zhu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Tsinghua University, Peking University, State Key Laboratory of General Artificial Intelligence, Beijing Key Laboratory of Behavior and Mental Health&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Using the OpenLock paradigm, the authors show that four frontier models (GPT-5.2, Claude-4.5-Sonnet, Gemini-3-Flash, DeepSeek-V3.2) can discover causal structures as efficiently as humans in text, but—unlike humans—fail to transfer Common Cause / Common Effect schemas to new environments until after an initial grounding solution, and are hurt rather than helped by visual input.&lt;/p&gt;</description>
    </item>
    <item>
      <title>PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model</title>
      <link>https://ftxj.github.io/posts/2026-04-28/05-physnote-self-knowledge-notes-for-evolvable-physical-reasoni/</link>
      <pubDate>Tue, 28 Apr 2026 13:52:50 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/05-physnote-self-knowledge-notes-for-evolvable-physical-reasoni/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24443v1&#34;&gt;2604.24443&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24443v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; The Chinese University of Hong Kong, Shenzhen, Rice University, City University of Hong Kong, Fudan University&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, agentic, multi-agent, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;Automated analysis unavailable (claude CLI timeout). Showing raw abstract.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;h2 id=&#34;abstract&#34;&gt;Abstract&lt;/h2&gt;&#xA;&lt;p&gt;Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated &amp;ldquo;Knowledge Notes.&amp;rdquo; PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Stabilizing Efficient Reasoning with Step-Level Advantage Selection</title>
      <link>https://ftxj.github.io/posts/2026-04-28/04-stabilizing-efficient-reasoning-with-step-level-advantage-se/</link>
      <pubDate>Tue, 28 Apr 2026 13:44:12 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/04-stabilizing-efficient-reasoning-with-step-level-advantage-se/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24003v1&#34;&gt;2604.24003&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24003v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; UNC Chapel Hill, Advanced Micro Devices, Inc&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, reasoning, inference, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Step-level Advantage Selection (SAS) zeros advantages for low-confidence steps in correct GRPO rollouts and high-confidence steps in verifier-failed rollouts, stabilizing short-context post-training. On five math benchmarks it lifts Pass@1 by 0.86 points over the strongest length-aware baseline while cutting reasoning length by 16.3%.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Chameleon&#39;s Limit: Investigating Persona Collapse and Homogenization in Large Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-28/03-the-chameleon-s-limit-investigating-persona-collapse-and-hom/</link>
      <pubDate>Tue, 28 Apr 2026 13:34:35 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/03-the-chameleon-s-limit-investigating-persona-collapse-and-hom/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24698v1&#34;&gt;2604.24698&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24698v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan, Jen-tse Huang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; CMU, UChicago, MIT, 2077.ai, UTokyo, RIKEN AIP, JHU&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Ten LLMs asked to role-play 1,144 richly specified personas collapse into a narrow behavioral mode — agents converge despite distinct profiles. A geometric framework (Coverage, Uniformity, Complexity on a Behavioral Trait Matrix) plus item-level diagnostics shows collapse is multi-axis and task-contingent, and that the highest-fidelity models produce the most stereotyped populations.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling</title>
      <link>https://ftxj.github.io/posts/2026-04-28/02-long-context-aware-upcycling-a-new-frontier-for-hybrid-llm-s/</link>
      <pubDate>Tue, 28 Apr 2026 13:26:06 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/02-long-context-aware-upcycling-a-new-frontier-for-hybrid-llm-s/</guid>
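      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24715v1&#34;&gt;2604.24715&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24715v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; AMD&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, reasoning, inference, serving, kv-cache, attention, transformer, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HyLo is a training recipe that upcycles a pretrained Transformer into a hybrid MLA + Mamba2/GDN long-context model. Through staged long-context training and teacher distillation, it extends the usable context to 32×, cuts KV cache by &amp;gt;90%, and significantly outperforms existing upcycling baselines such as Zebra-Llama on RULER.&lt;/p&gt;&#xA;&lt;h2 id=&#34;motivation&#34;&gt;Motivation&lt;/h2&gt;&#xA;&lt;p&gt;Existing hybrid architectures (Jamba, Samba, Qwen3-Next, Kimi-Linear) are mostly pretrained from scratch at high cost, while existing upcycling methods (MambaInLlama, Mohawk, Llamba, Zebra-Llama) focus only on short-context perplexity and commonsense benchmarks and pay almost no attention to preserving long-context ability. The paper&#39;s numbers expose the problem directly: Zebra-Llama-1B scores only 12.3 on RULER-8K, drops to 3.7 at 32K, and is close to 0 at 64K (Table 2); Llamba-1B scores ≤ 2.9 across all RULER lengths. For operators running long-document serving, long code completion, or multi-hop reasoning on vLLM/SGLang, this means hybrid models are &amp;quot;long in name but not in practice&amp;quot;: they are forced to keep serving the original Transformer and hit OOM beyond 64K. The authors&#39; entry point is that Zebra-Llama gets initialization right but trains only up to ~24K and uses no long-context teacher distillation, which is exactly the lever to pull. HyLo promotes &amp;quot;long-context retention&amp;quot; to a first-class training objective and argues for a memory-friendly distillation stack that lets an 8B teacher run at 64K.&lt;/p&gt;</description>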
    </item>
    <item>
      <title>FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training</title>
      <link>https://ftxj.github.io/posts/2026-04-28/01-flashoverlap-minimizing-tail-latency-in-communication-overla/</link>
      <pubDate>Tue, 28 Apr 2026 13:17:10 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-28/01-flashoverlap-minimizing-tail-latency-in-communication-overla/</guid>
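      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.24013v1&#34;&gt;2604.24013&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.24013v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Toronto Ascend Team, Huawei&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.CV, cs.DC, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, distributed training, parallelism, gpu, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;FlashOverlap decomposes Reduce-Scatter and All-Gather into asynchronous P2P communication and schedules the sharded computation adaptively per rank, so that computing the last chunk no longer depends on communication. This eliminates the tail latency of data-slicing schemes: on an MLP with TP=4 and (b,s,d)=(32,4096,4096), communication overhead falls from 43.8 ms to 0.1 ms (a 99.8% reduction).&lt;/p&gt;&#xA;&lt;h2 id=&#34;motivation&#34;&gt;Motivation&lt;/h2&gt;&#xA;&lt;p&gt;Distributed LLM training and inference rely on parallelism schemes such as TP, TPSP, DP, and Ulysses, but all-reduce / reduce-scatter / all-gather / all-to-all create severe communication bottlenecks that limit the scalability of intra-layer parallelism, especially across nodes. Mainstream frameworks (Megatron, MindSpeed (Ascend MC2), Ascend CoC) use &amp;quot;data slicing + asynchronous communication&amp;quot; to hide the communication of intermediate chunks behind computation. When communication is shorter than computation, most of it overlaps, but the last chunk&#39;s communication is always exposed and forms a tail overhead; slicing more finely makes the GEMMs memory-bound and slower overall. A second line of work, &amp;quot;algorithmic decomposition&amp;quot; (e.g., Google Decompose on TPU), splits a collective into a sequence of non-blocking steps, but still requires strong synchronization at intermediate steps and increases total communication volume. The authors therefore want an &amp;quot;exact&amp;quot;, tail-free, unified scheme compatible with TPSP/UP/DP. It targets vLLM/Megatron-style deployers who run Transformer, Mamba, and hybrid models side by side, and aims to extend TP/TPSP from &amp;quot;only worthwhile within a single node&amp;quot; to multi-node settings.&lt;/p&gt;</description>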
    </item>
  </channel>
</rss>
