Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

arXiv: 2604.22312 · PDF

作者: Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang

主分类: cs.DC · 全部: cs.AR, cs.DC, cs.PF

命中关键词: llm, rag, serving, speculative decoding, attention, latency

TL;DR

GVR 利用相邻 decode 步骤 Top-K 的时间相关性，在 Blackwell 上以 guess-verify-refine 三段式加速稀疏注意力的精确 Top-K 选择，单算子平均提速 1.88×。

核心观点

Sparse-attention decoder 的 exact Top-K 阶段即使已高度优化，仍是长上下文 LLM serving 的显著延迟瓶颈。
连续 decode step 之间 Top-K 集合具有强时间相关性，可用作先验预测信号。
将 DSA indexer 分数的 Toeplitz / RoPE 结构与 Top-K 稳定性联系起来，奠定数据感知设计基础。
提出 GVR 算法在保持 bit-exact 输出前提下显著快于生产级 radix-select kernel。

方法

GVR 分三步：(1) Guess —— 以上一 step 的 Top-K 作为预测，读取预先计算的统计量；(2) Verify —— 用 secant-style counting 在 1-2 次 global pass 中收敛到合法阈值，并通过 ballot-free collector 收集候选；(3) Refine —— 在 shared memory 中完成精确选择。整个流程针对 NVIDIA Blackwell 架构特性设计，并集成进 TensorRT-LLM 的 DSA 栈。

实验

工作负载：真实 DeepSeek-V3.2 + DSA indexer。
基线：TensorRT-LLM 生产级 radix-select Top-K kernel。
部署：TEP8 min-latency，100K context，含 speculative decoding 场景。
指标：单算子 speedup、端到端 TPOT、上下文长度伸缩性、bit-exact 正确性。

结果

单算子相对 radix-select 平均 1.88× 加速，峰值 2.42×/layer/step；100K context 下端到端 TPOT 最多改善 7.52%，上下文越长收益越大，speculative decoding 下收益缩小但仍为正。Top-K 输出保持 bit-exact。

为什么重要

对长上下文 LLM serving，Top-K 选择常被忽视却是隐形瓶颈。GVR 说明：利用跨 step 的时间相关性可把精确 Top-K 降为"验证+微调"，为 sparse attention 推理栈提供一条不损精度的加速路径，对 DSA、Blackwell 部署尤其直接可用。

与已有工作的关系

建立在 DeepSeek Sparse Attention (DSA) 与 DeepSeek-V3.2 之上。
替换 TensorRT-LLM 中基于 radix-select 的 Top-K。
与 RoPE、Toeplitz 结构分析相关的注意力打分研究相呼应。
属于 sparse attention / KV 选择这一加速家族（如 H2O、Quest 等近似方法），但聚焦精确 Top-K。

尚未回答的问题

在非 DSA（如 learned indexer、非 RoPE 结构）上时间相关性假设是否仍成立。
Prefill 阶段或 batch 大、请求差异强的场景适用性如何。
在 Hopper/消费级 GPU 或无 ballot-free 原语硬件上的可移植性。
与 speculative decoding、chunked prefill 等技术组合时收益衰减的根因及缓解。
预索引统计量的内存与维护开销在大规模部署下的具体代价。

论文图表

图 1: Figure 1 (extracted from PDF)

图 1

图 2: Figure 2 (extracted from PDF)

图 2

图 3: Figure 3 (extracted from PDF)

图 3

原始摘要

Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present \textbf{Guess-Verify-Refine (GVR)}, a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step’s Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1-2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz / RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and validate the design on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM. GVR achieves an average \textbf{1.88x} single-operator speedup over the production radix-select kernel, with up to \textbf{2.42x} per layer per step, while preserving bit-exact Top-K outputs. In controlled TEP8 min-latency deployment, it improves end-to-end TPOT by up to \textbf{7.52%} at 100K context, with larger gains at longer contexts and smaller but still positive gains under speculative decoding. While implemented and validated in the current TensorRT-LLM DSA stack on Blackwell, the same principle may extend to sparse-attention decoders whose decode-phase Top-K exhibits temporal stability.