arXiv: 2604.22312 · PDF
Authors: Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang
Affiliations: NVIDIA
Primary category: cs.DC · all: cs.AR, cs.DC, cs.PF
Matched keywords: llm, rag, serving, speculative decoding, attention, latency
TL;DR
GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation of consecutive decode steps in DeepSeek Sparse Attention, it delivers 1.88× average (up to 2.42×) speedup over TensorRT-LLM’s radix-select kernel while preserving bit-exact outputs.
Key Ideas
- Top-K selection in sparse-attention decoders is a real latency bottleneck once indexer/attention kernels are optimized.
- Consecutive decode steps share most Top-K entries, grounded in the Toeplitz / RoPE structure of DSA indexer scores.
- Prior-step Top-K can serve as a prediction signal to narrow thresholds before exact verification.
- Bit-exact results are preserved — this is an optimization, not an approximation.

The authors motivate GVR with this diagram: when the query advances by one position, a +1 offset in relative positions keeps ~60% of Top-K tokens while ~40% change, and the Toeplitz structure of attention scores (with non-monotonic peaks at large Δ) explains why this persistence emerges.
Approach
GVR follows four stages: (1) Guess — use the previous step’s Top-K as a prediction; (2) pre-compute indexed statistics; (3) Verify — secant-style counting in 1–2 global passes narrows to a valid threshold; (4) Refine — a ballot-free collector gathers candidates and finishes exact selection in shared memory.

The baseline invokeIndexerTopKDecode in indexerTopK.cu dispatches among insert sort (short sequences), radix sort (medium), and a multi-CTA split (very long). GVR replaces this decode-stage Top-K path.
Experiments
Integrated into TensorRT-LLM on Blackwell with DeepSeek-V3.2. Benchmarks use real DSA workloads including SWE-bench-derived LongSeqTasks (e.g., swe_bench_64k.jsonl, 2,025 generated tokens). Baseline: the production radix-select kernel. Deployment tested under TEP8 min-latency configuration at up to 100K context, including speculative decoding.
Results
- 1.88× average single-operator speedup vs. radix-select, up to 2.42× per layer per step.
- Up to 7.52% end-to-end TPOT improvement at 100K context in TEP8 min-latency.
- Gains grow with longer contexts; smaller but still positive under speculative decoding.
- Outputs remain bit-exact.

Measured on entry 1 of swe_bench_64k.jsonl (2,024 valid step-to-step comparisons), raw Top-K overlap — the strictest notion of temporal persistence — remains high across DeepSeek-V3.2 layers, empirically validating the guess-then-verify design.
Why It Matters
For long-context LLM serving, sparse-attention indexers shift cost onto the Top-K stage. GVR shows that exploiting decode-step temporal structure yields meaningful TPOT wins without accuracy loss — directly usable by anyone running DSA on Blackwell.
Connections to Prior Work
Builds on DeepSeek Sparse Attention and its RoPE/Toeplitz score geometry; contrasts with radix-select and multi-CTA Top-K kernels in TensorRT-LLM; conceptually related to speculative/predict-then-verify patterns used in decoding.
Open Questions
Does temporal stability hold for non-DSA sparse attentions or non-Blackwell hardware? How does GVR interact with larger speculative-decoding draft lengths? Behavior under batched multi-request serving and at contexts beyond 100K is not characterized.
Original abstract
Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present \textbf{Guess-Verify-Refine (GVR)}, a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step’s Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1-2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz / RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and validate the design on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM. GVR achieves an average \textbf{1.88x} single-operator speedup over the production radix-select kernel, with up to \textbf{2.42x} per layer per step, while preserving bit-exact Top-K outputs. In controlled TEP8 min-latency deployment, it improves end-to-end TPOT by up to \textbf{7.52%} at 100K context, with larger gains at longer contexts and smaller but still positive gains under speculative decoding. While implemented and validated in the current TensorRT-LLM DSA stack on Blackwell, the same principle may extend to sparse-attention decoders whose decode-phase Top-K exhibits temporal stability.