arXiv: 2604.22312 · PDF
Authors: Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang
Primary category: cs.DC · all: cs.AR, cs.DC, cs.PF
Matched keywords: llm, rag, serving, speculative decoding, attention, latency
TL;DR
GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average (up to 2.42×) speedup over radix-select while preserving bit-exact outputs, yielding up to 7.52% end-to-end TPOT gains on DeepSeek-V3.2.

Key Ideas
- Top-K selection runs once per decode step and is a real latency bottleneck in long-context sparse-attention serving, even when the indexer and attention kernels are already highly optimized.
- Consecutive decode steps exhibit strong temporal correlation in Top-K sets, usable as a prediction prior.
- DSA indexer scores inherit Toeplitz/RoPE structure, motivating pre-indexed statistics.
- Exact (bit-exact) Top-K is preserved; no approximation tradeoff.
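The "temporal correlation as a prediction prior" idea can be sanity-checked with a tiny synthetic experiment. This is purely illustrative: the shared-landscape-plus-noise score model and all names below are our assumptions, not the paper's code.

```python
import numpy as np

def topk_overlap(prev_scores, curr_scores, k):
    """Jaccard overlap between the Top-K index sets of two score vectors."""
    prev = set(np.argpartition(prev_scores, -k)[-k:].tolist())
    curr = set(np.argpartition(curr_scores, -k)[-k:].tolist())
    return len(prev & curr) / len(prev | curr)

# Model consecutive decode steps as a shared score landscape plus small
# per-step perturbations (a stand-in for real DSA indexer scores).
rng = np.random.default_rng(0)
base = rng.normal(size=10_000)
step1 = base + 0.05 * rng.normal(size=10_000)
step2 = base + 0.05 * rng.normal(size=10_000)
print(topk_overlap(step1, step2, k=256))  # high overlap -> usable prior
```

When consecutive steps share most of their score structure, the overlap stays close to 1, which is exactly what makes the previous step's k-th score a good threshold guess.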
Approach
Three-phase “Guess-Verify-Refine” pipeline:
- Guess: reuse previous step’s Top-K as a threshold predictor.
- Verify: secant-style counting in 1–2 global passes narrows to a valid threshold using pre-indexed statistics.
- Refine: ballot-free candidate collection, with the final exact selection resolved in shared memory.
The design is tailored to Blackwell’s memory hierarchy and integrated into TensorRT-LLM’s DSA stack.
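The three phases can be sketched on the host as follows. This is our simplified model, not the kernel itself: bisection stands in for the paper's secant-style update, NumPy stands in for GPU counting passes, and the `4 * k` candidate-set bound is an arbitrary illustrative choice.

```python
import numpy as np

def gvr_topk(scores, k, prev_kth):
    """Illustrative host-side model of Guess-Verify-Refine exact Top-K."""
    lo, hi = float(scores.min()), float(scores.max())
    # Guess: seed the threshold with the previous step's k-th largest score.
    t = min(max(float(prev_kth), lo), hi)
    # Verify: a couple of counting passes tighten the threshold so the
    # candidate set is small but still provably contains the true Top-K.
    for _ in range(2):
        c = int((scores >= t).sum())
        if k <= c <= 4 * k:
            break
        if c < k:
            hi = t            # threshold too high: too few survivors
        else:
            lo = t            # threshold too low: too many survivors
        t = (lo + hi) / 2     # bisection stand-in for the secant update
    # Correctness guard: if fewer than k survivors remain, fall back to the
    # global minimum so every index becomes a candidate (still exact).
    if int((scores >= t).sum()) < k:
        t = float(scores.min())
    # Refine: exact selection among the surviving candidates.
    cand = np.flatnonzero(scores >= t)
    top = cand[np.argpartition(scores[cand], -k)[-k:]]
    return np.sort(top)

# Usage: warm-start from a (synthetic) previous decode step's k-th score.
rng = np.random.default_rng(0)
scores = rng.normal(size=8192)
prev = scores + 0.05 * rng.normal(size=8192)
prev_kth = np.partition(prev, -128)[-128]
idx = gvr_topk(scores, 128, prev_kth)
ref = np.sort(np.argpartition(scores, -128)[-128:])
print(np.array_equal(idx, ref))  # should match the reference Top-K exactly
```

The key invariant is that any threshold admitting at least k survivors cannot exclude a true Top-K element, so the result is exact regardless of how good the guess was; a good guess only shrinks the candidate set the refine phase has to touch.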

Experiments
- Workload: DeepSeek-V3.2 (DSA indexer) on real decode traces.
- Integration: TensorRT-LLM on NVIDIA Blackwell.
- Baseline: production radix-select Top-K kernel.
- Metrics: single-op speedup and end-to-end TPOT, measured at context lengths up to 100K, with and without speculative decoding.
- Deployment: TEP8 min-latency configuration.
Results
- 1.88× average, up to 2.42× per-layer per-step single-op speedup.
- Up to 7.52% TPOT improvement at 100K context, end-to-end.
- Gains grow with longer contexts; smaller but still positive under speculative decoding.
- Outputs bit-exact vs. reference.

Why It Matters
For LLM-serving infrastructure, Top-K was widely treated as solved once radix-select shipped. GVR shows that exploiting temporal workload structure yields near-2× kernel wins that translate into real TPOT gains, which matters as sparse-attention (DSA-style) decoders become mainstream for long-context serving.
Connections to Prior Work
- DeepSeek Sparse Attention (DSA) and its RoPE/Toeplitz indexer structure.
- GPU Top-K literature: radix-select, bitonic, bucket-based selectors.
- Long-context sparse-attention work (H2O, Quest, StreamingLLM) that relies on Top-K or approximations of it.
- Temporal-locality exploitation reminiscent of KV-cache reuse and speculative decoding.
Open Questions
- Robustness when temporal correlation breaks (topic shifts, beam search divergence).
- Portability beyond Blackwell and beyond DSA indexers — does Toeplitz structure generalize?
- Interaction with larger speculative-decoding trees and batched multi-request serving.
- Memory overhead of pre-indexed statistics at very long contexts.
Original abstract
Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present Guess-Verify-Refine (GVR), a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step's Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1–2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz/RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and validate the design on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM. GVR achieves an average 1.88× single-operator speedup over the production radix-select kernel, with up to 2.42× per layer per step, while preserving bit-exact Top-K outputs. In a controlled TEP8 min-latency deployment, it improves end-to-end TPOT by up to 7.52% at 100K context, with larger gains at longer contexts and smaller but still positive gains under speculative decoding. While implemented and validated in the current TensorRT-LLM DSA stack on Blackwell, the same principle may extend to sparse-attention decoders whose decode-phase Top-K exhibits temporal stability.