arXiv: 2604.22312 · PDF
Authors: Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang
Primary category: cs.DC · all: cs.AR, cs.DC, cs.PF
Matched keywords: llm, rag, serving, speculative decoding, attention, latency
TL;DR
GVR is a data-aware exact Top-K kernel for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average (up to 2.42×) speedup over radix-select while preserving bit-exact outputs, yielding up to 7.52% end-to-end TPOT gains on DeepSeek-V3.2.

Key Ideas
- Top-K selection runs once per decode step and is a real latency bottleneck in long-context sparse-attention serving, even when the indexer and attention kernels are already highly optimized.
- Consecutive decode steps exhibit strong temporal correlation in Top-K sets, usable as a prediction prior.
- DSA indexer scores inherit Toeplitz/RoPE structure, motivating pre-indexed statistics.
- Exact (bit-exact) Top-K is preserved; no approximation tradeoff.
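The "temporal correlation as a prediction prior" idea can be sanity-checked with a tiny synthetic experiment. This is purely illustrative: the shared-landscape-plus-noise score model and all names below are our assumptions, not the paper's code.

```python
import numpy as np

def topk_overlap(prev_scores, curr_scores, k):
    """Jaccard overlap between the Top-K index sets of two score vectors."""
    prev = set(np.argpartition(prev_scores, -k)[-k:].tolist())
    curr = set(np.argpartition(curr_scores, -k)[-k:].tolist())
    return len(prev & curr) / len(prev | curr)

# Model consecutive decode steps as a shared score landscape plus small
# per-step perturbations (a stand-in for real DSA indexer scores).
rng = np.random.default_rng(0)
base = rng.normal(size=10_000)
step1 = base + 0.05 * rng.normal(size=10_000)
step2 = base + 0.05 * rng.normal(size=10_000)
print(topk_overlap(step1, step2, k=256))  # high overlap -> usable prior
```

When consecutive steps share most of their score structure, the overlap stays close to 1, which is exactly what makes the previous step's k-th score a good threshold guess.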
Approach
Three-phase “Guess-Verify-Refine” pipeline:
- Guess: reuse previous step’s Top-K as a threshold predictor.
- Verify: secant-style counting in 1–2 global passes narrows to a valid threshold using pre-indexed statistics.
- Refine: ballot-free candidate collection, with the final exact selection resolved in shared memory.
The design is tailored to Blackwell’s memory hierarchy and integrated into TensorRT-LLM’s DSA stack.
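The three phases can be sketched on the host as follows. This is our simplified model, not the kernel itself: bisection stands in for the paper's secant-style update, NumPy stands in for GPU counting passes, and the `4 * k` candidate-set bound is an arbitrary illustrative choice.

```python
import numpy as np

def gvr_topk(scores, k, prev_kth):
    """Illustrative host-side model of Guess-Verify-Refine exact Top-K."""
    lo, hi = float(scores.min()), float(scores.max())
    # Guess: seed the threshold with the previous step's k-th largest score.
    t = min(max(float(prev_kth), lo), hi)
    # Verify: a couple of counting passes tighten the threshold so the
    # candidate set is small but still provably contains the true Top-K.
    for _ in range(2):
        c = int((scores >= t).sum())
        if k <= c <= 4 * k:
            break
        if c < k:
            hi = t            # threshold too high: too few survivors
        else:
            lo = t            # threshold too low: too many survivors
        t = (lo + hi) / 2     # bisection stand-in for the secant update
    # Correctness guard: if fewer than k survivors remain, fall back to the
    # global minimum so every index becomes a candidate (still exact).
    if int((scores >= t).sum()) < k:
        t = float(scores.min())
    # Refine: exact selection among the surviving candidates.
    cand = np.flatnonzero(scores >= t)
    top = cand[np.argpartition(scores[cand], -k)[-k:]]
    return np.sort(top)

# Usage: warm-start from a (synthetic) previous decode step's k-th score.
rng = np.random.default_rng(0)
scores = rng.normal(size=8192)
prev = scores + 0.05 * rng.normal(size=8192)
prev_kth = np.partition(prev, -128)[-128]
idx = gvr_topk(scores, 128, prev_kth)
ref = np.sort(np.argpartition(scores, -128)[-128:])
print(np.array_equal(idx, ref))  # should match the reference Top-K exactly
```

The key invariant is that any threshold admitting at least k survivors cannot exclude a true Top-K element, so the result is exact regardless of how good the guess was; a good guess only shrinks the candidate set the refine phase has to touch.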

Experiments
- Workload: DeepSeek-V3.2 (DSA indexer) on real decode traces.
- Integration: TensorRT-LLM on NVIDIA Blackwell.
- Baseline: production radix-select Top-K kernel.
- Metrics: single-op speedup and end-to-end TPOT, measured at context lengths up to 100K, with and without speculative decoding.
- Deployment: TEP8 min-latency configuration.
Results
- 1.88× average, up to 2.42× per-layer per-step single-op speedup.
- Up to 7.52% TPOT improvement at 100K context, end-to-end.
- Gains grow with longer contexts; smaller but still positive under speculative decoding.
- Outputs bit-exact vs. reference.

Why It Matters
For LLM-serving infrastructure, Top-K was widely treated as solved once radix-select shipped. GVR shows that exploiting temporal workload structure yields near-2× kernel wins that translate into real TPOT gains, which matters as sparse-attention (DSA-style) decoders become mainstream for long-context serving.
Connections to Prior Work
- DeepSeek Sparse Attention (DSA) and its RoPE/Toeplitz indexer structure.
- GPU Top-K literature: radix-select, bitonic, bucket-based selectors.
- Long-context sparse-attention work (H2O, Quest, StreamingLLM) that relies on Top-K or approximations of it.
- Temporal-locality exploitation reminiscent of KV-cache reuse and speculative decoding.
Open Questions
- Robustness when temporal correlation breaks (topic shifts, beam search divergence).
- Portability beyond Blackwell and beyond DSA indexers — does Toeplitz structure generalize?
- Interaction with larger speculative-decoding trees and batched multi-request serving.
- Memory overhead of pre-indexed statistics at very long contexts.
Original abstract
Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present Guess-Verify-Refine (GVR), a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step's Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1–2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz/RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and validate the design on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM. GVR achieves an average 1.88× single-operator speedup over the production radix-select kernel, with up to 2.42× per layer per step, while preserving bit-exact Top-K outputs. In a controlled TEP8 min-latency deployment, it improves end-to-end TPOT by up to 7.52% at 100K context, with larger gains at longer contexts and smaller but still positive gains under speculative decoding. While implemented and validated in the current TensorRT-LLM DSA stack on Blackwell, the same principle may extend to sparse-attention decoders whose decode-phase Top-K exhibits temporal stability.