arXiv: 2604.22312

Authors: Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang

Primary category: cs.DC · all: cs.AR, cs.DC, cs.PF

Matched keywords: llm, rag, serving, speculative decoding, attention, latency


TL;DR

GVR is a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. By exploiting temporal correlation between consecutive decode steps, it delivers 1.88× average kernel speedup over radix-select while preserving bit-exact outputs.

Key Ideas

  • Top-K selection is a real latency bottleneck in long-context sparse-attention decoding, even after the indexer and attention kernels are optimized.
  • Consecutive decode steps’ Top-K sets are temporally correlated — the previous step predicts the next.
  • DeepSeek Sparse Attention (DSA) indexer scores have Toeplitz/RoPE structure that justifies the prediction signal.
  • A guess-verify-refine pipeline can reach exact Top-K in 1–2 global passes plus shared-memory finishing.
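The temporal-correlation premise in the second bullet can be illustrated with a toy simulation. This is not the DSA indexer: the Gaussian scores and the per-step drift scale are invented stand-ins, chosen only to show that a small step-to-step score change leaves most of the Top-K set intact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4096, 256                      # context length and Top-K budget (arbitrary)

def topk_set(scores, k):
    # Indices of the k largest scores; order is irrelevant for measuring overlap
    return set(np.argpartition(scores, -k)[-k:])

scores = rng.normal(size=n)           # stand-in for one decode step's indexer scores
prev = topk_set(scores, k)

# A small additive drift stands in for the change between consecutive decode steps
scores_next = scores + 0.05 * rng.normal(size=n)
overlap = len(prev & topk_set(scores_next, k)) / k
print(f"Top-K overlap across consecutive steps: {overlap:.2f}")
```

Under these assumptions the overlap stays high, which is exactly the signal the Guess stage feeds on: the previous step's Top-K (or its k-th score) is a strong prediction for the next step.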

Approach

GVR proceeds in three stages: (1) Guess — use the prior step’s Top-K plus pre-indexed statistics to predict a threshold; (2) Verify — narrow to a valid threshold via secant-style counting over 1–2 global passes, then collect candidates with a ballot-free collector; (3) Refine — finalize exact selection in shared memory. The design is grounded in the Toeplitz/RoPE structure of the DSA indexer and is integrated into TensorRT-LLM on Blackwell.
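The three stages can be sketched on the CPU with NumPy. This is a minimal analogue, not the TensorRT-LLM kernel: the GPU-specific parts (pre-indexed statistics, the ballot-free collector, shared-memory finishing) are collapsed into plain array operations, plain bisection stands in for the secant-style counting, and `gvr_topk` and its iteration cap are invented names and parameters.

```python
import numpy as np

def gvr_topk(scores, k, prev_kth=None):
    """Exact Top-K via guess -> verify (threshold counting) -> refine."""
    n = scores.size
    if k >= n:
        return np.argsort(scores)[::-1], float(scores.min())
    t_lo, t_hi = float(scores.min()), float(scores.max())
    # Guess: seed the threshold with the previous step's k-th score if available
    t = prev_kth if prev_kth is not None else 0.5 * (t_lo + t_hi)
    t = min(max(t, t_lo), t_hi)
    # Verify: counting passes narrow the threshold until it admits roughly k..2k
    # candidates (the paper reaches this window in 1-2 global passes; here we
    # simply bisect, which may take more iterations from a cold start)
    for _ in range(64):
        count = int((scores >= t).sum())
        if k <= count <= 2 * k:
            break
        if count < k:
            t_hi = t          # threshold too strict -> lower it
        else:
            t_lo = t          # too many candidates -> raise it
        t = 0.5 * (t_lo + t_hi)
    cand = np.flatnonzero(scores >= t)
    if cand.size < k:         # safety fallback preserves exactness
        cand = np.arange(n)
    # Refine: exact selection among the surviving candidates
    # (the shared-memory finishing step in the GPU kernel)
    order = cand[np.argsort(scores[cand])[::-1]]
    topk = order[:k]
    # Return indices plus the new k-th score, warm-starting the next step's guess
    return topk, float(scores[topk[-1]])

# Demo: exact agreement with a full sort
rng = np.random.default_rng(0)
s = rng.normal(size=4096)
idx, kth = gvr_topk(s, 256)
ref = set(np.argpartition(s, -256)[-256:])
```

Correctness of the refine step follows from the verify invariant: once at least k scores sit at or above the threshold, the global Top-K is necessarily a subset of the candidate set, so sorting the candidates yields a bit-exact result.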

Experiments

  • Workloads: real DeepSeek-V3.2 inference on NVIDIA Blackwell, via TensorRT-LLM DSA stack.
  • Baseline: the production radix-select Top-K kernel.
  • Deployment: TEP8 min-latency configuration, contexts up to 100K, with and without speculative decoding.
  • Metrics: per-operator speedup, end-to-end TPOT, and bit-exactness check.

Results

  • 1.88× average single-operator speedup, up to 2.42× per layer per step.
  • Up to 7.52% end-to-end TPOT improvement at 100K context under TEP8.
  • Gains grow with context length; smaller but still positive under speculative decoding.
  • Outputs remain bit-exact vs. radix-select.

Why It Matters

For long-context LLM serving, Top-K selection was the next bottleneck after the attention and indexer kernels were optimized. GVR removes that overhead without accuracy loss on Blackwell production stacks, and shows that decode-phase temporal locality can be exploited at the infrastructure level — useful for anyone tuning DSA-style sparse decoders.

Connections to Prior Work

Builds on DeepSeek Sparse Attention and its RoPE/Toeplitz indexer; extends long-context sparse-attention decoding lines (H2O, Quest, MInference). Contrasts with general GPU selection (radix-select, bitonic Top-K) by being data-aware. Complements TensorRT-LLM decode optimizations and speculative decoding.

Open Questions

  • Does temporal stability hold for non-DSA indexers or non-RoPE position encodings?
  • Behavior under bursty/multi-tenant serving where step-to-step correlation may break.
  • Prefill-phase and training-time applicability.
  • Sensitivity to K, batch size, and GPU architectures beyond Blackwell.
  • Interaction with more aggressive speculative or multi-token decoding schemes.

Figures

Figure 1 (extracted from PDF)

Figure 2 (extracted from PDF)

Figure 3 (extracted from PDF)


Original abstract

Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present Guess-Verify-Refine (GVR), a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step’s Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1–2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz / RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and validate the design on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM. GVR achieves an average 1.88x single-operator speedup over the production radix-select kernel, with up to 2.42x per layer per step, while preserving bit-exact Top-K outputs. In controlled TEP8 min-latency deployment, it improves end-to-end TPOT by up to 7.52% at 100K context, with larger gains at longer contexts and smaller but still positive gains under speculative decoding. While implemented and validated in the current TensorRT-LLM DSA stack on Blackwell, the same principle may extend to sparse-attention decoders whose decode-phase Top-K exhibits temporal stability.