arXiv: 2604.18170

Authors: Ziyang Liu

Primary category: cs.CL · all: cs.AI, cs.CL

Matched keywords: llm, rag, serving, kv cache, speculative decoding, fine-tun


TL;DR

Copy-as-Decode reframes LLM text/code editing as grammar-constrained decoding over two primitives (<copy> and <gen>), letting copy spans be filled via a single parallel-prefill forward instead of N autoregressive steps, yielding large theoretical speedups without end-to-end training.

Key Ideas

  • Most edit outputs are verbatim copies of the input, so regenerating them autoregressively is wasteful.
  • A two-primitive grammar (<copy lines="i-j"/>, <gen>...</gen>) with a token-level FSM guarantees syntactic validity; see the example after this list.
  • Copy spans reuse the speculative-decoding parallel-forward kernel, but with input tokens as the “draft” and grammar-enforced (not probabilistic) acceptance.
  • The paper gives a training-free upper-bound analysis that separates kernel speedup, the copy-coverage ceiling, and pipeline losslessness.
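
To make the grammar concrete, here is a hypothetical edit program (the tag syntax follows the paper's description; line numbers and content are invented for illustration):

    <copy lines="1-12"/>
    <gen>    return x + 1</gen>
    <copy lines="14-40"/>

Lines 1-12 and 14-40 of the input are reproduced verbatim, each via one parallel-prefill forward; only the single replacement line inside <gen> costs autoregressive decode steps.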

Approach

At decode time the model emits grammar tokens; a deterministic resolver expands each <copy> tag by issuing one parallel-prefill forward that updates the KV cache for the whole span, while <gen> content falls back to standard autoregressive decoding. A token-level FSM enforces legal token transitions. Both the line-level primitive and a finer token-level variant are analyzed.
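
A minimal sketch of that loop, assuming a hypothetical model wrapper with prefill(ids) (one parallel forward appending a whole span to the KV cache) and step() (one grammar-constrained autoregressive step); the FSM and tokenizer details are elided:

    # Sketch only: `model.prefill(ids)` and `model.step()` are assumed
    # interfaces, not a real library API. `step()` returns a parsed
    # grammar event rather than a raw token, for brevity.

    def copy_as_decode(model, input_line_ids, max_steps=4096):
        """input_line_ids: list of per-line token-id lists for the input."""
        output_ids = []
        for _ in range(max_steps):
            ev = model.step()                    # FSM-constrained decode step
            if ev.kind == "copy":                # saw <copy lines="i-j"/>
                span = [t for line in input_line_ids[ev.i - 1 : ev.j]
                          for t in line]
                model.prefill(span)              # ONE forward for the span,
                output_ids.extend(span)          # not len(span) decode steps
            elif ev.kind == "gen":               # token inside <gen>...</gen>
                output_ids.append(ev.token_id)   # ordinary autoregression
            else:                                # end of program
                break
        return output_ids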

Experiments

  • Models: Qwen2.5-1.5B / 7B, Qwen2.5-Coder-1.5B; A100 80GB bf16.
  • Benchmarks: ProbeEdit, HumanEvalPack-Fix (Python, JS); 482 oracle cases.
  • Measures: kernel latency vs. autoregressive decoding, gold-token coverage under the copy primitive, round-trip exact match (EM) under oracle and perturbed span selection, and a small fine-tuning pilot (a text-level round-trip sketch follows this list).
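
The round-trip EM check is easy to picture with a purely text-level resolver. This is our simplification (the regex parsing and helper names are ours), not the paper's implementation:

    import re

    # Expand <copy lines="i-j"/> against the input and splice <gen> payloads
    # in verbatim. An oracle program is lossless (EM = 1) iff resolving it
    # reproduces the gold output exactly.
    TAG = re.compile(r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>', re.DOTALL)

    def resolve(program: str, input_text: str) -> str:
        lines = input_text.splitlines()
        parts = []
        for m in TAG.finditer(program):
            if m.group(3) is not None:            # <gen> payload
                parts.append(m.group(3))
            else:                                 # 1-indexed, inclusive copy
                i, j = int(m.group(1)), int(m.group(2))
                parts.append("\n".join(lines[i - 1 : j]))
        return "\n".join(parts)

    def exact_match(program: str, input_text: str, gold: str) -> bool:
        return resolve(program, input_text) == gold

Perturbing i or j by one shifts an entire copied span, which is why the off-by-one study below collapses EM so sharply.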

Results

  • Kernel: 6.8×–303× speedup when copying N ∈ [8, 512] tokens via parallel prefill.
  • Coverage: 74–98% of gold tokens reachable at line level, 91–99% at token level.
  • Closed-form wall-clock bounds: 29.0× / 3.4× / 4.2× per corpus and 13.0× pooled at line level, with token-level floors of 4.5×–6.5× (a simplified form of this composition is sketched after this list).
  • Oracle programs round-trip losslessly on all 482 cases; off-by-one span noise collapses pooled EM from 100% to 15.48%.
  • A fine-tuning pilot lifts HEvalFix-Py EM from 0/33 (untrained) to 12–17%, a learnability signal rather than a deployed selector.
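
The shape of the closed-form bound can be reproduced with a simplified Amdahl-style composition. This assumes uniform per-token cost and a single effective kernel speedup s, where the paper instead weights by each corpus's span histogram, so treat it as illustrative only:

    # c: fraction of output tokens served by <copy>; s: effective per-token
    # speedup of the parallel-prefill kernel on copied spans. Generated
    # tokens still cost 1x. (Our approximation, not the paper's formula.)
    def wall_clock_bound(c: float, s: float) -> float:
        return 1.0 / ((1.0 - c) + c / s)

    # 95% copy coverage with an effective 50x copy kernel gives ~14.5x,
    # the same order as the reported 13.0x pooled line-level bound.
    print(round(wall_clock_bound(0.95, 50.0), 1))  # 14.5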

Why It Matters

Edit-heavy workloads (code assistants, refactoring, document rewriting) dominate real LLM serving cost. This work identifies copying as the bottleneck and provides a serving-layer primitive that composes with existing speculative-decoding kernels, pointing to substantial latency wins without retraining the base model.

Connections to Prior Work

Related threads: speculative decoding (Leviathan et al., Medusa), constrained/grammar-guided decoding (Outlines, XGrammar), retrieval/copy mechanisms (CopyNet, pointer networks), and edit-as-diff code models (CodeEditor, InCoder). The distinctive move is using the input itself as a deterministic draft, with program-level rather than probabilistic acceptance.

Open Questions

  • Can a policy reliably pick correct spans end-to-end? Off-by-one fragility is severe.
  • Batched-serving and multi-file edits remain unimplemented.
  • Interaction with standard speculative decoding and KV quantization is unexplored.
  • Training signal beyond the tiny pilot — data, loss, and scaling behavior — is open.



Original abstract

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy> references an input line range, <gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps, sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$–$303\times$ faster than autoregressive decoding ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$–$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram, this yields closed-form wall-clock bounds of $29.0\times$ / $3.4\times$ / $4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$–$99\%$ coverage with $4.5\times$–$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$–$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.