arXiv: 2604.24715 · PDF
Authors: Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum
Affiliation: AMD
Primary category: cs.CL · All: cs.CL, cs.LG
Matched keywords: llm, reasoning, inference, serving, kv-cache, attention, transformer, post-train
Abstract
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32× through efficient post-training and reduces KV-cache memory by more than 90%, enabling up to 2M-token prefill and decoding in our vLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
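To make the "more than 90%" KV-cache reduction concrete, the following is a rough back-of-the-envelope sketch, not the authors' accounting. It assumes a Llama-3.2-1B-style baseline (16 layers, 8 KV heads, head dimension 64, bf16, as in the publicly released config) and a hypothetical hybrid in which only 4 of the 16 layers are MLA blocks that cache one compressed latent vector per token, while the linear blocks keep a fixed-size state that does not grow with context. The latent and RoPE widths (128 and 32) and the helper names are illustrative assumptions, not values taken from the abstract.

```python
# Back-of-the-envelope KV-cache sizing. Every hybrid-side number below is an
# illustrative assumption, not a value reported in the abstract.

def kv_bytes_per_token(num_caching_layers: int, elems_per_layer: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes appended to the cache for each token (bf16 = 2 bytes per element)."""
    return num_caching_layers * elems_per_layer * bytes_per_elem


def cache_gib(bytes_per_token: int, num_tokens: int) -> float:
    """Total cache size in GiB for a sequence of num_tokens tokens."""
    return bytes_per_token * num_tokens / 2**30


# Baseline (assumed Llama-3.2-1B-style config): 16 attention layers, each
# caching K and V for 8 KV heads of dimension 64.
baseline = kv_bytes_per_token(num_caching_layers=16, elems_per_layer=2 * 8 * 64)

# Hypothetical hybrid: 4 MLA layers, each caching a compressed latent plus a
# small decoupled RoPE key per token. Linear (Mamba2-style) layers are not
# counted because their recurrent state has constant size.
HYPOTHETICAL_LATENT_DIM = 128  # assumption for illustration
HYPOTHETICAL_ROPE_DIM = 32     # assumption for illustration
hybrid = kv_bytes_per_token(
    num_caching_layers=4,
    elems_per_layer=HYPOTHETICAL_LATENT_DIM + HYPOTHETICAL_ROPE_DIM,
)

footprint = hybrid / baseline  # fraction of the baseline cache that remains
reduction = 1.0 - footprint    # fraction of the baseline cache that is saved

if __name__ == "__main__":
    tokens = 64 * 1024
    print(f"baseline: {baseline} B/token, {cache_gib(baseline, tokens):.2f} GiB at 64K")
    print(f"hybrid:   {hybrid} B/token, {cache_gib(hybrid, tokens):.3f} GiB at 64K")
    print(f"footprint: {footprint:.1%}, reduction: {reduction:.1%}")
```

Under these assumptions the per-token cache scales with the number of layers that cache anything and with the width of what each layer caches, which is why replacing most attention layers with linear blocks and compressing the remaining ones can cut the cache by well over 90%. The figures say nothing about weights, activations, or the linear blocks' recurrent state, so they are not an estimate of total serving memory.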
Figures
Figure 1: Short-context math performance and average RULER accuracy across 8K, 16K, 32K, and 64K context lengths. HyLo models achieve competitive short-context performance while outperforming baselines on long-context benchmarks under a limited upcycling data budget.

Figure 2: Evaluation on a synthetic needle-in-a-haystack benchmark shows that our upcycled hybrid 4MLA12M2 model (with only a 3.9% KV-cache footprint) achieves performance comparable to Llama-3.2-1B and surpasses Zebra-Llama. Furthermore, fine-tuning at a 64K sequence length outperforms fine-tuning at 8K, showing the need for long-context fine-tuning.

Figure 3: Impact of training sequence length and position interpolation using YaRN. Applying the YaRN extension improves long-context performance at the cost of a slight degradation in short-context commonsense reasoning. Furthermore, training at longer context lengths preserves long-context abilities to a greater extent.

Figure 4: Impact of teacher size in long-context knowledge distillation. A larger teacher improves both short-context commonsense reasoning and long-context ability.

Figure 5: Comparison of TTFT (time to first token) and TPOT (time per output token) for 3B models built on the Llama-3.2-3B backbone, served with vLLM.

Figure 6: Overview of MLA initialization from a pretrained Transformer attention block.
