arXiv: 2604.22050 · PDF

Authors: Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

Affiliations: Openchip & Software Technologies

Primary category: cs.LG · all: cs.CL, cs.LG

Matched keywords: llm, inference, serving, attention, transformer, throughput, latency


TL;DR

LayerBoost is a layer-aware attention reduction method that applies a different attention strategy (softmax, linear sliding-window, or removal) to each layer based on a sensitivity analysis, followed by a lightweight distillation-based healing phase using just 10M tokens. It improves throughput by up to 68% at high concurrency while preserving quality.

Key Ideas

  • Uniform attention linearization across all layers causes large quality drops or requires heavy retraining.
  • Transformer layers have heterogeneous sensitivity — some tolerate aggressive attention reduction, others do not.
  • Three-tier strategy: keep softmax in highly sensitive layers, use linear sliding-window attention in moderately sensitive layers, remove attention entirely in low-sensitivity layers (a minimal assignment sketch follows this list).
  • A lightweight distillation-based healing phase (only 10M tokens) recovers post-surgery quality.
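
To make the three-tier idea concrete, here is a minimal, purely illustrative sketch of how layers could be bucketed once per-layer sensitivity scores are available. The AttnTier names and the fixed top/bottom fractions are assumptions for illustration; the paper derives its assignment from its own sensitivity analysis rather than a hard-coded split.

```python
# Illustrative three-tier assignment from per-layer sensitivity scores.
# The tier names and the 25%/25% split are assumptions, not the paper's criteria.
from enum import Enum


class AttnTier(Enum):
    SOFTMAX = "softmax"        # highly sensitive: keep standard softmax attention
    LINEAR_SWA = "linear_swa"  # moderately sensitive: linear sliding-window attention
    NONE = "none"              # low sensitivity: remove attention entirely


def assign_tiers(sensitivity: dict[int, float],
                 hi_frac: float = 0.25,
                 lo_frac: float = 0.25) -> dict[int, AttnTier]:
    """Rank layers by sensitivity (benchmark drop when perturbed) and bucket them."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)  # most sensitive first
    n_hi = int(len(ranked) * hi_frac)
    n_lo = int(len(ranked) * lo_frac)
    tiers = {}
    for rank, layer in enumerate(ranked):
        if rank < n_hi:
            tiers[layer] = AttnTier.SOFTMAX
        elif rank >= len(ranked) - n_lo:
            tiers[layer] = AttnTier.NONE
        else:
            tiers[layer] = AttnTier.LINEAR_SWA
    return tiers


if __name__ == "__main__":
    # Toy example: 8 layers with made-up sensitivity scores.
    scores = dict(enumerate([0.9, 0.1, 0.4, 0.05, 0.6, 0.3, 0.02, 0.7]))
    print(assign_tiers(scores))
```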

Approach

  1. Run a systematic per-layer sensitivity analysis on a pretrained model, measuring benchmark degradation when each layer’s attention is altered/removed.
  2. Rank layers and assign one of three treatments — retain softmax, swap to linear sliding-window attention, or drop attention entirely.
  3. Apply the resulting hybrid architecture, then run a short distillation “healing” pass with 10M additional training tokens to restore quality (the sensitivity probe and this healing loop are sketched below).
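
A minimal sketch of the sensitivity probe (step 1) and the distillation healing step (step 3), assuming the model exposes its decoder blocks with a swappable `attn` sub-module and that healing distills the frozen original model's output logits into the hybrid student via a KL loss. The `layer.attn` attribute, the `make_cheap_attn` and `evaluate` callables, and the KL-on-logits objective are all our assumptions; the abstract does not describe these interfaces.

```python
# Sketch of the per-layer sensitivity probe and the distillation "healing" step.
# `layer.attn`, `make_cheap_attn`, `evaluate`, and the KL-on-logits objective are
# assumptions for illustration; the paper's actual interfaces may differ.
import torch
import torch.nn.functional as F


def probe_layer_sensitivity(model, layers, make_cheap_attn, evaluate):
    """Swap each layer's attention for a cheap variant, one at a time, and
    record how much a small benchmark score drops (larger drop = more sensitive)."""
    baseline = evaluate(model)
    sensitivity = {}
    for i, layer in enumerate(layers):
        original_attn = layer.attn
        layer.attn = make_cheap_attn(layer)      # e.g. linear sliding-window or identity
        sensitivity[i] = baseline - evaluate(model)
        layer.attn = original_attn               # restore before probing the next layer
    return sensitivity


def healing_step(student, teacher, input_ids, optimizer, temperature: float = 2.0):
    """One healing step: align the hybrid student's next-token distribution with
    the frozen original (teacher) model on the same token batch."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids)      # (batch, seq, vocab)
    student_logits = student(input_ids)

    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                         # standard soft-target scaling

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At typical batch and sequence sizes (e.g. 16 sequences of 2,048 tokens per step), 10M tokens corresponds to only a few hundred optimizer steps, which is what makes the healing phase cheap relative to the multi-billion-token retraining of prior linearization work.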

Experiments

The abstract is thin on specifics: it reports comparisons against base models and state-of-the-art attention linearization methods on standard benchmarks, measured via inference latency, throughput (TPS), GPU memory, and accuracy. The figures indicate evaluation at concurrency levels of 50/100/200, plus decoding-latency and memory profiling on an A10 24 GB GPU with batch size 16.
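
For context, a "TPS at concurrency 50/100/200" figure could be measured with a harness along these lines. The endpoint URL, request payload, and response field below are purely hypothetical, since the abstract does not describe the benchmarking setup.

```python
# Hypothetical concurrency/throughput harness; the URL, request payload, and
# "generated_tokens" response field are assumptions, not the paper's setup.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/generate"                  # hypothetical serving endpoint
PAYLOAD = {"prompt": "Summarize this paper.", "max_new_tokens": 128}


async def one_request(session: aiohttp.ClientSession) -> int:
    async with session.post(URL, json=PAYLOAD) as resp:
        body = await resp.json()
        return int(body.get("generated_tokens", 0))     # assumed response field


async def tokens_per_second(concurrency: int) -> float:
    """Fire `concurrency` simultaneous requests and report aggregate tokens/s."""
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        counts = await asyncio.gather(*(one_request(session) for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


if __name__ == "__main__":
    for level in (50, 100, 200):                        # concurrency levels from the figures
        print(f"concurrency={level}: {asyncio.run(tokens_per_second(level)):.1f} tok/s")
```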

Results

  • Up to 68% throughput improvement at high concurrency.
  • Matches base model performance on several benchmarks; minor degradation on others.
  • Significantly outperforms state-of-the-art attention linearization baselines in the efficiency–quality trade-off.
  • Reduced decoding latency and GPU memory footprint across varying decoding lengths.


Why It Matters

For practitioners serving LLMs under high concurrency or on constrained hardware, LayerBoost offers sizable throughput and memory wins without the multi-billion-token retraining cost of prior linearization work — a practical retrofit path for existing pretrained models.

Connections to Prior Work

Builds on linear attention (Linformer, Performer), sliding-window attention (Longformer, Mistral), hybrid architectures (Jamba, Zamba), attention linearization/distillation methods (SUPRA, MambaInLlama), and layer-importance / pruning literature for transformers.

Open Questions

  • Which exact base models and benchmarks were evaluated, and at what scales?
  • How robust is the sensitivity ranking across tasks, domains, and longer contexts?
  • Can 10M-token healing scale to trillion-parameter models or very long-context regimes?
  • How does the hybrid scheme interact with KV-cache compression and speculative decoding?

Original abstract

Transformers are mostly relying on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.