arXiv: 2604.22345 · PDF
Authors: Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu
Primary category: cs.CL · all: cs.CL
Matched keywords: large language model, llm, rag, inference, serving, attention, transformer
TL;DR
The paper hypothesizes that LLM personalization is driven by a sparse set of “Preference Heads”: specific attention heads encoding user style and topic preferences. It introduces Differential Preference Steering (DPS), a training-free decoding method that identifies these heads via causal masking and amplifies their effect at inference.

Key Ideas
- Personalization in LLMs is localized in a sparse subset of attention heads (“Preference Heads”).
- A Preference Contribution Score (PCS) quantifies each head’s causal impact on user-aligned outputs.
- Contrastive decoding between passes “with” and “without” the Preference Heads amplifies preference-aligned logits (see the sketch after this list).
- Fully training-free and interpretable; no fine-tuning on user data.
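A minimal sketch of the contrastive amplification step. The abstract does not give the exact combination rule, so the DoLa/Contrastive-Decoding-style formula and the strength parameter `alpha` below are assumptions, not the paper's definition:

```python
import torch

def steer_logits(z_full: torch.Tensor, z_masked: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """Push the next-token logits further in the direction the Preference
    Heads already push them.

    z_full:   logits with all attention heads active (personalized)
    z_masked: logits with the Preference Heads ablated (generic)
    alpha:    assumed amplification strength; alpha = 0 recovers z_full
    """
    return z_full + alpha * (z_full - z_masked)
```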
Approach
DPS works in two stages. First, a causal masking analysis ablates attention heads one at a time and measures the shift in user-aligned outputs, yielding a PCS per head; the top-scoring heads are selected as Preference Heads. Second, at decoding time, the model runs twice per step, once normally and once with the Preference Heads masked, and the difference between the two logit distributions is amplified to steer generation toward personalized continuations.
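The abstract does not specify how PCS is computed, only that it measures each head's causal impact on user-aligned outputs. Below is a minimal sketch of one plausible instantiation, assuming GPT-2 via Hugging Face transformers, ablation by zeroing a head's slice of the activation entering the attention output projection, and PCS defined as the log-likelihood drop on a user-aligned continuation; the prompt, continuation, and top-k cutoff are illustrative, not from the paper:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
n_layer, n_head = model.config.n_layer, model.config.n_head
head_dim = model.config.n_embd // n_head

def mask_heads_hook(heads):
    """Zero the listed heads' slices of the activation entering attn.c_proj."""
    def hook(module, args):
        x = args[0].clone()  # (batch, seq, n_embd); heads are contiguous slices
        for h in heads:
            x[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (x,)
    return hook

@torch.no_grad()
def logprob(prompt, continuation, masked=None):
    """Mean log-prob of `continuation` given `prompt`; `masked` maps
    layer index -> head indices to ablate during the forward pass."""
    handles = [
        model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(
            mask_heads_hook(heads))
        for layer, heads in (masked or {}).items()
    ]
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = len(tok(prompt).input_ids)  # assumes a clean token boundary
    logits = model(ids).logits[0, :-1]     # logits predicting tokens 1..L-1
    targets = ids[0, 1:]
    lp = torch.log_softmax(logits, -1).gather(-1, targets[:, None])[:, 0]
    for h in handles:
        h.remove()
    return lp[n_prompt - 1:].mean().item()  # score only continuation tokens

# PCS(layer, head) = log-likelihood drop on a user-aligned continuation
# when that single head is ablated (illustrative prompt and continuation).
prompt = "User bio: prefers concise, formal prose.\nAssistant:"
aligned = " Certainly. Here is a brief, formal summary."
base = logprob(prompt, aligned)
pcs = {(l, h): base - logprob(prompt, aligned, masked={l: [h]})
       for l in range(n_layer) for h in range(n_head)}
preference_heads = sorted(pcs, key=pcs.get, reverse=True)[:8]  # top-k is a free choice
```

Stage 2 would reuse the same hook: at each decoding step, run one forward pass with all heads active and one with `preference_heads` masked, then combine the two logit vectors with a rule like `steer_logits` above.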

Experiments
The abstract mentions “widely used personalization benchmarks across multiple LLMs” but names no specific datasets, baselines, or metrics. The natural comparisons would be prompt-engineering and fine-tuning baselines evaluated on personalization fidelity and coherence, but the abstract does not confirm which were used.
Results
The abstract claims “consistent gains in personalization fidelity while preserving content coherence and low computational overhead,” but provides no concrete numbers, so the magnitude of improvement cannot be verified from the text alone.

Why It Matters
Offers a training-free, interpretable route to personalization, useful for deployments where per-user fine-tuning is infeasible or privacy-sensitive. For infrastructure practitioners, the contrastive two-pass decoding is cheap relative to fine-tuning but roughly doubles per-token inference compute. The work also contributes to mechanistic interpretability by localizing a behavioral property (personalization) to specific attention heads.
Connections to Prior Work
Builds on mechanistic interpretability work identifying functional attention heads (induction heads, retrieval heads, truthfulness heads). The decoding strategy is reminiscent of Contrastive Decoding and DoLa, and of activation steering / representation engineering. Contrasts with personalization via RLHF, PEFT, or prompt-based methods.
Open Questions
- How many heads qualify as Preference Heads, and how stable across models/users?
- Does amplification degrade factuality or induce sycophancy?
- How is user preference represented when computing PCS, and does it require labeled user data?
- Runtime overhead of dual-pass decoding at scale?
- Generalization to multi-user / conflicting-preference settings unexplored.
Original abstract
Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.