Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

arXiv: 2604.22345

Authors: Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu

Affiliations: McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce

Primary category: cs.CL · all: cs.CL

Matched keywords: large language model, llm, rag, inference, serving, attention, transformer

TL;DR

This paper identifies sparse “Preference Heads” in LLMs—attention heads causally responsible for user-specific stylistic and topical preferences—and proposes Differential Preference Steering (DPS), a training-free inference-time framework that amplifies their contribution for interpretable, controllable personalization.

Key Ideas

Hypothesis: personalization is concentrated in a sparse subset of attention heads (“Preference Heads”).
Preference Contribution Score (PCS) quantifies each head’s causal impact on user-aligned outputs via causal masking.
Differential Preference Steering (DPS) contrasts logits with and without Preference Heads to amplify personalized continuations.
DPS is training-free, low-overhead, and offers a mechanistic explanation of where personalization emerges in transformers.
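
A toy sketch of the masking idea behind PCS, not the paper's implementation. The abstract does not give the scoring formula, so this assumes PCS is the drop in log-probability of a user-aligned target token when a single head is zeroed out. The additive head model and the names log_prob_of_target, preference_contribution_scores, and top_k_heads are hypothetical.

```python
import numpy as np

def log_prob_of_target(head_outputs, mask, unembed, target_id):
    """Log-probability of the target token with heads combined under `mask`.

    head_outputs: (n_heads, d_model) per-head contributions (toy additive model).
    mask:         (n_heads,) with 1.0 for active heads and 0.0 for masked heads.
    unembed:      (vocab, d_model) projection to vocabulary logits.
    """
    residual = (mask[:, None] * head_outputs).sum(axis=0)
    logits = unembed @ residual
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return log_probs[target_id]

def preference_contribution_scores(head_outputs, unembed, target_id):
    """Assumed PCS: log-prob lost on the user-aligned token when a head is masked."""
    n_heads = head_outputs.shape[0]
    full = np.ones(n_heads)
    base = log_prob_of_target(head_outputs, full, unembed, target_id)
    scores = np.empty(n_heads)
    for h in range(n_heads):
        mask = full.copy()
        mask[h] = 0.0  # mask exactly one head
        scores[h] = base - log_prob_of_target(head_outputs, mask, unembed, target_id)
    return scores

def top_k_heads(scores, k):
    """Indices of the k highest-scoring heads, best first."""
    return np.argsort(scores)[::-1][:k]
```

In a real transformer the mask would be applied inside the attention layers and the score averaged over many user-aligned examples; the single-token version here only illustrates the measurement.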

Approach

DPS proceeds in two stages. First, it runs a causal masking analysis across all attention heads to compute each head’s PCS, that is, its measured causal impact on outputs aligned with a target user. Second, during decoding, it runs two forward passes, one with the Preference Heads active and one with them masked, and amplifies the logit difference between the two to bias sampling toward preference-aligned continuations. No parameter updates or per-user fine-tuning are required.
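
The decoding step can be sketched as follows. The abstract does not state the exact combination rule, so this assumes a contrastive-decoding-style form, steered = full + alpha * (full - masked), applied to log-probabilities. The function names and the alpha parameter are illustrative assumptions, not the authors' API.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def dps_steer(logits_full, logits_masked, alpha=1.0):
    """Amplify the gap between personalized and generic next-token scores.

    logits_full:   logits with Preference Heads active (personalized).
    logits_masked: logits with Preference Heads masked (generic).
    alpha:         assumed steering strength; alpha=0 leaves the model unchanged.
    """
    lp_full = log_softmax(logits_full)
    lp_masked = log_softmax(logits_masked)
    return lp_full + alpha * (lp_full - lp_masked)

def next_token_probs(steered):
    """Renormalize steered scores into a sampling distribution."""
    return np.exp(log_softmax(steered))
```

Under this assumed rule, tokens whose probability rises when the Preference Heads are active get boosted, tokens the heads do not affect stay roughly where they were, and each decoding step costs two forward passes instead of one.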

Experiments

Evaluation uses widely adopted personalization benchmarks (unspecified in the abstract) across multiple LLM backbones. Metrics cover personalization fidelity, content coherence, and computational overhead. The abstract names no baselines; it positions DPS against prompt engineering and fine-tuning on user data, both of which treat personalization as a black box.

Results

The abstract reports consistent gains in personalization fidelity while preserving coherence and keeping overhead low, though no specific numbers are disclosed.

Per-user PCS heatmaps confirm that high-PCS heads are sparse and unevenly scattered across layers, supporting the Preference Head hypothesis.

Pairwise Jaccard overlap between users’ top-K head sets is near zero, showing Preference Heads are largely user-specific and motivating cluster-aware head discovery.
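
For reference, the overlap statistic can be computed as below. This is a minimal sketch: heads are represented as (layer, head) tuples, and the user names and head sets in the usage note are made up for illustration, not taken from the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two head sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_jaccard(head_sets):
    """Average Jaccard overlap over all pairs of users' top-K head sets.

    head_sets: dict mapping a user id to an iterable of (layer, head) tuples.
    """
    pairs = list(combinations(head_sets.values(), 2))
    if not pairs:
        raise ValueError("need at least two users to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, with hypothetical sets {(3, 1), (10, 4), (17, 0)} and {(3, 1), (8, 2), (21, 5)}, one shared head out of five distinct heads gives an overlap of 0.2; values near zero across all pairs mean that users rarely share heads.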

Accuracy and F1 remain stable across a wide K range and saturate at moderate values, indicating personalization signal concentrates in a limited subset of heads.

Why It Matters

DPS gives practitioners a lightweight, training-free knob for personalization that also exposes which heads drive user-specific behavior—useful for auditing, debugging, and building controllable assistant stacks without per-user fine-tuning pipelines.

Connections to Prior Work

Builds on mechanistic interpretability (circuit/head attribution, induction heads), contrastive decoding and classifier-free guidance-style logit steering, and personalization via prompt tuning or RLHF/fine-tuning on user data.

Open Questions

Which benchmarks and LLMs were used, and what are the absolute numbers?
How stable are Preference Heads across domains, languages, and time?
Can head sets be shared across user clusters to reduce per-user calibration cost?
What are the risks of over-steering toward stereotyped or privacy-sensitive user traits? The summary does not address them.

Original abstract

Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.