arXiv: 2604.22345 · PDF

Authors: Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu

Affiliations: McGill University, Mila - Quebec AI Institute, MBZUAI, University of Montreal, Salesforce

Primary category: cs.CL · all: cs.CL

Matched keywords: large language model, llm, rag, inference, serving, attention, transformer


TL;DR

The paper proposes Differential Preference Steering (DPS), a training-free mechanistic-interpretability framework that identifies sparse “Preference Heads” (attention heads that causally encode user-specific style and topic) and contrasts the model’s logits with and without those heads at decoding time, yielding interpretable personalization in LLMs.

Key Ideas

  • Hypothesizes a sparse set of Preference Heads inside transformer attention that causally drive user-specific stylistic and topical preferences.
  • Introduces a Preference Contribution Score (PCS) via causal masking to quantify each head’s impact on user-aligned outputs.
  • Proposes Differential Preference Steering (DPS): contrast logits with vs. without Preference Heads to amplify personalized continuations at inference, with no fine-tuning (a hedged formalization of both quantities follows this list).
  • Frames personalization as a mechanistic, interpretable phenomenon rather than a black-box prompt/FT behavior.
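
One plausible formalization of the two quantities above — a sketch only, since the abstract gives no explicit equations; the exact forms, the notation, and the strength parameter α are assumptions:

```latex
% Assumed form of PCS: drop in user-aligned log-likelihood when head h is ablated,
% where x is the user prompt/profile, y_u the user-aligned reference output,
% and \theta \setminus h the model with head h causally masked.
\mathrm{PCS}(h) = \log p_{\theta}(y_u \mid x) - \log p_{\theta \setminus h}(y_u \mid x)

% Assumed form of the DPS contrast at decoding step t: amplify the gap between
% personalized (full-model) and generic (Preference-Heads-masked) logits.
\tilde{z}_t = z_t^{\mathrm{full}} + \alpha \, \bigl( z_t^{\mathrm{full}} - z_t^{\mathrm{masked}} \bigr)
```

Under this reading, α = 0 recovers ordinary decoding, and larger α pushes generation further toward tokens that the Preference Heads promote.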


Approach

DPS has two stages. (1) Head discovery: for each attention head, mask it and measure the drop in user-aligned output likelihood; this drop is the head’s PCS, and the heads with the highest PCS are labeled that user’s Preference Heads. (2) Decoding: run two forward passes, the full model and the model with the Preference Heads masked, and take a contrastive combination of their logits that up-weights tokens whose probability depends on those heads, yielding stronger preference-aligned continuations while leaving the rest of generation unchanged. The method is training-free and adds only one extra (head-masked) forward pass per decoding step.
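
A minimal, runnable sketch of both stages on synthetic logits, assuming the contrastive form above; the function names, the head-indexing scheme, and the hyperparameters alpha and k are illustrative assumptions, not the paper’s API:

```python
import torch

# Stage 1 (assumed form): a head's PCS is the drop in user-aligned
# log-likelihood when that head is causally masked.
def preference_contribution_score(logp_full: float, logp_masked: float) -> float:
    return logp_full - logp_masked

# Rank heads by PCS and keep the top K as this user's Preference Heads.
def select_preference_heads(pcs: dict, k: int) -> list:
    return sorted(pcs, key=pcs.get, reverse=True)[:k]

# Stage 2 (assumed form): contrast full vs. heads-masked logits at each step.
def dps_logits(z_full: torch.Tensor, z_masked: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    return z_full + alpha * (z_full - z_masked)

# Toy demo: random numbers stand in for the two forward passes of a real LLM.
torch.manual_seed(0)
pcs = {(layer, head): torch.randn(()).item()
       for layer in range(2) for head in range(4)}
print("Preference Heads:", select_preference_heads(pcs, k=3))

z_full, z_masked = torch.randn(8), torch.randn(8)  # vocab size 8 for the demo
steered = dps_logits(z_full, z_masked, alpha=1.5)
print("next token id:", steered.argmax().item())
```

In a real setup the two logit vectors would come from the same model with and without the selected heads zeroed (e.g., via hooks on the attention outputs), so the extra cost is one additional forward pass per decoding step.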

Experiments

Evaluated on widely used personalization benchmarks across multiple LLMs (the abstract names neither the datasets nor the model families). Metrics cover personalization fidelity, content coherence, and computational overhead. Baselines are implied to be prompt-engineering- and fine-tuning-based personalization approaches, though the abstract does not enumerate them.

Results

The abstract reports “consistent gains” in personalization fidelity while preserving coherence at low overhead; no specific numbers are given. Analyses also show that Preference Heads are sparse and largely user-specific (limited cross-user overlap), and that performance is stable across a wide range of selected-head counts K.


Why It Matters

Gives practitioners a plug-in, training-free knob for controllable personalization and a mechanistic lens on where user preference lives inside a transformer — useful for auditing, editing, and safely toggling personalization in deployed agents.

Connections to Prior Work

Builds on circuit-level mechanistic interpretability (induction heads, function-specific attention heads), contrastive decoding and classifier-free-guidance-style logit steering, activation patching and causal mediation analysis, and the broader LLM personalization literature based on prompt tuning or fine-tuning on user data.

Open Questions

Which benchmarks and baselines exactly, and what are the headline numbers? How stable are Preference Heads across sessions, domains, and model scales? Can they be adversarially hijacked, or can they leak private user signals? How does DPS compose with RLHF/DPO-style preference tuning, and does the contrastive decoding step hurt factuality?


Original abstract

Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.