Aligning Dense Retrievers with LLM Utility via Distillation

Authors: Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu

Affiliations: Layer 6 AI, Dalhousie University

Primary category: cs.IR · all: cs.AI, cs.IR, cs.LG

Matched keywords: llm, retrieval, rag, inference, serving

TL;DR

Utility-Aligned Embeddings (UAE) distills an LLM’s perplexity-reduction utility distribution into a bi-encoder via a Utility-Modulated InfoNCE objective, delivering retrieval quality competitive with LLM re-ranking at embedding-search speed.

Key Ideas

Reframe dense retrieval as distribution matching against an LLM-derived utility signal, not just semantic similarity.
Utility-Modulated InfoNCE injects graded utility (from perplexity reduction) into contrastive training.
Pushes LLM re-ranking quality into the embedding space, so no test-time LLM inference is needed.

Approach

Train a bi-encoder to imitate a teacher utility distribution computed from perplexity reduction when candidate passages are fed to an LLM. The contrastive InfoNCE loss is modulated by these graded utility scores, producing embeddings whose similarity rankings mirror LLM utility without requiring re-ranking at query time.
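The training signal described above can be illustrated with a minimal sketch. The summary does not give the exact form of the Utility-Modulated InfoNCE loss, so everything below is an assumption: utility is taken to be the drop in log-perplexity of the answer when a passage is added to the prompt, the teacher distribution is a softmax over those utilities, and the loss is a soft-target cross-entropy against the student's softmax over similarities. The function names and temperatures are hypothetical.

```python
import math

def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]

def utility_distribution(ppl_without, ppl_with, temperature=1.0):
    """Teacher distribution over candidate passages.

    Assumed definition (not confirmed by the source): utility is the
    reduction in log-perplexity of the answer when the passage is added.
    """
    utilities = [math.log(ppl_without) - math.log(p) for p in ppl_with]
    return [math.exp(v) for v in log_softmax(utilities, temperature)]

def soft_target_infonce(similarities, teacher_probs, temperature=0.05):
    """Cross-entropy between teacher probabilities and the student's
    softmax over query-passage similarities.

    With a one-hot teacher this reduces to the standard InfoNCE loss;
    graded teacher probabilities spread the target over several passages.
    """
    log_student = log_softmax(similarities, temperature)
    return -sum(t * ls for t, ls in zip(teacher_probs, log_student))

# Toy example: the first passage lowers perplexity most, the third raises it.
teacher = utility_distribution(20.0, [5.0, 18.0, 25.0])
aligned = soft_target_infonce([0.9, 0.5, 0.1], teacher)
reversed_order = soft_target_infonce([0.1, 0.5, 0.9], teacher)
print(teacher, aligned, reversed_order)
```

In this toy setup the loss is lower when the student's similarity ranking agrees with the teacher's utility ranking, which is the behaviour a distribution-matching objective is meant to reward.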

Experiments

Benchmark: QASPER. Baseline: BGE-Base (strong semantic bi-encoder), plus comparison against efficient LLM re-ranking methods. Metrics: Recall@1, MAP, Token F1, and inference-time speedup.
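For readers less familiar with these metrics, here is a minimal sketch of their common textbook definitions. The summary does not say which variants the paper uses, so the choices below are assumptions: Recall@k is the fraction of relevant passages found in the top k, MAP is the mean over queries of average precision, and Token F1 uses lowercased whitespace tokens in the SQuAD style.

```python
from collections import Counter

def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of relevant passages that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank holding a relevant passage."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall (assumed tokenization)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Recall@1 and MAP score the ranking produced by the retriever, while Token F1 scores the answer generated from the retrieved context.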

Results

On QASPER, UAE improves Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over BGE-Base. It is over 180× faster than efficient LLM re-ranking methods while retaining competitive quality, which suggests that the distillation preserves much of the utility signal.

Why It Matters

Lets RAG pipelines approach re-ranker-level context quality at bi-encoder cost: no extra LLM call per query, just a drop-in embedding model. For agent/LLM infra, this shrinks latency and GPU budget for retrieval stacks while raising answer quality.
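To make the cost argument concrete, the sketch below shows what query-time retrieval looks like with a bi-encoder: passage embeddings are precomputed, and each query needs only one embedding plus a similarity search. The toy vectors, the `top_k` helper, and the use of a plain dot product are illustrative assumptions; a real deployment would typically use an approximate nearest-neighbour index.

```python
import heapq

def top_k(query_vec, passage_vecs, k=3):
    """Return the ids of the k passages with the highest dot-product score.

    passage_vecs maps passage id -> precomputed embedding, so no LLM is
    called here; the only per-query work is scoring and selection.
    """
    scored = (
        (sum(q * p for q, p in zip(query_vec, vec)), pid)
        for pid, vec in passage_vecs.items()
    )
    return [pid for _, pid in heapq.nlargest(k, scored)]

# Hypothetical 2-d embeddings standing in for real model outputs.
index = {"p1": [1.0, 0.0], "p2": [0.6, 0.8], "p3": [0.0, 1.0]}
results = top_k([1.0, 0.0], index, k=2)
print(results)
```

An LLM re-ranker would add one or more LLM forward passes per candidate passage on top of this step, which is the cost the distilled embedding model is meant to avoid.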

Connections to Prior Work

Dense retrieval / bi-encoders: DPR, Contriever, BGE.
Utility-based / LLM re-ranking: UPR, RankGPT, perplexity-based passage scoring.
Distillation into retrievers: RocketQA, Distill-DPR, cross-encoder→bi-encoder distillation.
Contrastive objectives: InfoNCE and its many extensions.

Open Questions

Does UAE transfer beyond QASPER (open-domain QA, multi-hop, code, enterprise RAG)?
How sensitive is UAE to the teacher LLM’s size and to perplexity noise: does a weak teacher degrade the retriever or help it?
How does it compose with downstream cross-encoder re-rankers: are the two complementary or redundant?
How robust is it to domain shift and to long-context corpora, where perplexity signals get noisier?
What does it cost to compute per-(query, passage) utility scores at scale during training?

Original abstract

Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than the efficient LLM re-ranking methods preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.