arXiv: 2604.22061 · PDF

Authors: Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong

Affiliations: Mayo Clinic, University of Tulsa

Primary category: cs.CL · all: cs.AI, cs.CL, cs.LG

Matched keywords: large language model, llm, retrieval, reasoning, serving, fine-tun


TL;DR

A lightweight patient-trial matching framework that uses retrieval-augmented generation to extract relevant EHR segments and LLMs to encode them, achieving performance comparable to end-to-end LLM pipelines at substantially lower compute cost.

Key Ideas

  • Decouples retrieval (finding clinically relevant EHR snippets) from representation (LLM encoding) for scalable patient-trial matching; a retrieval sketch follows this list.
  • Frozen LLMs suffice for structured clinical data; fine-tuning is essential for unstructured narratives.
  • Dimensionality reduction plus lightweight predictors replace expensive end-to-end LLM inference.
  • Validated across public benchmarks and a real-world Mayo Clinic multimodal dataset.
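The abstract does not specify the retriever, so as a minimal sketch of the first key idea, the snippet below scores EHR chunks against an eligibility criterion and keeps the top-k. The sentence-transformers encoder, word-based chunking, and k are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch of the retrieval stage: rank EHR chunks against one eligibility
# criterion and keep only the top-k segments. Encoder, chunk size, and k
# are illustrative assumptions, not the paper's settings.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any dual encoder would do

def retrieve_segments(ehr_text: str, criterion: str,
                      chunk_words: int = 100, k: int = 5) -> list[str]:
    words = ehr_text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Embed criterion and chunks; normalized vectors make the dot
    # product equal to cosine similarity.
    vecs = encoder.encode([criterion] + chunks, normalize_embeddings=True)
    scores = vecs[1:] @ vecs[0]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in sorted(top)]  # preserve document order
```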

Approach

The pipeline has three stages: (1) RAG retrieves eligibility-relevant segments from long EHRs, reducing input length; (2) an LLM (frozen for structured data, fine-tuned for unstructured narratives) encodes these segments into embeddings; (3) the embeddings are compressed via dimensionality reduction and fed into lightweight classifiers for trial-eligibility prediction. Separating retrieval from encoding keeps token counts small and the pipeline modular.
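A minimal sketch of stages (2)-(3), reusing a sentence-transformers model as a stand-in for the paper's frozen LLM encoder; mean pooling over segments, the PCA component count, and logistic regression as the lightweight predictor are all illustrative choices, not the authors' reported design.

```python
# Sketch of stages (2)-(3): encode retrieved segments with a frozen model,
# mean-pool into one patient-level vector, compress with PCA, and classify
# with a lightweight predictor. Pooling, n_components, and the classifier
# are stand-ins for whatever the paper actually uses.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # frozen: no gradient updates

def patient_vector(segments: list[str]) -> np.ndarray:
    # One embedding per retrieved segment, mean-pooled to a single vector.
    return encoder.encode(segments).mean(axis=0)

def fit_matcher(train_pairs):
    # train_pairs: list of (retrieved_segments, eligibility_label) tuples
    # for (patient, trial) pairs -- hypothetical data layout.
    X = np.stack([patient_vector(segs) for segs, _ in train_pairs])
    y = np.array([label for _, label in train_pairs])
    clf = make_pipeline(PCA(n_components=64), LogisticRegression(max_iter=1000))
    clf.fit(X, y)
    return clf
```

Because the encoder is frozen, embeddings can be precomputed once per patient-trial pair; only the cheap PCA-plus-classifier head is trained, which is where the claimed compute savings over end-to-end LLM inference come from.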

Experiments

Evaluated on four public benchmarks (n2c2, SIGIR, TREC 2021, TREC 2022) and the multimodal Mayo Clinic Patient-Trial Matching Dataset (MCPMD). Baselines implicitly include end-to-end LLM approaches and traditional ML methods. Metrics are not named in the abstract.

Results

The abstract reports qualitative findings only: retrieval-based selection preserves clinically meaningful signal while cutting compute; frozen LLMs yield strong representations for structured inputs; fine-tuning is required for unstructured narratives; and the lightweight pipeline matches end-to-end LLM performance at substantially lower cost. No specific numbers are provided.

Why It Matters

Shows that clinical LLM applications over long EHRs can avoid full-document inference by pairing RAG with compact encoders and classical predictors — a blueprint for cost-efficient deployment in hospital settings where compute and latency constraints rule out end-to-end LLM pipelines.

Connections to Prior Work

Builds on retrieval-augmented generation (Lewis et al.), LLM embedding probing, and prior patient-trial matching systems like TrialGPT and Criteria2Query. The frozen-vs-fine-tuned split echoes findings from clinical NLP benchmarks (e.g., BioBERT, ClinicalBERT) on structured vs unstructured data.

Open Questions

  • Which concrete metrics and deltas justify “comparable” performance?
  • How does retrieval quality degrade on noisy or multilingual EHRs?
  • Is the lightweight predictor robust to distribution shift across hospitals beyond Mayo?
  • How are eligibility criteria themselves parsed — is that also retrieval-mediated?
  • What are failure modes when criteria span multiple disjoint EHR sections?

Original abstract

Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.