arXiv: 2604.22061 · PDF
Authors: Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong
Affiliations: Mayo Clinic, University of Tulsa
Primary category: cs.CL · all: cs.AI, cs.CL, cs.LG
Matched keywords: large language model, llm, retrieval, reasoning, serving, fine-tun
TL;DR
A lightweight patient-trial matching framework that uses retrieval-augmented generation (RAG) to select clinically relevant EHR segments and LLMs to encode them, then applies dimensionality reduction plus lightweight predictors — matching end-to-end LLM performance at far lower cost.
Key Ideas
- Decouple retrieval (RAG over long EHRs) from representation (LLM encoding) from prediction (lightweight classifier).
- Frozen LLMs suffice for structured clinical data; fine-tuning is necessary for unstructured narratives.
- Retrieval-based segment selection cuts computational load while preserving clinically meaningful signal.
- Scalable pipeline generalizes across public benchmarks and a real-world multimodal Mayo Clinic dataset.
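The retrieval-based segment selection idea above can be sketched minimally. This is a toy illustration, not the paper's method: a bag-of-words embedding stands in for the actual retriever, and the segment texts, vocabulary, and scoring are all invented for the example (the abstract does not specify the retriever or chunking strategy):

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding; a stand-in for a learned retriever encoder."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_segments(segments, criterion, vocab, k=2):
    """Rank EHR segments by cosine similarity to an eligibility criterion,
    keeping only the top-k to shrink the input passed downstream."""
    q = embed(criterion, vocab)
    scores = [float(embed(s, vocab) @ q) for s in segments]
    order = np.argsort(scores)[::-1][:k]
    return [segments[i] for i in order]

vocab = {w: i for i, w in enumerate(
    "metastatic breast cancer stage ecog prior chemotherapy hypertension".split())}
segments = [
    "patient has metastatic breast cancer stage iv",
    "history of hypertension controlled with medication",
    "received prior chemotherapy with anthracyclines",
]
top = retrieve_segments(
    segments, "metastatic breast cancer prior chemotherapy", vocab, k=2)
```

Only the two criterion-relevant segments survive selection; the hypertension note is dropped before any expensive LLM encoding happens, which is where the computational savings come from.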
Approach
A three-stage pipeline: (1) RAG identifies eligibility-relevant snippets from long, heterogeneous EHRs, shrinking input length; (2) an LLM encodes the selected segments into dense representations (frozen for structured data, fine-tuned for unstructured narratives); (3) dimensionality reduction compresses embeddings, which feed a lightweight predictor for the final patient-trial matching classification.
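Stages (2) and (3) can be sketched as follows. This is a hedged illustration under assumptions: random vectors stand in for the LLM segment embeddings, the label rule is synthetic, and PCA plus logistic regression are plausible but unconfirmed choices for the unspecified dimensionality reduction and lightweight predictor:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Random vectors stand in for LLM embeddings of retrieved EHR segments
# (e.g. 768-d); the inflated first dimension plays the role of a
# clinically informative direction in an otherwise noisy embedding.
n, d = 200, 768
X = rng.normal(size=(n, d))
X[:, 0] *= 5.0
y = (X[:, 0] > 0).astype(int)  # synthetic eligible / not-eligible labels

# Stage (3): compress the embeddings, then fit a lightweight predictor
# for the final patient-trial matching classification.
pca = PCA(n_components=32, random_state=0)
X_reduced = pca.fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_reduced, y)
acc = clf.score(X_reduced, y)  # training accuracy of the light head
```

The design point is that once the heavy LLM has produced embeddings, everything downstream (32-dimensional PCA space, a linear classifier) is cheap enough to retrain per trial or per cohort.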
Experiments
Evaluated on the public benchmarks n2c2, SIGIR, and TREC 2021/2022, plus MCPMD — a real-world multimodal dataset from Mayo Clinic. The abstract implies baselines of full-document LLM pipelines and traditional ML methods for unstructured clinical text; metrics are not specified in the abstract.
Results
The lightweight pipeline reaches performance comparable to end-to-end LLM approaches at substantially lower computational cost. Retrieval-based selection preserves clinical signal while reducing burden. Frozen LLMs perform strongly on structured inputs; fine-tuning materially helps on unstructured narratives. Exact numbers not reported in the abstract.
Why It Matters
For clinical AI infrastructure, this shows you don’t need to push entire EHRs through giant LLMs: a RAG-first, encode-then-classify recipe can scale trial matching to real hospital workloads. It also signals when frozen vs. fine-tuned LLM embeddings are worth the compute — useful guidance for other long-document medical reasoning tasks.
Connections to Prior Work
Builds on retrieval-augmented generation (Lewis et al.), LLM-as-encoder paradigms (Sentence-BERT, E5-style embedding models), and prior patient-trial matching work on n2c2/TREC Clinical Trials tracks (e.g., COMPOSE, TrialGPT). Aligns with the broader “frozen foundation model + lightweight head” trend in clinical NLP.
Open Questions
- Which retriever and chunking strategy drive the gains, and how sensitive is the pipeline to them?
- What are the concrete metric deltas vs. end-to-end LLM baselines? None are disclosed in the abstract.
- How does it handle criteria requiring temporal reasoning or cross-document aggregation that retrieval may fragment?
- Does performance transfer across institutions beyond Mayo, and what are the fairness / subgroup implications?
Figures
Figure 1: Page 2 (rendered)

Figure 2: Page 3 (rendered)

Figure 3: Page 4 (rendered)

Original abstract
Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.