Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

arXiv: 2604.20022 · PDF

Authors: Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Primary category: cs.LG · all: cs.AI, cs.CL, cs.LG

Matched keywords: large language model, llm, agent, rag, reasoning, inference

TL;DR

BMBE splits medical dialogue into an LLM “sensor” that parses utterances and a deterministic Bayesian engine that handles all diagnostic inference, yielding calibrated, private, and robust diagnosis that beats frontier standalone LLMs at a fraction of the cost.

Key Ideas

Autonomous LLM diagnosticians conflate natural-language communication with probabilistic reasoning; the authors argue this is an architectural flaw, not an engineering shortcoming.
Strict modular separation: LLM only parses/verbalises; Bayesian engine owns all inference.
Patient data never enters the LLM → private by construction.
Swappable statistical backend per population, no retraining required.
Delivers calibrated selective diagnosis with tunable accuracy-coverage tradeoff.
Claims a “statistical separation gap”: cheap-sensor + engine > frontier standalone LLM.
Robust to adversarial/atypical patient communication styles.
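
The "swappable backend" idea can be illustrated with a minimal sketch. The abstract does not describe the engine's internals, so everything here is an assumption made for illustration: the `Backend` class, the naive-Bayes independence assumption, the disease and finding names, and all probabilities are made up. The point is only that the same structured evidence and the same inference code give different posteriors once the per-population statistics are replaced, with no model retraining involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical per-population statistics (not the paper's data model)."""
    prior: dict[str, float]                   # P(disease)
    likelihood: dict[str, dict[str, float]]   # P(finding present | disease)


def posterior(backend: Backend, evidence: dict[str, bool]) -> dict[str, float]:
    """Naive-Bayes update: findings are ASSUMED independent given the disease."""
    scores = {}
    for disease, p in backend.prior.items():
        for finding, present in evidence.items():
            q = backend.likelihood[disease][finding]
            p *= q if present else (1.0 - q)
        scores[disease] = p
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


# Illustrative, made-up numbers; only the priors differ between populations.
LIK = {
    "malaria": {"fever": 0.9, "cough": 0.2},
    "flu": {"fever": 0.8, "cough": 0.7},
}
endemic = Backend(prior={"malaria": 0.5, "flu": 0.5}, likelihood=LIK)
non_endemic = Backend(prior={"malaria": 0.01, "flu": 0.99}, likelihood=LIK)

# Structured evidence, as a sensor might emit it after parsing a reply.
evidence = {"fever": True, "cough": False}
post_endemic = posterior(endemic, evidence)
post_non_endemic = posterior(non_endemic, evidence)
print(post_endemic, post_non_endemic)
```

With these toy numbers, fever without cough points to malaria in the endemic population but remains an unlikely explanation in the non-endemic one, even though the parsing step and the inference code are identical.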

Approach

BMBE (Bayesian Medical Belief Engine) uses an LLM purely as a sensor — extracting structured evidence from free-text patient replies and rendering follow-up questions in natural language. A deterministic, auditable Bayesian inference module maintains belief over diagnoses, selects next questions, and decides when to commit. The statistical backend is a standalone module, decoupled from the LLM’s weights.
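
The loop described above (maintain a belief, select the next question, decide when to commit) can be sketched as follows. This is not the authors' algorithm: the abstract does not say how questions are selected or when the engine commits. The sketch assumes a naive-Bayes belief update, question selection by lowest expected posterior entropy, and a fixed confidence threshold for committing. The `sensor` callable stands in for the LLM and is replaced here by a scripted patient; all names and probabilities are hypothetical.

```python
import math

# Made-up knowledge base: priors and P(finding present | disease).
PRIOR = {"flu": 0.5, "cold": 0.3, "strep": 0.2}
LIK = {
    "flu": {"fever": 0.9, "sore_throat": 0.4, "runny_nose": 0.5},
    "cold": {"fever": 0.1, "sore_throat": 0.5, "runny_nose": 0.9},
    "strep": {"fever": 0.7, "sore_throat": 0.95, "runny_nose": 0.1},
}
FINDINGS = ["fever", "sore_throat", "runny_nose"]


def update(belief, finding, present):
    """Bayes update of the belief for one observed finding."""
    unnorm = {
        d: p * (LIK[d][finding] if present else 1.0 - LIK[d][finding])
        for d, p in belief.items()
    }
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}


def entropy(belief):
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def expected_entropy(belief, finding):
    """Expected posterior entropy if we ask about `finding`."""
    p_yes = sum(p * LIK[d][finding] for d, p in belief.items())
    h = 0.0
    if p_yes > 0:
        h += p_yes * entropy(update(belief, finding, True))
    if p_yes < 1:
        h += (1.0 - p_yes) * entropy(update(belief, finding, False))
    return h


def next_question(belief, asked):
    """Pick the unasked finding that minimises expected entropy, or None."""
    candidates = [f for f in FINDINGS if f not in asked]
    if not candidates:
        return None
    return min(candidates, key=lambda f: expected_entropy(belief, f))


def run_dialogue(sensor, threshold=0.9):
    """Ask until one diagnosis reaches `threshold`; abstain (None) otherwise."""
    belief, asked = dict(PRIOR), set()
    while max(belief.values()) < threshold:
        question = next_question(belief, asked)
        if question is None:
            return None, belief  # questions exhausted: abstain
        asked.add(question)
        # In the real system an LLM would verbalise the question and parse
        # the reply; here `sensor` maps a finding name straight to a bool.
        belief = update(belief, question, sensor(question))
    return max(belief, key=belief.get), belief


# Scripted patient standing in for the LLM sensor.
patient = {"fever": True, "sore_throat": True, "runny_nose": False}
strict_dx, strict_belief = run_dialogue(patient.get, threshold=0.99)
lenient_dx, lenient_belief = run_dialogue(patient.get, threshold=0.55)
print(strict_dx, lenient_dx)
```

Because every step is plain arithmetic over an explicit table, each belief state can be logged and replayed, which is one way to read the "deterministic, auditable" property. Raising the threshold makes the engine abstain rather than commit on weak evidence.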

Experiments

BMBE is evaluated against frontier standalone LLM “doctors” from the same model family as its sensor, using both empirical and LLM-generated medical knowledge bases. The abstract implies, but does not list, the metrics: diagnostic accuracy, coverage, calibration, cost, and robustness under adversarial patient communication styles. Specific datasets, model names, and numeric baselines are not stated in the abstract.

Results

The abstract reports three qualitative findings: (1) calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff; (2) a cheap LLM sensor paired with the Bayesian engine outperforms a frontier standalone model from the same family at much lower cost; (3) standalone LLM doctors collapse under adversarial patient communication styles while BMBE remains robust. No headline numbers are disclosed in the abstract.
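
To make the accuracy-coverage tradeoff concrete, here is a minimal sketch of how such a curve is usually computed in selective prediction. It is a generic illustration, not the paper's evaluation code. The function name `accuracy_coverage` and the five toy cases are invented. The system answers only when its top posterior reaches a threshold, coverage is the fraction of cases answered, and accuracy is measured on the answered cases only.

```python
def accuracy_coverage(confidences, correct, threshold):
    """Return (accuracy on answered cases, fraction of cases answered)."""
    kept = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(kept) / len(confidences)
    # Accuracy is undefined when the system abstains on every case.
    accuracy = sum(kept) / len(kept) if kept else None
    return accuracy, coverage


# Made-up cases: top-posterior confidence and whether the diagnosis was right.
conf = [0.95, 0.90, 0.80, 0.60, 0.55]
correct = [True, True, False, True, False]

# Sweeping the threshold traces out the curve.
curve = {t: accuracy_coverage(conf, correct, t) for t in (0.5, 0.7, 0.9)}
for t, (acc, cov) in curve.items():
    print(f"threshold={t}: accuracy={acc:.2f}, coverage={cov:.2f}")
```

On this toy data, raising the threshold trades coverage for accuracy. A well-calibrated posterior is what makes the threshold a meaningful dial, since the confidence then tracks the actual chance of being right.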

Why It Matters

Argues that, for high-stakes dialogue agents, pairing a small LLM with an explicit probabilistic backend can beat scaling alone: the combination is cheaper, auditable, privacy-preserving, and swappable per population. It offers a concrete template for neurosymbolic medical and agentic systems where regulators demand traceable reasoning.

Connections to Prior Work

Neurosymbolic AI and tool-augmented LLMs (e.g., Toolformer); Bayesian diagnostic networks (QMR-DT, Internist-1); LLM-as-judge vs LLM-as-sensor framings; selective prediction / abstention literature; work on medical LLMs and diagnostic dialogue (Med-PaLM, AMIE); approaches that hand reasoning to a deterministic solver rather than the LLM itself (e.g., Faithful CoT).

Open Questions

How is the Bayesian knowledge base constructed and maintained at scale?
What are the quantitative accuracy, calibration, and cost numbers against named baselines?
Does the sensor LLM introduce systematic parsing biases that corrupt the belief state?
Does the approach scale beyond diagnosis to triage, treatment, or multi-turn longitudinal care?
How does it perform on real clinical transcripts versus simulated patients?

Original abstract

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.