Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

arXiv: 2604.20022 · PDF

作者: Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

主分类: cs.LG · 全部: cs.AI, cs.CL, cs.LG

命中关键词: large language model, llm, agent, rag, reasoning, inference

TL;DR

BMBE 把 LLM 降级为"传感器"，把医疗诊断推理交给可审计的贝叶斯引擎，模块化架构在精度、隐私和鲁棒性上超过独立 frontier LLM。

核心观点

LLM 把"自然语言沟通"与"概率推理"混在一起是架构缺陷，而非工程瑕疵。
应严格分离语言层与推理层：LLM 只做解析和措辞，贝叶斯引擎承担全部诊断推断。
由于患者数据不进入 LLM，架构天然私密；统计后端可按人群替换，无需重训。
带来三大独有特性：可调的 selective diagnosis 精度-覆盖权衡、“统计分离 gap”，以及对抗性患者语气下的鲁棒性。

方法

提出 BMBE (Bayesian Medical Belief Engine)，模块化对话诊断框架：

LLM as sensor：解析患者自由文本为结构化证据，并把引擎要问的问题口语化。
Bayesian engine：确定性、可审计的概率推理核心，基于知识库维护疾病后验，决定下一步询问和何时给出诊断。
Selective diagnosis：通过阈值连续调节 accuracy–coverage 折中。
统计后端可独立替换以适配不同人群。

实验

在经验知识库与 LLM 生成知识库两类设置上评测。
基线为同家族的 frontier standalone LLM（autonomous diagnostic agent）。
指标覆盖诊断准确率、覆盖率、成本，以及对抗性沟通风格下的稳健性。具体数据集名称摘要未披露。

结果

廉价 sensor + 贝叶斯引擎可超越同家族 frontier 独立模型，成本只是后者一小部分，呈现"统计分离 gap"。
可连续调节精度-覆盖曲线，独立 LLM 无此能力。
在对抗性患者语气下，独立模型性能崩溃，BMBE 保持稳健。
具体数值摘要未给出，需看正文。

为什么重要

对医疗 agent、隐私合规系统和 LLM infra 从业者：展示了一条"不靠 scale、靠架构"的路线——把不确定性推理从 LLM 里抽出来交给可验证模块，可同时拿到隐私、可审计性、成本优势和可控的弃答机制，对高风险领域部署有直接借鉴意义。

与已有工作的关系

延续 LLM-as-diagnostic-agent 方向（如 AMIE、Med-PaLM 式对话诊断）。
呼应 neuro-symbolic 与 tool-augmented LLM 思路，把确定性推理外置。
贝叶斯引擎部分承袭经典 medical expert system / Bayesian network 诊断（QMR-DT 等）。
selective prediction / conformal 风格的 accuracy-coverage 折中亦有先例。

尚未回答的问题

在真实临床数据和多语种、多文化人群上的外部效度？
知识库构建与维护成本，尤其贝叶斯网络结构与条件概率的获取与更新？
解析式 LLM sensor 的错误如何传播进后验，是否需要不确定性估计？
面向多轮长程病史、多病共存、检验结果整合的扩展性？
与临床工作流、医生监督的交互形态与监管合规路径。

论文图表

图 1: Page 2 (rendered)

图 1

图 2: Page 3 (rendered)

图 2

图 3: Page 4 (rendered)

图 3

原始摘要

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.