Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

arXiv: 2604.20658

Authors: Shivani Kumar, Adarsh Bharathwaj, David Jurgens

Primary category: cs.CL · All: cs.CL

Matched keywords: large language model, llm, agent, multi-agent, reasoning, gpu

TL;DR

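"Cooperative profiles" measured with behavioral economics games robustly predict how multi-agent LLM teams perform on collaborative AI-for-Science tasks, and can serve as an inexpensive screening tool before deployment.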

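Key points

Cooperative disposition is a distinct, measurable property of LLMs that general ability does not explain.
Cooperative profiles derived from six behavioral economics games robustly predict performance on downstream multi-agent scientific tasks.
Models that coordinate well and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports.
The framework offers a fast, low-cost pre-deployment diagnostic of "cooperative fitness".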
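
To make the contrast between investing in team production and playing greedily concrete, here is a minimal sketch of a textbook linear public goods game. This is an illustration only: the summary does not name the paper's six games or their parameters, so the function `public_goods_payoffs`, the endowment of 10, and the multiplier of 2 are all assumptions chosen for the example.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Payoffs in a textbook linear public goods game (illustrative only).

    Each player keeps whatever they do not contribute. The pooled
    contributions are multiplied and split equally among all players.
    All parameter values here are hypothetical, not taken from the paper.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Everyone invests fully: the team as a whole does best.
all_cooperate = public_goods_payoffs([10, 10, 10, 10])  # [20.0, 20.0, 20.0, 20.0]

# Everyone plays greedily: each player just keeps the endowment.
all_greedy = public_goods_payoffs([0, 0, 0, 0])  # [10.0, 10.0, 10.0, 10.0]

# One free rider among three contributors: the free rider gains
# individually (25.0), but every contributor is worse off (15.0).
one_free_rider = public_goods_payoffs([0, 10, 10, 10])  # [25.0, 15.0, 15.0, 15.0]
```

The tension is visible in the last case: greed pays for the individual while lowering the team total from 80 to 70, which is the kind of behavior a cooperative profile is meant to pick up.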

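Method

Run six classic behavioral economics games (covering distinct cooperation mechanisms) on 35 open-weight LLMs to build a cooperative behavior profile for each model.
Construct an AI-for-Science multi-agent task in which LLM teams, under a shared budget constraint (GPU/credit), collaboratively analyze data, build models, and write a scientific report.
Use the game profiles as features in regressions that predict three downstream outcomes (accuracy, quality, completion), controlling for multiple confounders (such as general ability).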
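
The regression step above can be sketched as a nested-model comparison: fit the outcome on a general-ability control alone, then add a cooperation feature and look at the gain in R². This is a hedged illustration, not the authors' analysis. The data are synthetic (generated from a seeded random number generator), and the variable names, the single cooperation score, and the effect sizes are all assumptions made for the example.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_r2(X, y):
    """Fit y ~ X with an intercept by least squares and return R^2."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Normal equations: (A^T A) beta = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(ata, aty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for 35 models; every number here is made up.
rng = random.Random(0)
ability = [rng.gauss(0, 1) for _ in range(35)]      # hypothetical ability control
cooperation = [rng.gauss(0, 1) for _ in range(35)]  # hypothetical game-derived score
quality = [0.5 * a + 0.4 * c + rng.gauss(0, 0.3)    # assumed data-generating process
           for a, c in zip(ability, cooperation)]

r2_base = ols_r2([[a] for a in ability], quality)
r2_full = ols_r2([[a, c] for a, c in zip(ability, cooperation)], quality)
delta_r2 = r2_full - r2_base  # variance explained beyond the ability control
```

A positive `delta_r2` on real data would correspond to the claim that the cooperative profile carries predictive information beyond general ability. With only 35 models, a real analysis would also need uncertainty estimates, which this sketch omits.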

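Experiments

Models: 35 open-weight LLMs.
Diagnostic tasks: six behavioral economics games.
Downstream task: an AI-for-Science collaborative workflow covering data analysis, modeling, and report generation under a shared budget constraint.
Metrics: accuracy, quality, and completion of the scientific reports.
Baseline: regression controls for factors such as general model ability.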

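Results

Game-derived cooperative profiles are robustly associated with all three downstream outcomes.
Models that favor multiplicative team production and coordinate well produce systematically better reports.
The associations hold after controlling for general ability, indicating they are not a by-product of capability.
The abstract does not disclose specific numbers (correlation coefficients, ΔR², etc.).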

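Why it matters

Provides lightweight pre-deployment screening for multi-agent LLM systems: a handful of games in place of expensive end-to-end evaluation.
Treats "cooperativeness" as a model-selection dimension independent of capability and alignment.
For AI infrastructure: under shared resource constraints (GPUs, credits), choosing agents that cooperate better could improve output and resource utilization.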

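Relation to prior work

Continues the line of work that evaluates LLMs with game theory and behavioral economics (prior work on the prisoner's dilemma, public goods games, and so on).
Fills an evaluation gap for multi-agent LLM collaboration frameworks such as AutoGen and MetaGPT.
Echoes the task setups of AI-for-Science agents (such as ChemCrow and data science agents).
Extends research on LLM social behavior from isolated games to real collaborative output.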

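Open questions

What causal mechanism links game profiles to downstream performance, and which cooperative traits are most predictive?
Do the conclusions hold for closed-source models (GPT, Claude, Gemini)?
Can cooperative profiles be changed through prompting or fine-tuning to improve team output?
Does the approach carry over to other task domains (non-scientific collaboration such as coding, operations, or business decisions)?
How do team size, heterogeneous composition, and communication protocols affect predictive power?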

Original abstract

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model’s behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.