arXiv: 2604.20658 · PDF

Authors: Shivani Kumar, Adarsh Bharathwaj, David Jurgens

Primary category: cs.CL · all: cs.CL

Matched keywords: large language model, llm, agent, multi-agent, reasoning, gpu


TL;DR

The authors benchmark 35 open-weight LLMs on six behavioral-economics games and show that the resulting “cooperative profiles” predict downstream team performance in AI-for-Science workflows under shared budget constraints, offering a cheap diagnostic for multi-agent deployment.

Key Ideas

  • Cooperative disposition is a distinct, measurable LLM property, not reducible to general capability.
  • Behavioral-economics games isolate cooperation mechanisms that transfer to realistic multi-agent science tasks.
  • Models favoring multiplicative team production over greedy strategies yield better scientific reports.
  • Game-based screening can precede expensive multi-agent rollouts (sketched below).
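
A minimal sketch of that screening step, assuming a hypothetical run_game scorer, an assumed game set, and an illustrative cutoff; none of these names or values come from the paper:

```python
import random

def run_game(model_id: str, game: str) -> float:
    """Stub standing in for a real rollout: prompt the model with the game,
    parse its move, and score cooperativeness in [0, 1]."""
    random.seed(hash((model_id, game)) % (2**32))
    return random.random()

def cooperative_score(model_id: str, games: list[str]) -> float:
    """Average cooperation rate across the stylized games."""
    return sum(run_game(model_id, g) for g in games) / len(games)

GAMES = ["public_goods", "trust", "coordination"]  # assumed game set
CANDIDATES = ["model-a", "model-b", "model-c"]     # hypothetical model ids
THRESHOLD = 0.6                                    # illustrative cutoff

shortlist = [m for m in CANDIDATES if cooperative_score(m, GAMES) >= THRESHOLD]
print(shortlist)  # only these models proceed to costly multi-agent rollouts
```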

Approach

  • Evaluate 35 open-weight LLMs across six behavioral-economics games targeting distinct cooperation mechanisms (coordination, investment, resource sharing).
  • Derive per-model “cooperative profiles” from game behavior.
  • Deploy LLM teams in an AI-for-Science pipeline: collaboratively analyze data, build models, and write scientific reports under shared budgets (e.g., GPU/credit caps).
  • Regress downstream outcomes on cooperative-profile features while controlling for confounds (likely model size and general-ability benchmark scores); a sketch of this step follows the list.
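
A minimal sketch of the regression step, assuming a per-model table of game-derived features plus confound covariates; the column names and synthetic data are illustrative assumptions, not the paper's variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 35  # one row per open-weight model

# Synthetic stand-in data; every column name here is an assumption.
df = pd.DataFrame({
    "coop_investment": rng.uniform(0, 1, n),  # profile feature
    "coordination": rng.uniform(0, 1, n),     # profile feature
    "log_params": rng.uniform(0, 2, n),       # confound: model size
    "mmlu": rng.uniform(0.3, 0.9, n),         # confound: general ability
})
df["report_quality"] = (0.5 * df["coop_investment"]
                        + 0.2 * df["mmlu"]
                        + rng.normal(0, 0.1, n))  # fabricated outcome

# Downstream outcome on profile features, controlling for the confounds.
fit = smf.ols("report_quality ~ coop_investment + coordination"
              " + log_params + mmlu", data=df).fit()
print(fit.params)
```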

Experiments

  • Models: 35 open-weight LLMs.
  • Games: six behavioral-economics tasks; the abstract does not name them, but they likely include public-goods, trust, and coordination variants (a public-goods round is sketched after this list).
  • Downstream task: multi-agent AI-for-Science workflow with shared constraints.
  • Metrics: report accuracy, quality, and completion.
  • Baselines / controls: general-ability factors partialled out.
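
The abstract does not name the six games, but a public-goods round is a standard probe of the "multiplicative team production vs. greedy strategy" trade-off the paper highlights. A minimal sketch with assumed endowment and multiplier values:

```python
def public_goods_payoffs(contributions: list[float],
                         endowment: float = 10.0,
                         multiplier: float = 1.6) -> list[float]:
    """Each agent keeps (endowment - contribution) plus an equal share of
    the multiplied common pool; multiplier/n < 1 < multiplier makes
    free-riding individually tempting but collectively costly."""
    pool = multiplier * sum(contributions)
    share = pool / len(contributions)
    return [endowment - c + share for c in contributions]

# Four agents: one free-rider among cooperators vs. full cooperation.
print(public_goods_payoffs([0, 10, 10, 10]))   # [22.0, 12.0, 12.0, 12.0]
print(public_goods_payoffs([10, 10, 10, 10]))  # [16.0, 16.0, 16.0, 16.0]
```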

Results

  • Cooperative profiles robustly predict downstream accuracy, quality, and completion.
  • Effect persists after controlling for multiple confounding factors.
  • Headline numerical effect sizes are not given in the abstract.

Why It Matters

  • Provides a fast, inexpensive screening tool for multi-agent LLM deployments where coordination and budget-sharing matter.
  • Reframes multi-agent selection beyond raw benchmark scores toward cooperative disposition.
  • Useful for agent/infra teams building scientific, engineering, or tool-using LLM collectives.

Connections to Prior Work

  • Behavioral-economics probes of LLMs (trust games, ultimatum, public-goods studies).
  • Multi-agent LLM frameworks (AutoGen, MetaGPT, ChatDev, AI-Scientist).
  • Work on LLM “personality” / social-preference elicitation.
  • Emergent cooperation and game-theoretic evaluations in RL agents.
  • Scientific-writing and data-analysis agent benchmarks.

Open Questions

  • Which specific games carry the most predictive signal, and do they generalize beyond AI-for-Science?
  • Does cooperative profile stay stable under prompting, fine-tuning, or RLHF interventions?
  • Are closed-weight frontier models (GPT-4.x, Claude, Gemini) consistent with the 35-model findings?
  • Can cooperative disposition be deliberately trained or aligned, and at what cost to single-agent capability?
  • How do heterogeneous teams (mixing cooperators and defectors) behave versus homogeneous ones?



Original abstract

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model’s behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.