arXiv: 2604.22748 · PDF
作者: Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia
主分类: cs.AI · 全部: cs.AI
命中关键词: agent, agentic, multi-agent, ai system
TL;DR
本文提出「levels × laws」分类法,把 world model 按能力分成 L1 Predictor / L2 Simulator / L3 Evolver,按约束分成 physical / digital / social / scientific 四类,系统综述 400+ 相关工作。

核心观点
- World model 在不同社区含义分裂,需要统一框架。
- 提出二维分类:能力级别(L1 单步预测 / L2 多步 action-conditioned rollout / L3 自我修正)× 律则域(物理 / 数字 / 社会 / 科学)。
- 覆盖 model-based RL、视频生成、web/GUI agent、multi-agent 社会仿真、AI4Science 五大方向。
- 提出以 decision-centric 为核心的评测原则,以及最小可复现评测包。
- 输出架构建议、开放问题与 governance 挑战,形成跨社区路线图。
方法

- 沿「能力」轴:L1 学一步局部 transition;L2 将 L1 组合成尊重领域律则的多步 rollout;L3 当预测失败时自动修订模型本身。
- 沿「律则」轴:physical(连续动力学)、digital(软件 / UI 状态机)、social(多智能体、规范)、scientific(假设—实验—修正闭环)。律则决定约束与典型失败模式。
- 用这组 level×regime 网格对每个 cell 梳理代表方法、失败模式、评测惯例。
- 方法论为 survey + taxonomy,非新算法。
实验
- 无传统意义上的训练 / benchmark 实验。
- 综述 400+ 文献,选出 100+ 代表系统放入 level-regime 矩阵。
- 提出一个 “minimal reproducible evaluation package”,强调决策导向指标(不仅是下一步预测精度)。
结果

- 观察:大多数现有系统停留在 L1–L2,L3 Evolver 级别的系统极少且集中在 AI4Science。
- 各 regime 失败模式可归类:physical 易违反守恒律;digital 对 GUI 状态漂移敏感;social 难以校准规范;scientific 受限于实验反馈循环。
- 由于是综述,主张的「站得住脚」性主要靠 coverage(400+ 篇)和分类一致性支撑。
为什么重要
- 给 agent / LLM 从业者一张地图:评估自家系统到底在 L? × ? regime,选评测与失败模式就有据可依。
- 指出"被动 next-token 预测 → 主动仿真并重塑环境"的递进方向,对 agent infra(尤其 GUI agent、multi-agent sim、自动科研)是路线图级的指引。
- 评测建议可直接用于 MBRL / video world model / web agent 的 benchmark 设计。
与已有工作的关系
- 延伸 Ha & Schmidhuber 的 World Models、Dreamer 系列 MBRL。
- 吸收 Sora / Genie 等生成式 world model 工作到 L2 Simulator 视角。
- 把 WebArena、OSWorld、GUI agent 放进 digital regime。
- 将 Generative Agents、Concordia 等归入 social regime。
- AI4Science(如 AI-driven discovery)作为 L3 Evolver 的主要样本。
尚未回答的问题
- L3 Evolver 的形式化判据与可扩展训练方法仍缺失。
- 跨 regime 的统一评测指标(决策价值 vs 预测精度)如何落地。
- 跨 level 的迁移:L2 simulator 如何自动升级到 L3。
- Governance:自改写的 world model 带来的安全与审计边界如何设定。
原始摘要
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a “levels x laws” taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.