arXiv: 2604.22748 · PDF

Authors: Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

Affiliations: Hong Kong University of Science and Technology, National University of Singapore, University of Oxford, Nanyang Technological University, Chinese University of Hong Kong, University of Hong Kong, University of Washington, Hong Kong University of Science and Technology (Guangzhou), Singapore University of Technology and Design, Singapore Management University

Primary category: cs.AI · all: cs.AI

Matched keywords: agent, agentic, multi-agent, ai system


TL;DR

A survey proposing a “levels × laws” taxonomy for agentic world models, organizing 400+ works along three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and four governing-law regimes (physical, digital, social, scientific), with decision-centric evaluation principles and a reproducible evaluation package.

Key Ideas

  • Two-axis taxonomy: levels (L1 one-step predictor → L2 action-conditioned multi-step simulator → L3 self-revising evolver) crossed with laws (physical, digital, social, scientific regimes).
  • Each regime dictates distinct constraints and failure modes that world models must satisfy.
  • Synthesizes 400+ works and 100+ representative systems across model-based RL, video generation, web/GUI agents, multi-agent social simulation, and AI-driven scientific discovery.
  • Proposes decision-centric evaluation principles plus a minimal reproducible evaluation package.

The four law regimes span embodied manipulation, software surfaces, networked agents, and instrumented experimentation — each demanding different invariants.

Figure 1

The roadmap anchors 70 surveyed systems along the capability ladder, capping five per year-level cell to illustrate how the field has progressed from L1 predictors toward L2 simulators and nascent L3 evolvers between 2018 and 2026.

Figure 2

The same timeline, rendered in an alternate layout, underscores that most deployed systems still sit at L1-L2, with L3 autonomous self-revision remaining sparsely populated.

Figure 3

Approach

The authors define each level formally — L1 learns one-step local transition operators; L2 composes them into action-conditioned rollouts respecting domain laws; L3 autonomously revises its model when predictions fail against new evidence. They then map each of over 100 surveyed systems onto a (level, regime) cell, analyzing methods, failure modes, and evaluation practices per pair, and distill architectural guidance from the cross-cutting patterns.

Experiments

No empirical experiments — this is a survey. The “evaluation” contribution is a proposed minimal reproducible evaluation package and decision-centric principles intended to standardize how future world models are benchmarked across regimes.

Results

No headline numbers. Deliverables: the levels×laws framework, 400+ work synthesis, 100+ system catalog, evaluation principles/package, and a roadmap of open problems and governance challenges.

Why It Matters

Gives agent and LLM-infra practitioners a shared vocabulary to compare a video-generation world model against a GUI-agent simulator or a scientific-discovery engine, and concretizes what L3 “self-revising” world models would require — reframing world modeling as the central bottleneck for goal-directed agents.

Connections to Prior Work

Bridges model-based RL (Dreamer-line), video/physics generators (Sora-line), web/GUI agents, generative agents for social simulation, and AI-for-science autonomous experimentation — communities that have largely evolved in isolation.

Open Questions

How to operationalize L3 self-revision; whether the minimal evaluation package generalizes across regimes; governance mechanisms for simulators that reshape their environments; and how to handle cross-regime agents (e.g., embodied + social).


Original abstract

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a “levels x laws” taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.