arXiv: 2604.18401 · PDF

Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

Primary category: cs.CL · All categories: cs.CL

Matched keywords: large language model, llm, agent, agentic, tool use, reasoning, post-train, rlhf


TL;DR

StepPO argues that Agentic RL should be upgraded from a token-level MDP to a step-level MDP, taking the step as the action granularity of the LLM agent, and proposes a matching step-level credit assignment to align policy optimization with agent decisions.

Core ideas

  • Traditional token-level RL (RLHF/RLVR) is inadequate for capturing multi-turn interactive agent behavior.
  • The MDP granularity should be lifted from token to step, treating a step (a decision/tool call) as the agent's action (see the formal sketch after this list).
  • Correspondingly, reward propagation and credit assignment should also operate at the step level.
  • Agentic RL faces new challenges such as sparse, delayed rewards and long, variable contexts; the step-level abstraction is a more natural fit.
  • StepPO is presented as a position paper on step-aligned policy optimization.
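
For intuition, here is a rough formal contrast between the two granularities; the notation is ours, not the paper's:

```latex
\begin{align*}
&\textbf{Token-level MDP:} && s_t = (x,\, y_{<t}),\quad a_t = y_t \in \mathcal{V},\quad \pi_\theta(a_t \mid s_t)\\
&\textbf{Step-level MDP:}  && s_k = (x,\, a_{<k},\, o_{<k}),\quad a_k = (\text{reasoning}_k,\ \text{tool call}_k)\\
&                          && s_{k+1} = (s_k,\, a_k,\, o_k),\quad r_k \text{ assigned per step, not per token}
\end{align*}
```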

Method

The authors reformulate agent interaction as a step-level MDP: each step encapsulates one round of model reasoning plus one tool/environment interaction, and serves as the policy's unit of action. On this basis they define step-level credit assignment, propagating delayed rewards back to the responsible step rather than to individual tokens, so that policy gradients and value estimation align with the granularity of agent decisions. The paper also discusses the systems designs needed to realize step-level Agentic RL in practice (trajectory organization, reward shaping, long-context support, etc.).
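
The abstract does not spell out the algorithm, so the following is a minimal sketch of one plausible instantiation, assuming per-step rewards, discounted step-level returns, and broadcasting each step's advantage to all tokens within that step; names such as step_returns and token_advantages are ours, not StepPO's:

```python
from typing import List, Tuple

def step_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go computed over steps, not tokens.

    step_rewards[k] is the (possibly sparse, delayed) reward observed after
    step k; a single terminal reward looks like [0.0, ..., 0.0, R].
    """
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def token_advantages(
    step_rewards: List[float],
    step_token_spans: List[Tuple[int, int]],  # (start, end) token indices of each step
    step_baselines: List[float],              # e.g. a value estimate per step
    num_tokens: int,
    gamma: float = 1.0,
) -> List[float]:
    """Assign credit at the step level, then broadcast it to tokens.

    Every token inside step k shares the same advantage G_k - b_k, so the
    policy gradient effectively treats the whole step as one action.
    """
    adv = [0.0] * num_tokens
    gs = step_returns(step_rewards, gamma)
    for (start, end), g, b in zip(step_token_spans, gs, step_baselines):
        for t in range(start, end):
            adv[t] = g - b
    return adv

# Example: a 3-step trajectory with one sparse terminal reward.
advs = token_advantages(
    step_rewards=[0.0, 0.0, 1.0],
    step_token_spans=[(0, 40), (40, 90), (90, 120)],
    step_baselines=[0.3, 0.5, 0.8],
    num_tokens=120,
)
```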

Experiments

The abstract mentions only that "preliminary experiments" provide initial evidence; it discloses no datasets, baselines, or metric details.

Results

The abstract reports no quantitative numbers, claiming only that the preliminary experiments support the effectiveness of the step-level perspective; rigorous conclusions require reading the full paper.

Why it matters

For the post-training pipeline behind agent harnesses (e.g., Claude Code-style systems), step-level modeling aligns the RL objective with real agent behavior. This may ease training under sparse/delayed rewards and offers a more scalable optimization framework for core capabilities such as tool use and multi-turn decision making.

Relation to prior work

  • Builds on the token-level LLM RL lineage of RLHF and RLVR.
  • Complements multi-turn agent training approaches such as ReAct and ToolLLM.
  • Re-expresses hierarchical RL and option/macro-action ideas in the LLM-agent setting.
  • Points in the same direction as recent agentic RL / trajectory-level optimization work (e.g., ArCHer-style methods).

Open questions

  • How can the boundary of a "step" be segmented automatically and robustly? (a toy heuristic is sketched after this list)
  • What is the concrete algorithmic form of step-level credit assignment, and what are its convergence properties?
  • What are the quantitative gains over token-level baselines on real benchmarks (SWE-bench, WebArena, etc.)?
  • How does it co-design with systems-level machinery such as long context, KV caching, and parallel rollouts?
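
To make the first open question concrete, a naive baseline would cut steps at tool-observation boundaries. The message schema below is hypothetical (not from the paper), and the question is precisely whether heuristics like this are robust enough:

```python
from typing import Dict, List

def segment_steps(trajectory: List[Dict]) -> List[List[Dict]]:
    """Naive heuristic: a step is an assistant turn (reasoning plus an
    optional tool call) together with the tool observation that follows it.

    Assumes messages shaped like {"role": "assistant" | "tool", "content": ...};
    breaks down for interleaved calls, retries, or tool-free reasoning turns.
    """
    steps: List[List[Dict]] = []
    current: List[Dict] = []
    for msg in trajectory:
        current.append(msg)
        if msg["role"] == "tool":  # the observation closes the step
            steps.append(current)
            current = []
    if current:  # a trailing assistant-only turn forms its own step
        steps.append(current)
    return steps
```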

Figures

Figure 1 (extracted from PDF; no caption recovered)

Figure 2 (extracted from PDF; no caption recovered)


Original abstract

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.