arXiv: 2604.18401 · PDF

Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

Primary category: cs.CL · all: cs.CL

Matched keywords: large language model, llm, agent, agentic, tool use, reasoning, post-train, rlhf


TL;DR

StepPO argues that Agentic RL for LLMs should move from token-level to step-level MDPs, treating each agent step, rather than each token, as the action unit and assigning credit at that granularity. The paper is a position piece with preliminary experiments.

Key Ideas

  • Token-level MDPs inherited from RLHF/RLVR poorly fit multi-turn agent settings with delayed, sparse rewards and long contexts.
  • The “step” (a coherent agent decision / tool-use turn) is the proper action abstraction for LLM agents; one minimal formalization is sketched after this list.
  • Policy optimization and reward propagation should align to step granularity via step-level credit assignment.
  • Realizing this requires rethinking RL system design (rollout, reward routing, context handling).
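
The abstract does not spell out the formalism, so the following is only one minimal way to write the step-level MDP down (our notation, not necessarily the paper's): the state is the accumulated context, the action is a whole step, and the environment transition appends the tool observation.

    s_t = (x, a_1, o_1, ..., a_{t-1}, o_{t-1})        % state: prompt x plus all previous steps and observations
    a_t = (y_{t,1}, ..., y_{t,K_t})                   % action: the token span of one decision / tool call
    \pi_\theta(a_t \mid s_t) = \prod_{k=1}^{K_t} \pi_\theta(y_{t,k} \mid s_t, y_{t,<k})
    s_{t+1} = s_t \oplus (a_t, o_t),  with a step-level reward r_t (often zero until the final outcome)

Under this view the policy gradient attaches one advantage to each a_t rather than to each token.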

Approach

Conceptual reformulation: lift the MDP from token-level actions to step-level actions, where a step aggregates the tokens comprising one agent decision (reasoning plus an action or tool call). Credit assignment then propagates rewards across steps rather than across tokens, aligning the gradient signal with decision boundaries. The paper also sketches the systems requirements (trajectory handling, variable-length context, reward attribution) needed to realize StepPO in practice, but the abstract keeps algorithmic details light.
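
Since the abstract gives no concrete objective, the sketch below is only a generic instance of step-level credit assignment under the formulation above, not StepPO's actual algorithm; the Step container, the Monte-Carlo return, and the trajectory-mean baseline are all illustrative choices.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Step:
        token_logprobs: List[float]   # log-probs of the tokens emitted in this step
        reward: float                 # step-level reward (often 0.0 until the final step)

    def step_advantages(steps: List[Step], gamma: float = 1.0) -> List[float]:
        # Monte-Carlo reward-to-go per step, centered with a trajectory-mean baseline.
        returns, g = [], 0.0
        for step in reversed(steps):
            g = step.reward + gamma * g
            returns.append(g)
        returns.reverse()
        baseline = sum(returns) / len(returns)
        return [r - baseline for r in returns]

    def step_pg_loss(steps: List[Step], gamma: float = 1.0) -> float:
        # REINFORCE-style loss: every token inside a step shares that step's advantage,
        # so credit is assigned per decision rather than per token.
        loss = 0.0
        for step, adv in zip(steps, step_advantages(steps, gamma)):
            loss -= adv * sum(step.token_logprobs)
        return loss

Token-level PPO/GRPO variants would instead carry the signal at every token position; the only point of the sketch is that the unit holding the advantage is the step.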

Experiments

The abstract only mentions “preliminary experiments” providing “initial evidence.” No datasets, baselines, or metrics are specified — this is a position paper, so concrete experimental scaffolding is thin.

Results

No headline numbers disclosed in the abstract; authors claim initial empirical support for the step-level perspective but do not quantify gains here.

Why It Matters

For agent / LLM-infra practitioners building harnesses (Claude Code-style), step-aligned RL could yield cleaner credit assignment, more stable training over long tool-use trajectories, and better decision-level capabilities than token-level PPO/GRPO variants — potentially reshaping post-training stacks for general agents.

Connections to Prior Work

  • RLHF and RLVR (token-level alignment/reasoning RL) — the baseline paradigm being critiqued.
  • PPO/GRPO and process-reward models (PRMs) — related attempts at finer-grained credit.
  • Hierarchical RL and the options framework — step-as-action echoes temporally extended actions (mapping sketched after this list).
  • Agent frameworks (ReAct, tool-use agents) that already operate at step granularity operationally but train at token level.
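
To make the options analogy concrete (our reading, not a claim from the abstract), recall that in the options framework of Sutton, Precup & Singh (1999) an option is a triple of initiation set, intra-option policy, and termination condition:

    o = (I_o, \pi_o, \beta_o)
    step a_t  ≈  an option whose intra-option policy \pi_o is the token-level autoregressive policy
                 and whose termination \beta_o fires at the step boundary (e.g., once the tool call is emitted)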

Open Questions

  • How is a “step” precisely defined and segmented in practice (fixed boundaries vs. learned)? A naive fixed-boundary heuristic is sketched after this list.
  • Concrete StepPO objective: variance, bias, and convergence vs. token-level PPO?
  • Benchmarks and baselines — does it beat GRPO/RLOO on standard agent suites (SWE-bench, WebArena)?
  • Reward model design for step-level signals with sparse outcomes.
  • Scaling behavior: does step-level credit assignment hold up as trajectories grow to hundreds of steps?
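
On the first question, one obvious fixed-boundary baseline is to segment a decoded trajectory at tool-call boundaries. The sketch below assumes a ReAct-style plain-text transcript with literal "Action:" / "Observation:" markers; both the format and the helper name segment_steps are illustrative, not anything specified by the paper.

    from typing import List

    def segment_steps(trajectory: str) -> List[str]:
        # Fixed-boundary heuristic: a step accumulates lines until the environment's
        # "Observation:" closes the current decision / tool call. Learned or
        # model-defined boundaries are exactly the open question raised above.
        steps, current = [], []
        for line in trajectory.splitlines():
            current.append(line)
            if line.startswith("Observation:"):
                steps.append("\n".join(current))
                current = []
        if current:  # trailing step with no tool call, e.g. the final answer
            steps.append("\n".join(current))
        return steps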

Figures

Figure 1 (extracted from PDF)

Figure 2 (extracted from PDF)


Original abstract

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.