StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
Abstract
StepPO introduces a step-centric approach for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming existing token-centric methods in multi-turn interaction tasks.
Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm of RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL: it optimizes token-level predictions, while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, these components let StepPO optimize agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
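To make the step-level view concrete, the sketch below stores a trajectory as a list of interaction steps and computes one advantage per step rather than per token. It is an illustration under stated assumptions, not the paper's implementation: the `Step` record layout and the `step_level_gae` helper are hypothetical names, and standard generalized advantage estimation (GAE) is swapped in as a stand-in, since the abstract does not specify the exact credit-assignment rule StepPO uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent-environment interaction step (hypothetical record layout)."""
    observation: str  # what the environment returned before the agent acted
    action: str       # the full text the agent generated for this step
    reward: float     # scalar reward attributed to this step
    value: float      # critic estimate for the state before the action


def step_level_gae(steps: List[Step], gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE, with the time index running over steps instead of tokens.

    Assumption: the trajectory terminates after the last step, so the
    bootstrap value beyond the final step is zero.
    """
    advantages = [0.0] * len(steps)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(steps))):
        # One-step temporal-difference error at the step level.
        delta = steps[t].reward + gamma * next_value - steps[t].value
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = steps[t].value
    return advantages


# Usage: a three-step episode whose only reward arrives at the final step.
episode = [
    Step("question", "search(query)", 0.0, 0.0),
    Step("search results", "open(doc)", 0.0, 0.0),
    Step("document text", "answer(final)", 1.0, 0.0),
]
advantages = step_level_gae(episode, gamma=0.5, lam=1.0)
```

In this sketch every token of a step's action would share that step's single advantage, which is one simple way to align the optimization signal with the unit at which the agent actually decides.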
Community
Excited to share StepPO, a step-aligned policy optimization method for agentic RL!
Project: https://agentr1.github.io/steppo/
Code: https://github.com/AgentR1/StepPO
Motivation: LLM agents do not act token by token. They interact in steps: observe, act, get feedback, and continue. This makes token-centric RL a poor fit for agent training.
Method: StepPO reformulates agentic RL as a step-level MDP, with step-native trajectory records, step-level credit assignment, and step-level importance sampling.
Implementation & Results: We build a practical step-native training framework for multi-turn agent-environment interaction. Across multi-hop QA, academic paper search, ALFWorld, and WebShop, StepPO consistently outperforms PPO/GRPO-style baselines.
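As a rough illustration of the step-level importance sampling mentioned in the Method note above, the sketch below forms one probability ratio per step by summing token log-probabilities over the step's action, then applies the familiar PPO-style clipped surrogate to that ratio. Both helpers (`step_ratio`, `clipped_step_objective`) are hypothetical names, and the choices made here (plain summation with no length normalization, a clip range of 0.2) are assumptions for illustration. The post does not state how StepPO defines or clips its ratio.

```python
import math
from typing import List


def step_ratio(new_token_logps: List[float], old_token_logps: List[float]) -> float:
    """Importance ratio for a whole step's action.

    The log-probability of a step's action is the sum of the log-probabilities
    of its tokens, so the step-level ratio is exp(sum(new) - sum(old)).
    """
    return math.exp(sum(new_token_logps) - sum(old_token_logps))


def clipped_step_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate applied to one step (assumed form)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Usage: a step with two action tokens whose log-probabilities are unchanged,
# so the ratio is 1 and the objective equals the step's advantage.
ratio = step_ratio([-1.0, -2.0], [-1.0, -2.0])
objective = clipped_step_objective(ratio, advantage=0.5)
```

A token-centric method would instead compute and clip a separate ratio for every token; computing one ratio per step keeps the off-policy correction at the same granularity as the step-level advantage.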
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GAGPO: Generalized Advantage Grouped Policy Optimization (2026)
- Segment-Aligned Policy Optimization for Multi-Modal Reasoning (2026)
- Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy (2026)
- StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning (2026)
- What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents (2026)
- GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation (2026)
- SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation (2026)
Get this paper in your agent:
hf papers read 2604.18401
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash