arxiv:2604.18401

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Published on Jun 5

Abstract

StepPO introduces a step-centric approach for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming existing token-centric methods in multi-turn interaction tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm of RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL: it optimizes token-level predictions, while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic units of trajectory representation. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, these components let StepPO optimize agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
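To make the step-level view concrete, here is a minimal sketch of a step-native trajectory record with one credit value per step. The `Step` layout, the helper names, and the use of a plain discounted return-to-go are all illustrative assumptions; the abstract does not specify the estimator StepPO actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent-environment interaction step (hypothetical record layout)."""
    observation: str          # what the environment returned before acting
    action_tokens: List[str]  # tokens the policy generated for this step
    reward: float             # scalar reward received after the action


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go, computed once per step rather than per token."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def token_credits(steps: List[Step], gamma: float = 1.0) -> List[List[float]]:
    """Share each step's return across all tokens of that step (assumed scheme)."""
    returns = step_returns(steps, gamma)
    return [[g] * len(s.action_tokens) for s, g in zip(steps, returns)]
```

In this sketch every token inside a step receives the same credit, so the number of tokens in an action does not change how much that decision is rewarded.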

Community

Melmaphother

Paper submitter 10 days ago

🚀 Excited to share StepPO, a step-aligned policy optimization method for agentic RL!
Project: https://agentr1.github.io/steppo/
Code: https://github.com/AgentR1/StepPO

🧩 Motivation: LLM agents do not act token by token. They interact in steps: observe, act, get feedback, and continue. This makes token-centric RL a poor fit for agent training.

✨ Method: StepPO reformulates agentic RL as a step-level MDP, with step-native trajectory records, step-level credit assignment, and step-level importance sampling.
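One way to read "step-level importance sampling" is a single probability ratio per step instead of one per token. The sketch below assumes the step ratio is the product of the token ratios inside the step and plugs it into a PPO-style clipped objective. Both function names and the choice of a raw product (rather than, say, a length-normalized ratio) are assumptions, not details confirmed by this post.

```python
import math
from typing import List


def step_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """Step-level ratio: product of token ratios, i.e. exp of summed log-prob differences."""
    return math.exp(sum(new_logps) - sum(old_logps))


def clipped_step_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate, applied once per step (assumed form)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Under this assumption the clipping range constrains how far the policy can move on a whole action, not on each individual token.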

🛠️ Implementation & Results: We build a practical step-native training framework for multi-turn agent-environment interaction. Across multi-hop QA, academic paper search, ALFWorld, and WebShop, StepPO consistently outperforms PPO/GRPO-style baselines.
