TRL v0.27.0 is out!! 🥳
It includes GDPO, the latest variant of GRPO for multi-reward RL ✨
GDPO decouples reward normalization to avoid reward collapse and improve per-reward convergence, developed by
@sliuau @SimonX et al.
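
To make "decoupled reward normalization" concrete, here is a rough sketch in plain Python. It is not the TRL implementation and does not use the TRL API; the function names and the `EPS` constant are hypothetical. It also assumes equal reward weights and leaves out any further normalization step the paper or the library may apply. It only contrasts summing rewards before group normalization (GRPO-style) with normalizing each reward within the group before summing.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def normalize(values):
    """Standardize a list of values to zero mean and (roughly) unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def coupled_advantages(rewards):
    """GRPO-style: sum the rewards of each rollout, then normalize within the group."""
    totals = [sum(per_rollout) for per_rollout in zip(*rewards)]
    return normalize(totals)


def decoupled_advantages(rewards):
    """Decoupled: normalize each reward within the group first, then sum."""
    normalized = [normalize(r) for r in rewards]
    return [sum(per_rollout) for per_rollout in zip(*normalized)]


# One group of four rollouts scored by two reward functions (illustrative values).
rewards = [
    [1.0, 0.0, 0.0, 0.0],  # reward A: only rollout 0 succeeds
    [0.0, 1.0, 1.0, 0.0],  # reward B: rollouts 1 and 2 succeed
]

coupled = coupled_advantages(rewards)
decoupled = decoupled_advantages(rewards)
print("coupled:  ", [round(a, 3) for a in coupled])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group, rollouts 0, 1, and 2 all have the same summed reward, so the coupled version gives them identical advantages. The decoupled version gives rollout 0 a larger advantage, because it earned the rarer reward, which is the kind of signal that summing first would erase.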
Explore the paper: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization (2601.05242)
Explore the full set of changes here:
https://github.com/huggingface/trl/releases/tag/v0.27.0