Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients Paper • 2606.18216 • Published 4 days ago • 50
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning Paper • 2602.21103 • Published 18 days ago • 4
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning Paper • 2605.28742 • Published 24 days ago • 4
Reinforcement Learning from Rich Feedback with Distributional DAgger Paper • 2606.05152 • Published 17 days ago • 3
On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters Paper • 2606.02437 • Published 19 days ago • 231
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention Paper • 2605.29548 • Published 23 days ago • 11
SkillOpt: Executive Strategy for Self-Evolving Agent Skills Paper • 2605.23904 • Published 29 days ago • 240
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time Paper • 2604.11626 • Published Apr 13 • 102
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass Paper • 2604.10966 • Published Apr 13 • 12
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation Paper • 2604.13010 • Published Apr 14 • 18
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe Paper • 2604.13016 • Published Apr 14 • 111
ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement Paper • 2604.01591 • Published Apr 2 • 42
Embarrassingly Simple Self-Distillation Improves Code Generation Paper • 2604.01193 • Published Apr 1 • 56