Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
Abstract
On-policy distillation for large language models is fragile in long-horizon settings because the common sampled-token variant reduces distribution matching to a one-token signal; this work addresses the problem with teacher top-K truncated reverse-KL, top-p rollout sampling, and special-token masking.
On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
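To make the estimator comparison concrete, here is one possible formalization of the objectives involved. The notation ($\pi_\theta$ for the student, $\pi_T$ for the teacher, $y_{<t}$ for a student-generated prefix) and the exact form of the truncation are our own reading of the abstract, not equations quoted from the paper.

```latex
% Notation assumed here: \pi_\theta = student, \pi_T = teacher, x = prompt,
% y_{<t} = student-generated prefix (illustrative, not quoted from the paper).

% Sequence-level reverse KL, expanded with the chain rule over positions:
D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_T\big)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}
    \Big[\sum_t \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})}\Big].

% Sampled-token OPD keeps only the single sampled token y_t at each position
% (a one-sample estimate of the per-position KL), whereas full token-level OPD
% evaluates the whole per-position distribution on the student prefix:
\mathcal{L}_{\mathrm{token}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}
    \Big[\sum_t D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t})\big)\Big].

% Teacher top-K local support matching (one plausible form): restrict each
% per-position KL to the teacher's top-K tokens S_t and renormalize within S_t:
\mathcal{L}_{\mathrm{topK}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}
    \Big[\sum_t \sum_{v \in S_t}
      \tilde{\pi}_\theta(v \mid x, y_{<t})\,
      \log \frac{\tilde{\pi}_\theta(v \mid x, y_{<t})}{\tilde{\pi}_T(v \mid x, y_{<t})}\Big].
```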
Community
- On-policy distillation (OPD) trains a student on its own rollouts using teacher feedback[1][2][3]. In long-horizon LLM post-training, the common sampled-token implementation can be brittle.
- From a bias-variance perspective, token-level OPD is biased relative to sequence-level reverse-KL, but it admits a much tighter worst-case variance bound. Our toy study shows that stronger future-reward coupling substantially increases gradient variance and destabilizes optimization.
- In practice, brittleness comes from three sources: an imbalanced one-token learning signal, unreliable teacher guidance on student-generated prefixes, and tokenizer/special-token mismatch.
- We replace the one-sample comparison with a teacher top-K truncated reverse-KL over local support, together with top-p rollouts and special-token masking (a minimal sketch follows after this list). This yields more stable training and better results on both reasoning and agentic multi-task benchmarks.
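Below is a minimal PyTorch-style sketch of a teacher top-K truncated reverse-KL loss with special-token masking. The function name, tensor shapes, and the choice to renormalize both distributions over the teacher's top-K support are our own illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F


def topk_truncated_reverse_kl(student_logits, teacher_logits, special_token_mask, k=32):
    """Reverse KL on the teacher's top-K local support, averaged over unmasked positions.

    student_logits, teacher_logits: float tensors of shape [T, V], one row per rollout position.
    special_token_mask: bool tensor of shape [T]; True marks special/template tokens to exclude.
    """
    teacher_logits = teacher_logits.detach()  # no gradient through the teacher

    # Teacher's local support: indices of its top-K tokens at every position.
    support = teacher_logits.topk(k, dim=-1).indices  # [T, K]

    # Restrict both distributions to that support and renormalize within it.
    log_p_student = F.log_softmax(student_logits.gather(-1, support), dim=-1)  # [T, K]
    log_p_teacher = F.log_softmax(teacher_logits.gather(-1, support), dim=-1)  # [T, K]

    # Per-position reverse KL over the truncated support:
    #   sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
    per_token_kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)  # [T]

    keep = ~special_token_mask
    return (per_token_kl * keep).sum() / keep.sum().clamp(min=1)
```

In an on-policy loop, `student_logits` and `teacher_logits` would be computed on the same student-generated rollout (itself drawn with top-p sampling, per the paper), and the mask would cover chat-template and other special tokens where the student and teacher tokenizers can disagree.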
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- Entropy-Aware On-Policy Distillation of Language Models (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Fast and Effective On-policy Distillation from Reasoning Prefixes (2026)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization (2026)
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation (2026)
- Not all tokens are needed(NAT): token efficient reinforcement learning (2026)
- Reinforcement Learning via Self-Distillation (2026)