The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
Abstract
On-policy distillation suffers from miscalibration due to an information mismatch between training and deployment contexts; a calibration-aware framework addresses this, improving both performance and confidence reliability.
On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence, and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, and generalizes robustly under out-of-distribution and continual-learning settings. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
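The student-grounded confidence target described above, estimating empirical confidence from the student's own rollouts, amounts to a Monte Carlo success-rate estimate. A minimal sketch follows; `generate`, `grade`, and the rollout count are hypothetical stand-ins for illustration, not the paper's implementation:

```python
import random

def empirical_confidence(generate, grade, prompt, n_rollouts=16):
    """Estimate student-grounded confidence as the empirical success
    rate over n_rollouts sampled completions from the student model."""
    successes = sum(grade(generate(prompt)) for _ in range(n_rollouts))
    return successes / n_rollouts

# Toy usage with stub generate/grade functions (hypothetical):
# the "model" answers correctly about 70% of the time, so its
# confidence target should be near 0.7 rather than self-reported 1.0.
random.seed(0)
gen = lambda p: "42" if random.random() < 0.7 else "wrong"
grd = lambda ans: ans == "42"
conf = empirical_confidence(gen, grd, "What is 6*7?", n_rollouts=1000)
```

This estimate would then replace the model's self-reported confidence as the distillation target, decoupling the answer from the certainty attached to it.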
Community
On-policy distillation (OPD) improves task accuracy but systematically traps models in severe overconfidence. We trace this to an information mismatch between training and deployment, and introduce CaOPD to fix it.
→ Identifies a pervasive Scaling Law of Miscalibration: even frontier LLMs exhibit massive calibration gaps that scale does not resolve
→ Formalizes how privileged teacher context induces entropy collapse and optimism bias in the student
→ Replaces self-reported confidence with a student-grounded empirical target, decoupling what the model answers from how certain it should be
→ Achieves Pareto-optimal calibration without the capability tax of RL-based methods, enabling a compact 8B model to rival frontier LLMs on reliability
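Calibration gaps of the kind described above are commonly quantified with expected calibration error (ECE). The sketch below is a standard binned ECE, offered as illustration; whether the paper uses this exact metric is an assumption:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: partition predictions by confidence,
    then take the weighted mean |accuracy - avg confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 to last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: reports 0.95 confidence but is right half the time,
# giving a large ECE; a calibrated model's ECE would be near zero.
overconfident_ece = expected_calibration_error([0.95] * 10,
                                               [True] * 5 + [False] * 5)
```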
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Self-Distilled RLVR (2026)
- SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting (2026)
- Entropy-Aware On-Policy Distillation of Language Models (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)
- HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation (2026)
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe (2026)
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models (2026)
Get this paper in your agent:
hf papers read 2604.16830
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash