Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Abstract
We present techniques for modifying teacher-generated reasoning traces that prevent unauthorized knowledge distillation while maintaining answer correctness and enabling detectable watermarks.
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while the others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.
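The abstract does not disclose the exact rewriting pipeline, so the following is a minimal, hypothetical Python sketch of how instruction-based trace rewriting could be structured: split the final answer off the reasoning, send only the reasoning to a rewriting LLM, and reattach the original answer verbatim so correctness is preserved. The `Final answer:` marker, the instruction text, and all function names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of instruction-based trace rewriting.
# Assumptions (not from the paper): the trace ends with a "Final answer:"
# marker, and a separate LLM call performs the actual paraphrasing.

REWRITE_INSTRUCTION = (
    "Paraphrase the reasoning below so it stays correct and coherent, "
    "but restructure the steps and wording."
)

def split_trace(trace: str, marker: str = "Final answer:") -> tuple[str, str]:
    """Separate the reasoning portion from the final answer.

    If the marker is absent, rpartition returns an empty reasoning part
    and the whole trace as the 'answer', which a caller should treat as
    'nothing safe to rewrite'.
    """
    reasoning, _, answer = trace.rpartition(marker)
    return reasoning.strip(), answer.strip()

def build_rewrite_prompt(trace: str) -> tuple[str, str]:
    """Return (prompt for the rewriting LLM, answer to reattach verbatim)."""
    reasoning, answer = split_trace(trace)
    prompt = f"{REWRITE_INSTRUCTION}\n\n{reasoning}"
    return prompt, answer

def reassemble(rewritten_reasoning: str, answer: str) -> str:
    """Reattach the untouched answer, so only the reasoning is degraded
    as distillation training data."""
    return f"{rewritten_reasoning}\nFinal answer: {answer}"
```

Keeping the answer outside the rewriting step is what lets such a scheme degrade the trace's training value without risking the teacher's measured accuracy.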
Community
We present four reasoning-trace rewriting methods -- two gradient-based and two easy-to-use instruction-based -- that achieve anti-distillation along with watermarks that are both easily verifiable and stealthy.
Accepted to ACL 2026.
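The page does not describe how the embedded watermarks are verified, so here is a hedged, self-contained sketch in the style of keyed green-list watermarking: a secret key pseudo-randomly marks about half of the vocabulary "green", rewritten traces are biased toward green tokens, and a one-sided z-test on a suspect student's output flags over-representation. The key, threshold, and function names are assumptions for illustration only.

```python
# Hypothetical green-list watermark detector (not the paper's scheme).
import hashlib
import math

def is_green(token: str, key: str = "demo-key") -> bool:
    """Keyed pseudo-random partition: roughly half of all tokens are 'green'."""
    digest = hashlib.sha256(f"{key}|{token}".encode()).digest()
    return digest[0] < 128  # first byte is uniform over 0..255

def detect_watermark(tokens: list[str], key: str = "demo-key",
                     threshold: float = 4.0) -> tuple[bool, float]:
    """One-sided z-test: watermarked output over-produces green tokens.

    Under the null hypothesis (no watermark) the green count is
    Binomial(n, 0.5), so z > 4 corresponds to a per-test false-alarm
    rate of about 3e-5 -- i.e. essentially no false alarms.
    """
    n = len(tokens)
    greens = sum(is_green(t, key) for t in tokens)
    z = (greens - 0.5 * n) / math.sqrt(0.25 * n)
    return z > threshold, z
```

A detector of this shape is cheap to run on any suspect student's generations and needs only the secret key, not teacher logits.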
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation (2026)
- Robust Safety Monitoring of Language Models via Activation Watermarking (2026)
- MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning (2026)
- Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors (2026)
- RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience (2026)
- Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models (2026)
- Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs (2026)