# ProtGPT2-Distilled-Tiny

A compact protein language model distilled from ProtGPT2 using complementary-regularizer distillation, a method that combines uncertainty-aware position weighting with calibration-aware label smoothing to achieve an 87% better perplexity ratio than standard knowledge distillation at 20x compression.

Preprint: *Distilling Protein Language Models with Complementary Regularizers* (bioRxiv, 2026)
Code: github.com/ewijaya/protein-lm-distill
## Model Summary
| Property | Value |
|---|---|
| Parameters | ~37M |
| Architecture | GPT-2 (4 layers, 4 heads, 512 embedding dim) |
| Compression | 20x (vs. 738M teacher) |
| Perplexity ratio | 5.06 (87% better than baseline KD) |
| Expected calibration error | 0.183 (47% better than baseline) |
| Inference speedup | 5.3x over ProtGPT2 |
| GPU memory | 170 MB (19x reduction from teacher) |
| Throughput | ~111 sequences/min on NVIDIA L40S |
## Quick Start

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

model = GPT2LMHeadModel.from_pretrained("littleworth/protgpt2-distilled-tiny")
tokenizer = GPT2Tokenizer.from_pretrained("littleworth/protgpt2-distilled-tiny")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

sequences = generator(
    "<|endoftext|>",
    max_length=256,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
    pad_token_id=0,
    truncation=True,
)

for i, seq in enumerate(sequences):
    # Strip the special token and newlines, keeping only amino acid letters
    protein = seq["generated_text"].replace("<|endoftext|>", "").replace("\n", "")
    protein = "".join(c for c in protein if c.isalpha())
    print(f">Generated_{i}\n{protein}")
```
## How It Works

This model was trained using complementary-regularizer distillation, which augments standard temperature-scaled knowledge distillation (Hinton et al., 2015) with two protein-specific enhancements:

1. **Uncertainty-aware position weighting** uses teacher entropy to emphasize biologically variable regions (loops, surface residues) during distillation, directing learning capacity toward positions where the teacher's distributional knowledge is richest.
2. **Calibration-aware label smoothing** applies confidence-dependent smoothing to teacher distributions, acting as a noise filter that removes miscalibration artifacts while preserving genuine amino acid substitution preferences.
The key finding: each enhancement applied individually degrades distillation quality (+95% and +109% perplexity increase, respectively), yet their combination yields a 53% perplexity improvement over the baseline. We call this phenomenon complementary regularizers: smoothing removes the noise that weighting would amplify, while weighting compensates for the signal attenuation that smoothing introduces.
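The interaction of the two regularizers with the standard KD objective can be sketched as follows. This is an illustrative implementation, not the exact code from the paper: the precise functional forms of the entropy weighting and confidence-dependent smoothing (here, a blend toward the uniform distribution scaled by teacher confidence) are assumptions; `T`, `alpha`, and `lam` correspond to the hyperparameters in Training Details below.

```python
import torch
import torch.nn.functional as F

def complementary_kd_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5, lam=0.1):
    """Sketch of temperature-scaled KD with the two regularizers.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids for the hard-label CE term
    """
    vocab = teacher_logits.size(-1)

    # Standard KD: soften the teacher with temperature T
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)

    # Calibration-aware smoothing (illustrative form): the more confident
    # the teacher, the more mass is blended toward uniform, damping
    # potentially miscalibrated peaks
    confidence = teacher_probs.max(dim=-1, keepdim=True).values
    smoothed = (1 - lam * confidence) * teacher_probs + lam * confidence / vocab

    # Uncertainty-aware weighting: per-position teacher entropy,
    # normalized so high-entropy (variable) positions get more weight
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-9).log()).sum(-1)
    weights = entropy / entropy.mean()

    # Position-weighted soft-label KL term (T^2 rescaling as in Hinton et al.)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kl = F.kl_div(log_student, smoothed, reduction="none").sum(-1)
    soft_loss = (weights * kl).mean() * (T ** 2)

    # Hard-label cross-entropy term
    hard_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                                labels.reshape(-1))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The sketch makes the complementarity concrete: smoothing flattens the teacher peaks that the entropy weights would otherwise over-amplify, while the weights restore emphasis that smoothing attenuates.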
## Performance

### Compared to Baseline Knowledge Distillation
| Method | PPL Ratio | ECE | KL Divergence |
|---|---|---|---|
| Baseline KD | 39.91 | 0.345 | 3.16 |
| This model (complementary regularizers) | 5.06 | 0.183 | 1.34 |
| Improvement | 87% | 47% | 58% |
### Model Family Comparison
| Model | Params | Compression | PPL Ratio | Speedup | GPU Memory |
|---|---|---|---|---|---|
| ProtGPT2 (teacher) | 738M | 1x | 1.00 | 1.0x | 3,211 MB |
| Tiny (this model) | 37M | 20x | 5.06 | 5.3x | 170 MB |
| Small | 78M | 9.4x | 7.05 | 4.1x | 343 MB |
| Medium | 194M | 3.8x | 2.58 | 2.4x | 836 MB |
## Biological Validity
Generated sequences produce amino acid distributions closely matching natural proteins (KL divergence from UniProt < 0.015), confirming that compressed models preserve biologically realistic sequence statistics.
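The KL-divergence check above can be reproduced with a short script. The background frequencies below are approximate, commonly cited UniProt/Swiss-Prot amino acid frequencies included only for illustration; they are not the exact reference distribution used in the paper.

```python
import math
from collections import Counter

# Approximate natural amino acid background frequencies (assumed values
# for illustration; normalized inside the function)
UNIPROT_FREQS = {
    "A": 0.083, "R": 0.056, "N": 0.041, "D": 0.055, "C": 0.014,
    "Q": 0.039, "E": 0.067, "G": 0.071, "H": 0.023, "I": 0.059,
    "L": 0.097, "K": 0.058, "M": 0.024, "F": 0.039, "P": 0.047,
    "S": 0.067, "T": 0.054, "W": 0.011, "Y": 0.029, "V": 0.069,
}

def aa_kl_divergence(sequences, background=UNIPROT_FREQS, eps=1e-6):
    """KL(generated || background) over single amino acid frequencies."""
    z = sum(background.values())          # normalize the background
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in background)
    kl = 0.0
    for aa, q in background.items():
        # eps-smoothed frequency of aa in the generated set
        p = (counts[aa] + eps) / (total + eps * len(background))
        kl += p * math.log(p / (q / z))
    return kl
```

Feeding in sequences generated with the Quick Start snippet, a value below 0.015 would match the reported agreement with natural proteins.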
## When to Use This Model
- High-throughput screening: 111 seq/min enables scoring ~10^6 candidates in ~6 GPU-hours on consumer hardware
- Resource-constrained deployment: 170 MB GPU memory fits on shared lab workstations
- On-premise inference: Run locally without sending proprietary sequences to cloud APIs
- Antibody/enzyme engineering: Fast iteration in ML-guided design-build-test cycles
For applications where perplexity matters more than speed, consider the Medium variant (2.58 PPL ratio, 2.4x speedup).
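For the screening use case, the core operation is ranking candidates by their perplexity under the distilled model. The helper below is an illustrative sketch (not a function shipped with the repo) showing how per-sequence perplexity follows from a causal LM's raw logits; with the Quick Start model, the logits for a tokenized candidate come from `model(ids).logits[0]`.

```python
import torch
import torch.nn.functional as F

def perplexity_from_logits(logits, input_ids):
    """Perplexity of a token sequence under a causal LM's raw logits.

    logits: (seq_len, vocab_size); input_ids: (seq_len,).
    Position t's logits predict token t + 1, so the last position is dropped.
    """
    log_probs = F.log_softmax(logits[:-1], dim=-1)
    # Log-probability the model assigned to each actual next token
    token_lp = log_probs.gather(1, input_ids[1:].unsqueeze(1)).squeeze(1)
    # Perplexity = exp of the mean negative log-likelihood per token
    return torch.exp(-token_lp.mean()).item()
```

Sorting candidates by this value (lower = more natural to the model) gives the ranking used in ML-guided design-build-test loops.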
## Training Details
| Parameter | Value |
|---|---|
| Teacher model | nferruz/ProtGPT2 (738M) |
| Training data | 10% UniProt subset (Parquet) |
| Temperature (T) | 2.0 |
| Alpha | 0.5 |
| Learning rate | 5e-4 (with 500-step linear warmup) |
| Epochs | 3 |
| Batch size | 32 (effective) |
| Optimizer | AdamW |
| Precision | FP16 |
| Uncertainty weighting | Enabled |
| Calibration smoothing | Enabled (lambda=0.1) |
## Citation

```bibtex
@article{Wijaya2026.02.17.706304,
  author    = {Wijaya, Edward},
  title     = {Distilling Protein Language Models with Complementary Regularizers},
  year      = {2026},
  doi       = {10.64898/2026.02.17.706304},
  publisher = {Cold Spring Harbor Laboratory},
  journal   = {bioRxiv}
}
```
## Related Models

- ProtGPT2 (nferruz/ProtGPT2): the teacher model
- protgpt2-distilled-small: 78M parameters, 9.4x compression
- protgpt2-distilled-medium: 194M parameters, 3.8x compression
## License

Apache 2.0