# ProtGPT2-Distilled-Tiny

A compact protein language model distilled from ProtGPT2 using complementary-regularizer distillation, a method that combines uncertainty-aware position weighting with calibration-aware label smoothing to achieve an 87% better perplexity ratio than standard knowledge distillation at 20x compression.

Preprint: *Distilling Protein Language Models with Complementary Regularizers* (bioRxiv, 2026)
Code: github.com/ewijaya/protein-lm-distill
## Model Summary
| Property | Value |
|---|---|
| Parameters | ~37M |
| Architecture | GPT-2 (4 layers, 4 heads, 512 embedding dim) |
| Compression | 20x (vs. 738M teacher) |
| Perplexity ratio | 5.06 (87% better than baseline KD) |
| Expected calibration error | 0.183 (47% better than baseline) |
| Inference speedup | 5.3x over ProtGPT2 |
| GPU memory | 170 MB (19x reduction from teacher) |
| Throughput | ~111 sequences/min on NVIDIA L40S |
## Quick Start

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

model = GPT2LMHeadModel.from_pretrained("littleworth/protgpt2-distilled-tiny")
tokenizer = GPT2Tokenizer.from_pretrained("littleworth/protgpt2-distilled-tiny")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

sequences = generator(
    "<|endoftext|>",
    max_length=256,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
    pad_token_id=0,
    truncation=True,
)

for i, seq in enumerate(sequences):
    # Strip the special token and newlines, keeping only amino acid letters
    protein = seq["generated_text"].replace("<|endoftext|>", "").replace("\n", "")
    protein = "".join(c for c in protein if c.isalpha())
    print(f">Generated_{i}\n{protein}")
```
## How It Works

This model was trained using complementary-regularizer distillation, which augments standard temperature-scaled knowledge distillation (Hinton et al., 2015) with two protein-specific enhancements:

1. **Uncertainty-aware position weighting** uses teacher entropy to emphasize biologically variable regions (loops, surface residues) during distillation, directing learning capacity toward positions where the teacher's distributional knowledge is richest.
2. **Calibration-aware label smoothing** applies confidence-dependent smoothing to teacher distributions, acting as a noise filter that removes miscalibration artifacts while preserving genuine amino acid substitution preferences.
The key finding: each enhancement applied individually degrades distillation quality (+95% and +109% perplexity increase, respectively), yet their combination yields a 53% perplexity improvement over the baseline. We call this phenomenon complementary regularizers: smoothing removes the noise that weighting would amplify, while weighting compensates for the signal attenuation that smoothing introduces.
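The interaction of the two regularizers with the standard KD objective can be sketched as follows. This is an illustrative implementation, not the exact code from the paper: the precise functional forms of the entropy weighting and confidence-dependent smoothing (here, a blend toward the uniform distribution scaled by teacher confidence) are assumptions; `T`, `alpha`, and `lam` correspond to the hyperparameters in Training Details below.

```python
import torch
import torch.nn.functional as F

def complementary_kd_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5, lam=0.1):
    """Sketch of temperature-scaled KD with the two regularizers.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids for the hard-label CE term
    """
    vocab = teacher_logits.size(-1)

    # Standard KD: soften the teacher with temperature T
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)

    # Calibration-aware smoothing (illustrative form): the more confident
    # the teacher, the more mass is blended toward uniform, damping
    # potentially miscalibrated peaks
    confidence = teacher_probs.max(dim=-1, keepdim=True).values
    smoothed = (1 - lam * confidence) * teacher_probs + lam * confidence / vocab

    # Uncertainty-aware weighting: per-position teacher entropy,
    # normalized so high-entropy (variable) positions get more weight
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-9).log()).sum(-1)
    weights = entropy / entropy.mean()

    # Position-weighted soft-label KL term (T^2 rescaling as in Hinton et al.)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kl = F.kl_div(log_student, smoothed, reduction="none").sum(-1)
    soft_loss = (weights * kl).mean() * (T ** 2)

    # Hard-label cross-entropy term
    hard_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                                labels.reshape(-1))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The sketch makes the complementarity concrete: smoothing flattens the teacher peaks that the entropy weights would otherwise over-amplify, while the weights restore emphasis that smoothing attenuates.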
## Performance

### Compared to Baseline Knowledge Distillation
| Method | PPL Ratio | ECE | KL Divergence |
|---|---|---|---|
| Baseline KD | 39.91 | 0.345 | 3.16 |
| This model (complementary regularizers) | 5.06 | 0.183 | 1.34 |
| Improvement | 87% | 47% | 58% |
### Model Family Comparison
| Model | Params | Compression | PPL Ratio | Speedup | GPU Memory |
|---|---|---|---|---|---|
| ProtGPT2 (teacher) | 738M | 1x | 1.00 | 1.0x | 3,211 MB |
| Tiny (this model) | 37M | 20x | 5.06 | 5.3x | 170 MB |
| Small | 78M | 9.4x | 7.05 | 4.1x | 343 MB |
| Medium | 194M | 3.8x | 2.58 | 2.4x | 836 MB |
## Biological Validity
Generated sequences produce amino acid distributions closely matching natural proteins (KL divergence from UniProt < 0.015), confirming that compressed models preserve biologically realistic sequence statistics.
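The KL-divergence check above can be reproduced with a short script. The background frequencies below are approximate, commonly cited UniProt/Swiss-Prot amino acid frequencies included only for illustration; they are not the exact reference distribution used in the paper.

```python
import math
from collections import Counter

# Approximate natural amino acid background frequencies (assumed values
# for illustration; normalized inside the function)
UNIPROT_FREQS = {
    "A": 0.083, "R": 0.056, "N": 0.041, "D": 0.055, "C": 0.014,
    "Q": 0.039, "E": 0.067, "G": 0.071, "H": 0.023, "I": 0.059,
    "L": 0.097, "K": 0.058, "M": 0.024, "F": 0.039, "P": 0.047,
    "S": 0.067, "T": 0.054, "W": 0.011, "Y": 0.029, "V": 0.069,
}

def aa_kl_divergence(sequences, background=UNIPROT_FREQS, eps=1e-6):
    """KL(generated || background) over single amino acid frequencies."""
    z = sum(background.values())          # normalize the background
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in background)
    kl = 0.0
    for aa, q in background.items():
        # eps-smoothed frequency of aa in the generated set
        p = (counts[aa] + eps) / (total + eps * len(background))
        kl += p * math.log(p / (q / z))
    return kl
```

Feeding in sequences generated with the Quick Start snippet, a value below 0.015 would match the reported agreement with natural proteins.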
## When to Use This Model
- High-throughput screening: 111 seq/min enables scoring ~10^6 candidates in ~6 GPU-hours on consumer hardware
- Resource-constrained deployment: 170 MB GPU memory fits on shared lab workstations
- On-premise inference: Run locally without sending proprietary sequences to cloud APIs
- Antibody/enzyme engineering: Fast iteration in ML-guided design-build-test cycles
For applications where perplexity matters more than speed, consider the Medium variant (2.58 PPL ratio, 2.4x speedup).
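For the screening use case, the core operation is ranking candidates by their perplexity under the distilled model. The helper below is an illustrative sketch (not a function shipped with the repo) showing how per-sequence perplexity follows from a causal LM's raw logits; with the Quick Start model, the logits for a tokenized candidate come from `model(ids).logits[0]`.

```python
import torch
import torch.nn.functional as F

def perplexity_from_logits(logits, input_ids):
    """Perplexity of a token sequence under a causal LM's raw logits.

    logits: (seq_len, vocab_size); input_ids: (seq_len,).
    Position t's logits predict token t + 1, so the last position is dropped.
    """
    log_probs = F.log_softmax(logits[:-1], dim=-1)
    # Log-probability the model assigned to each actual next token
    token_lp = log_probs.gather(1, input_ids[1:].unsqueeze(1)).squeeze(1)
    # Perplexity = exp of the mean negative log-likelihood per token
    return torch.exp(-token_lp.mean()).item()
```

Sorting candidates by this value (lower = more natural to the model) gives the ranking used in ML-guided design-build-test loops.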
## Training Details
| Parameter | Value |
|---|---|
| Teacher model | nferruz/ProtGPT2 (738M) |
| Training data | 10% UniProt subset (Parquet) |
| Temperature (T) | 2.0 |
| Alpha | 0.5 |
| Learning rate | 5e-4 (with 500-step linear warmup) |
| Epochs | 3 |
| Batch size | 32 (effective) |
| Optimizer | AdamW |
| Precision | FP16 |
| Uncertainty weighting | Enabled |
| Calibration smoothing | Enabled (lambda=0.1) |
## Citation

```bibtex
@article{Wijaya2026.02.17.706304,
  author    = {Wijaya, Edward},
  title     = {Distilling Protein Language Models with Complementary Regularizers},
  year      = {2026},
  doi       = {10.64898/2026.02.17.706304},
  publisher = {Cold Spring Harbor Laboratory},
  journal   = {bioRxiv}
}
```
## Related Models

- ProtGPT2 (nferruz/ProtGPT2): the teacher model
- protgpt2-distilled-small: 78M parameters, 9.4x compression
- protgpt2-distilled-medium: 194M parameters, 3.8x compression
## License

Apache 2.0