# DART Student Backbones
Distilled lightweight backbones for DART (Detect Anything in Real Time), a training-free framework that converts SAM3 into a real-time open-vocabulary multi-class detector.
For more details, see the paper: Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection.
These student backbones replace SAM3's 439M-parameter ViT-H/14 backbone with lightweight alternatives via adapter distillation: a small FPN adapter (~5M trainable parameters) is trained to project student features into the ViT-H feature space while the SAM3 encoder-decoder remains frozen.
## Models

| Model | Backbone Params | COCO AP | AP50 | AP75 | AP_S | AP_L | Backbone (ms) | FPS (4 cls) |
|---|---|---|---|---|---|---|---|---|
| DART (ViT-H teacher) | 439M | 55.8 | 73.4 | 61.5 | 40.3 | 70.7 | 55 | 13.5 |
| DART-Pruned-16 | 220M | 53.6 | 70.6 | 58.8 | 37.7 | 68.8 | 34 | 19.1 |
| DART-Pruned-20 | 177M | 52.4 | 69.2 | 57.4 | 36.4 | 67.9 | 29 | 21.6 |
| DART-Pruned-22 | 149M | 50.2 | 66.6 | 55.1 | 33.9 | 65.7 | 27 | 22.1 |
| DART-Pruned-24 | 121M | 40.8 | 55.3 | 44.4 | 24.3 | 54.9 | 25 | 23.3 |
| DART-RepViT-M2.3 | 8.2M | 38.7 | 53.1 | 42.3 | 22.6 | 49.9 | 16 | 30.2 |
| DART-TinyViT-21M | 21M | 30.1 | 42.4 | 32.6 | 17.4 | 37.8 | 15 | 31.3 |
| DART-EfficientViT-L2 | 9.2M | 21.7 | 31.5 | 23.5 | 13.7 | 24.2 | 13 | 33.4 |
| DART-EfficientViT-L1 | 5.3M | 16.3 | 24.2 | 17.4 | 10.6 | 17.3 | 13 | 33.4 |
### FPS vs. class count (1008 px, RTX 4080, TRT FP16, sequential mode)
| Backbone | 1 cls | 2 cls | 4 cls | 8 cls | 16 cls | 80 cls |
|---|---|---|---|---|---|---|
| ViT-H (full) | 15.7 | 15.1 | 13.5 | 10.9 | 7.7 | 2.4 |
| ViT-H Pruned-16 | 24.9 | 22.4 | 19.1 | 14.4 | 9.3 | 2.6 |
| ViT-H Pruned-22 | 27.1 | 26.7 | 22.1 | 15.9 | 10.1 | — |
| RepViT-M2.3 | 44.6 | 38.9 | 30.2 | 19.7 | 11.3 | 2.7 |
| TinyViT-21M | 47.9 | 41.8 | 31.3 | 20.2 | 11.5 | 2.7 |
| EfficientViT-L2 | 51.8 | 44.9 | 33.4 | 21.2 | 11.8 | 2.7 |
| EfficientViT-L1 | 52.4 | 44.9 | 33.4 | 21.1 | 11.6 | 2.7 |
| Encoder-decoder only (ms) | 7 | 10 | 17 | 35 | 73 | 357 |
All results are on COCO val2017 (5,000 images, 80 classes, 1008×1008 resolution) with the TRT FP16 backbone and encoder-decoder engines on a single RTX 4080. FPS is measured over 100 frames of traffic video in sequential mode; at 80 classes the encoder-decoder is batched in chunks of 16. The teacher uses training-free multi-class detection (no detection training), the students use adapter distillation with a frozen encoder-decoder, and the Pruned models use self-distillation with 16–24 of the 32 ViT blocks removed.
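Since sequential mode runs one backbone pass and one encoder-decoder pass per frame, the FPS columns are roughly the reciprocal of the two latencies combined. This is a back-of-envelope model that ignores small pre/post-processing overheads, which is why it lands slightly above the measured numbers:

```python
def sequential_fps(backbone_ms: float, encdec_ms: float) -> float:
    """Approximate sequential-mode throughput: one backbone pass plus one
    encoder-decoder pass per frame, latencies in milliseconds."""
    return 1000.0 / (backbone_ms + encdec_ms)

# At 4 classes the encoder-decoder takes ~17 ms (table above), so:
# ViT-H (full):    sequential_fps(55, 17) ~ 13.9 FPS (measured: 13.5)
# RepViT-M2.3:     sequential_fps(16, 17) ~ 30.3 FPS (measured: 30.2)
# EfficientViT-L2: sequential_fps(13, 17) ~ 33.3 FPS (measured: 33.4)
```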
## Pruned Backbone
The DART-Pruned models remove 16–24 of the 32 ViT-H blocks and recover quality via self-distillation: the full backbone serves as a frozen teacher while the pruned copy is trained with an MSE loss on FPN features.
### Training

```bash
# 8xH100 via SLURM
srun --ntasks=1 torchrun --nproc_per_node=8 scripts/distill.py \
    --data-dir /path/to/coco/train2017 \
    --checkpoint sam3.pt \
    --phase prune \
    --skip-blocks "5,10,12,14,17,18,19,20,21,22,24,25,26,27,28,30" \
    --epochs 100 --batch-size 32 --lr 1e-4 \
    --output-dir skipblocks_distill
```
### Export and evaluate

```bash
# Export the pruned backbone via the HF path (fused attention kernels)
PYTHONIOENCODING=utf-8 python scripts/export_hf_backbone.py \
    --image x.jpg \
    --output-onnx onnx_hf_backbone_1008_pruned/hf_backbone.onnx \
    --output-engine hf_backbone_1008_pruned_fp16.engine \
    --skip-blocks "5,10,12,14,17,18,19,20,21,22,24,25,26,27,28,30"

# Evaluate on COCO val2017
PYTHONIOENCODING=utf-8 python scripts/eval_coco_official.py \
    --images-dir D:/val2017 \
    --ann-file D:/coco2017labels/coco/annotations/instances_val2017.json \
    --checkpoint sam3.pt \
    --pruned-checkpoint distilled/pruned_16blocks.pt \
    --configs "pruned16_1008=trt:hf_backbone_1008_pruned_fp16.engine;encdec:enc_dec_1008_c16_presence_fp16.engine;imgsz:1008"
```
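The `--configs` argument packs engine paths and options into one string. Its shape, inferred from the example above (`name=key:value;key:value;...`), can be unpacked like this; the eval script's actual parser may differ:

```python
def parse_config_spec(spec: str):
    """Split one --configs entry into its name and key/value options.
    Format inferred from the documented example, not the script's code."""
    name, _, rest = spec.partition("=")
    opts = dict(item.split(":", 1) for item in rest.split(";"))
    return name, opts

name, opts = parse_config_spec(
    "pruned16_1008=trt:hf_backbone_1008_pruned_fp16.engine;"
    "encdec:enc_dec_1008_c16_presence_fp16.engine;imgsz:1008"
)
# name -> "pruned16_1008"; opts["imgsz"] -> "1008"
```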
### Block selection

Blocks were selected by greedy importance analysis (`scripts/analyze_block_importance.py`). Later blocks (17-30) are the least important, while early blocks (0-8) and the global-attention blocks (7, 15, 23, 31) are critical. The pruned checkpoint stores `skip_blocks` metadata, so the pruning pattern is applied automatically at load time.
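The greedy procedure can be sketched as follows. The function and `loss_fn` names are illustrative, not the analysis script's API; with a toy importance profile that protects early and global-attention blocks, the selection reproduces the reported pattern:

```python
def greedy_block_selection(n_blocks, n_remove, loss_fn):
    """Repeatedly remove the block whose removal increases the loss least."""
    removed = set()
    for _ in range(n_remove):
        candidates = [b for b in range(n_blocks) if b not in removed]
        best = min(candidates, key=lambda b: loss_fn(removed | {b}))
        removed.add(best)
    return sorted(removed)

# Toy importance: early blocks (0-8) and global-attention blocks (15, 23, 31)
# are expensive to remove; all remaining blocks are cheap.
importance = [10.0 if b <= 8 or b in (15, 23, 31) else 1.0 for b in range(32)]
skip = greedy_block_selection(32, 16, lambda rem: sum(importance[b] for b in rem))
assert all(b not in skip for b in (0, 7, 15, 23, 31))  # critical blocks survive
```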
## Distilled Student Architecture

Each student model consists of:
- A frozen ImageNet-pretrained backbone from timm (`features_only=True`, 3 stages)
- A trained FPN adapter (3 levels of Conv1x1 + bilinear interpolation + Conv3x3) that maps backbone features to SAM3's expected FPN format: `(B, 256, 288, 288)`, `(B, 256, 144, 144)`, `(B, 256, 72, 72)`
- The original SAM3 encoder-decoder, kept frozen and unchanged
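A minimal sketch of such an adapter. The class name, input channel counts, and stage resolutions are illustrative assumptions, not the repo's code; only the Conv1x1 → bilinear resize → Conv3x3 structure and the output shapes come from the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNAdapterSketch(nn.Module):
    """Projects 3 backbone stages to SAM3's 256-channel FPN levels."""

    def __init__(self, in_channels=(64, 128, 256), out_channels=256,
                 out_sizes=(288, 144, 72)):
        super().__init__()
        self.out_sizes = out_sizes
        # Conv1x1: project each stage to the common channel width
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # Conv3x3: smooth each level after resizing
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, feats):
        outs = []
        for x, proj, smooth, size in zip(feats, self.proj, self.smooth, self.out_sizes):
            x = F.interpolate(proj(x), size=(size, size),
                              mode="bilinear", align_corners=False)
            outs.append(smooth(x))
        return outs

adapter = FPNAdapterSketch()
# Hypothetical stage resolutions for a 1008px input (strides 4, 8, 16)
feats = [torch.randn(1, c, s, s) for c, s in [(64, 252), (128, 126), (256, 63)]]
shapes = [tuple(o.shape) for o in adapter(feats)]
# shapes -> [(1, 256, 288, 288), (1, 256, 144, 144), (1, 256, 72, 72)]
```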
## Usage

### Loading a student model

```python
import torch
from sam3.distillation.sam3_student import build_sam3_student_model

model = build_sam3_student_model(
    backbone_config="repvit_m2_3",  # or efficientvit_l1, efficientvit_l2, tiny_vit_21m
    teacher_checkpoint="sam3.pt",   # SAM3 weights (encoder-decoder)
    device="cuda",
)
ckpt = torch.load("distilled/repvit_m2_3_distilled.pt", map_location="cuda")
model.backbone.student_backbone.load_state_dict(ckpt["student_state_dict"])
model.eval()
```
### Inference

```python
from sam3.model.sam3_multiclass_fast import Sam3MultiClassPredictorFast

predictor = Sam3MultiClassPredictorFast(model, device="cuda")
predictor.set_classes(["person", "car", "dog"])
state = predictor.set_image(image)  # PIL Image
results = predictor.predict(state, confidence_threshold=0.3)
# results: dict with 'boxes', 'scores', 'class_ids', 'class_names'
```
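The output dict is plain Python, so post-processing needs nothing from the framework. A small illustrative helper (not part of the repo) for ranking and printing detections, using the keys documented above:

```python
def top_detections(results, top_k=5):
    """Return the top_k detections as formatted strings, highest score first."""
    order = sorted(range(len(results["scores"])),
                   key=lambda i: results["scores"][i], reverse=True)
    lines = []
    for i in order[:top_k]:
        x1, y1, x2, y2 = results["boxes"][i]
        lines.append(f"{results['class_names'][i]} {results['scores'][i]:.2f} "
                     f"({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    return lines
```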
### COCO evaluation

```bash
PYTHONIOENCODING=utf-8 python scripts/eval_all_students.py
```

This runs `scripts/eval_coco.py` for all four student models and produces `coco_eval_all_students.json`.
## Adapter training
All adapters were trained on COCO train2017 (118K unlabeled images, no annotations used) for 5 epochs with AdamW (lr=1e-3, weight decay=0.01, cosine schedule) using multi-scale MSE loss between student and teacher FPN features (level weights: 0.15, 0.20, 0.65). Training takes approximately 2 GPU-hours on a single RTX 4080.
```bash
python scripts/distill.py \
    --data-dir /path/to/coco/train2017 \
    --checkpoint sam3.pt \
    --backbone repvit_m2_3 \
    --epochs 5 --batch-size 2 --lr 1e-3
```
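The training objective described above can be sketched in a few lines. The level weights come from the recipe; numpy stands in for the actual tensor code, and tiny feature maps stand in for the real 288/144/72 FPN levels:

```python
import numpy as np

def multiscale_fpn_mse(student_feats, teacher_feats, weights=(0.15, 0.20, 0.65)):
    """Weighted sum of per-level MSE between student and teacher FPN features."""
    return sum(w * float(np.mean((s - t) ** 2))
               for w, s, t in zip(weights, student_feats, teacher_feats))

teacher = [np.zeros((1, 256, s, s)) for s in (8, 4, 2)]  # stand-ins for 288/144/72
student = [t + 1.0 for t in teacher]  # off by 1 everywhere at every level
# multiscale_fpn_mse(student, teacher) -> 1.0 (weights sum to 1, per-level MSE is 1)
```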
### Supported backbones

| Config name | timm model | Stages |
|---|---|---|
| efficientvit_l1 | efficientvit_l1.r224_in1k | (0, 1, 2) |
| efficientvit_l2 | efficientvit_l2.r384_in1k | (0, 1, 2) |
| repvit_m2_3 | repvit_m2_3.dist_450e_in1k | (0, 1, 2) |
| tiny_vit_21m | tiny_vit_21m_224.dist_in22k_ft_in1k | (0, 1, 2) |
| vit_base | vit_base_patch16_224.augreg2_in21k_ft_in1k | (0, 1, 2) |
| vit_base_dinov3 | vit_base_patch16_dinov3.lvd1689m | (0, 1, 2) |
## Checkpoint format

Each `.pt` file contains:

```python
{
    "epoch": 5,
    "loss": float,
    "adapter_state_dict": { ... },  # FPN adapter weights only
    "student_state_dict": { ... },  # Full student backbone + adapter state
}
```
## TRT export

```bash
python scripts/export_student_trt.py --models repvit_m2_3 --imgsz 1008
```

This produces ONNX and TRT FP16 engine files. The encoder-decoder is exported separately (split-engine design) to preserve open-vocabulary flexibility.
## Requirements

- PyTorch >= 2.7.0
- timm
- SAM3 checkpoint (`sam3.pt`)
- TensorRT >= 10.9 (for TRT deployment)
## Citation

```bibtex
@article{dart2026,
  title={Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection},
  author={Turkcan, Mehmet Kerem},
  journal={arXiv preprint},
  year={2026}
}
```
## License
The student adapter weights are released under the same license as SAM3. The underlying backbone weights (RepViT, TinyViT, EfficientViT) retain their original licenses from timm.