# DART Student Backbones
Distilled lightweight backbones for DART (Detect Anything in Real Time), a training-free framework that converts SAM3 into a real-time open-vocabulary multi-class detector.
For more details, see the paper: Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection.
These student backbones replace SAM3's 439M-parameter ViT-H/14 backbone with lightweight alternatives via adapter distillation: a small FPN adapter (~5M trainable parameters) is trained to project student features into the ViT-H feature space while the SAM3 encoder-decoder remains frozen.
## Models

| Model | Backbone Params | COCO AP | AP50 | AP75 | AP_S | AP_L | Backbone (ms) | FPS (4 cls) |
|---|---|---|---|---|---|---|---|---|
| DART (ViT-H teacher) | 439M | 55.8 | 73.4 | 61.5 | 40.3 | 70.7 | 55 | 13.5 |
| DART-Pruned-16 | 220M | 53.6 | 70.6 | 58.8 | 37.7 | 68.8 | 34 | 19.1 |
| DART-Pruned-20 | 177M | 52.4 | 69.2 | 57.4 | 36.4 | 67.9 | 29 | 21.6 |
| DART-Pruned-22 | 149M | 50.2 | 66.6 | 55.1 | 33.9 | 65.7 | 27 | 22.1 |
| DART-Pruned-24 | 121M | 40.8 | 55.3 | 44.4 | 24.3 | 54.9 | 25 | 23.3 |
| DART-RepViT-M2.3 | 8.2M | 38.7 | 53.1 | 42.3 | 22.6 | 49.9 | 16 | 30.2 |
| DART-TinyViT-21M | 21M | 30.1 | 42.4 | 32.6 | 17.4 | 37.8 | 15 | 31.3 |
| DART-EfficientViT-L2 | 9.2M | 21.7 | 31.5 | 23.5 | 13.7 | 24.2 | 13 | 33.4 |
| DART-EfficientViT-L1 | 5.3M | 16.3 | 24.2 | 17.4 | 10.6 | 17.3 | 13 | 33.4 |
### FPS vs. class count (1008 px, RTX 4080, TRT FP16, sequential mode)
| Backbone | 1 cls | 2 cls | 4 cls | 8 cls | 16 cls | 80 cls |
|---|---|---|---|---|---|---|
| ViT-H (full) | 15.7 | 15.1 | 13.5 | 10.9 | 7.7 | 2.4 |
| ViT-H Pruned-16 | 24.9 | 22.4 | 19.1 | 14.4 | 9.3 | 2.6 |
| ViT-H Pruned-22 | 27.1 | 26.7 | 22.1 | 15.9 | 10.1 | — |
| RepViT-M2.3 | 44.6 | 38.9 | 30.2 | 19.7 | 11.3 | 2.7 |
| TinyViT-21M | 47.9 | 41.8 | 31.3 | 20.2 | 11.5 | 2.7 |
| EfficientViT-L2 | 51.8 | 44.9 | 33.4 | 21.2 | 11.8 | 2.7 |
| EfficientViT-L1 | 52.4 | 44.9 | 33.4 | 21.1 | 11.6 | 2.7 |
| Encoder-decoder only (ms) | 7 | 10 | 17 | 35 | 73 | 357 |
All results are on COCO val2017 (5,000 images, 80 classes, 1008×1008 resolution) with the TRT FP16 backbone and encoder-decoder engines on a single RTX 4080. FPS is measured over 100 frames of traffic video in sequential mode; at 80 classes the encoder-decoder is batched in chunks of 16. The teacher uses training-free multi-class detection (no detection training), the students use adapter distillation with a frozen encoder-decoder, and the Pruned models use self-distillation with 16–24 of the 32 ViT blocks removed.
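Since sequential mode runs one backbone pass and one encoder-decoder pass per frame, the FPS columns are roughly the reciprocal of the two latencies combined. This is a back-of-envelope model that ignores small pre/post-processing overheads, which is why it lands slightly above the measured numbers:

```python
def sequential_fps(backbone_ms: float, encdec_ms: float) -> float:
    """Approximate sequential-mode throughput: one backbone pass plus one
    encoder-decoder pass per frame, latencies in milliseconds."""
    return 1000.0 / (backbone_ms + encdec_ms)

# At 4 classes the encoder-decoder takes ~17 ms (table above), so:
# ViT-H (full):    sequential_fps(55, 17) ~ 13.9 FPS (measured: 13.5)
# RepViT-M2.3:     sequential_fps(16, 17) ~ 30.3 FPS (measured: 30.2)
# EfficientViT-L2: sequential_fps(13, 17) ~ 33.3 FPS (measured: 33.4)
```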
## Pruned Backbone
The DART-Pruned models remove 16–24 of the 32 ViT-H blocks and recover quality via self-distillation: the full backbone serves as a frozen teacher while the pruned copy is trained with an MSE loss on FPN features.
### Training

```bash
# 8xH100 via SLURM
srun --ntasks=1 torchrun --nproc_per_node=8 scripts/distill.py \
    --data-dir /path/to/coco/train2017 \
    --checkpoint sam3.pt \
    --phase prune \
    --skip-blocks "5,10,12,14,17,18,19,20,21,22,24,25,26,27,28,30" \
    --epochs 100 --batch-size 32 --lr 1e-4 \
    --output-dir skipblocks_distill
```
### Export and evaluate

```bash
# Export the pruned backbone via the HF path (fused attention kernels)
PYTHONIOENCODING=utf-8 python scripts/export_hf_backbone.py \
    --image x.jpg \
    --output-onnx onnx_hf_backbone_1008_pruned/hf_backbone.onnx \
    --output-engine hf_backbone_1008_pruned_fp16.engine \
    --skip-blocks "5,10,12,14,17,18,19,20,21,22,24,25,26,27,28,30"

# Evaluate on COCO val2017
PYTHONIOENCODING=utf-8 python scripts/eval_coco_official.py \
    --images-dir D:/val2017 \
    --ann-file D:/coco2017labels/coco/annotations/instances_val2017.json \
    --checkpoint sam3.pt \
    --pruned-checkpoint distilled/pruned_16blocks.pt \
    --configs "pruned16_1008=trt:hf_backbone_1008_pruned_fp16.engine;encdec:enc_dec_1008_c16_presence_fp16.engine;imgsz:1008"
```
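The `--configs` argument packs engine paths and options into one string. Its shape, inferred from the example above (`name=key:value;key:value;...`), can be unpacked like this; the eval script's actual parser may differ:

```python
def parse_config_spec(spec: str):
    """Split one --configs entry into its name and key/value options.
    Format inferred from the documented example, not the script's code."""
    name, _, rest = spec.partition("=")
    opts = dict(item.split(":", 1) for item in rest.split(";"))
    return name, opts

name, opts = parse_config_spec(
    "pruned16_1008=trt:hf_backbone_1008_pruned_fp16.engine;"
    "encdec:enc_dec_1008_c16_presence_fp16.engine;imgsz:1008"
)
# name -> "pruned16_1008"; opts["imgsz"] -> "1008"
```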
### Block selection

Blocks were selected by greedy importance analysis (`scripts/analyze_block_importance.py`). Later blocks (17-30) are the least important, while early blocks (0-8) and the global-attention blocks (7, 15, 23, 31) are critical. The pruned checkpoint stores `skip_blocks` metadata, so the pruning pattern is applied automatically at load time.
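The greedy procedure can be sketched as follows. The function and `loss_fn` names are illustrative, not the analysis script's API; with a toy importance profile that protects early and global-attention blocks, the selection reproduces the reported pattern:

```python
def greedy_block_selection(n_blocks, n_remove, loss_fn):
    """Repeatedly remove the block whose removal increases the loss least."""
    removed = set()
    for _ in range(n_remove):
        candidates = [b for b in range(n_blocks) if b not in removed]
        best = min(candidates, key=lambda b: loss_fn(removed | {b}))
        removed.add(best)
    return sorted(removed)

# Toy importance: early blocks (0-8) and global-attention blocks (15, 23, 31)
# are expensive to remove; all remaining blocks are cheap.
importance = [10.0 if b <= 8 or b in (15, 23, 31) else 1.0 for b in range(32)]
skip = greedy_block_selection(32, 16, lambda rem: sum(importance[b] for b in rem))
assert all(b not in skip for b in (0, 7, 15, 23, 31))  # critical blocks survive
```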
## Distilled Student Architecture

Each student model consists of:
- A frozen ImageNet-pretrained backbone from timm (`features_only=True`, 3 stages)
- A trained FPN adapter (3 levels of Conv1x1 + bilinear interpolation + Conv3x3) that maps backbone features to SAM3's expected FPN format: `(B, 256, 288, 288)`, `(B, 256, 144, 144)`, `(B, 256, 72, 72)`
- The original SAM3 encoder-decoder, kept frozen and unchanged
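A minimal sketch of such an adapter. The class name, input channel counts, and stage resolutions are illustrative assumptions, not the repo's code; only the Conv1x1 → bilinear resize → Conv3x3 structure and the output shapes come from the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNAdapterSketch(nn.Module):
    """Projects 3 backbone stages to SAM3's 256-channel FPN levels."""

    def __init__(self, in_channels=(64, 128, 256), out_channels=256,
                 out_sizes=(288, 144, 72)):
        super().__init__()
        self.out_sizes = out_sizes
        # Conv1x1: project each stage to the common channel width
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # Conv3x3: smooth each level after resizing
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, feats):
        outs = []
        for x, proj, smooth, size in zip(feats, self.proj, self.smooth, self.out_sizes):
            x = F.interpolate(proj(x), size=(size, size),
                              mode="bilinear", align_corners=False)
            outs.append(smooth(x))
        return outs

adapter = FPNAdapterSketch()
# Hypothetical stage resolutions for a 1008px input (strides 4, 8, 16)
feats = [torch.randn(1, c, s, s) for c, s in [(64, 252), (128, 126), (256, 63)]]
shapes = [tuple(o.shape) for o in adapter(feats)]
# shapes -> [(1, 256, 288, 288), (1, 256, 144, 144), (1, 256, 72, 72)]
```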
## Usage

### Loading a student model

```python
import torch
from sam3.distillation.sam3_student import build_sam3_student_model

model = build_sam3_student_model(
    backbone_config="repvit_m2_3",  # or efficientvit_l1, efficientvit_l2, tiny_vit_21m
    teacher_checkpoint="sam3.pt",   # SAM3 weights (encoder-decoder)
    device="cuda",
)
ckpt = torch.load("distilled/repvit_m2_3_distilled.pt", map_location="cuda")
model.backbone.student_backbone.load_state_dict(ckpt["student_state_dict"])
model.eval()
```
### Inference

```python
from sam3.model.sam3_multiclass_fast import Sam3MultiClassPredictorFast

predictor = Sam3MultiClassPredictorFast(model, device="cuda")
predictor.set_classes(["person", "car", "dog"])
state = predictor.set_image(image)  # PIL Image
results = predictor.predict(state, confidence_threshold=0.3)
# results: dict with 'boxes', 'scores', 'class_ids', 'class_names'
```
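The output dict is plain Python, so post-processing needs nothing from the framework. A small illustrative helper (not part of the repo) for ranking and printing detections, using the keys documented above:

```python
def top_detections(results, top_k=5):
    """Return the top_k detections as formatted strings, highest score first."""
    order = sorted(range(len(results["scores"])),
                   key=lambda i: results["scores"][i], reverse=True)
    lines = []
    for i in order[:top_k]:
        x1, y1, x2, y2 = results["boxes"][i]
        lines.append(f"{results['class_names'][i]} {results['scores'][i]:.2f} "
                     f"({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    return lines
```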
### COCO evaluation

```bash
PYTHONIOENCODING=utf-8 python scripts/eval_all_students.py
```

This runs `scripts/eval_coco.py` for all four student models and produces `coco_eval_all_students.json`.
## Adapter training
All adapters were trained on COCO train2017 (118K unlabeled images, no annotations used) for 5 epochs with AdamW (lr=1e-3, weight decay=0.01, cosine schedule) using multi-scale MSE loss between student and teacher FPN features (level weights: 0.15, 0.20, 0.65). Training takes approximately 2 GPU-hours on a single RTX 4080.
```bash
python scripts/distill.py \
    --data-dir /path/to/coco/train2017 \
    --checkpoint sam3.pt \
    --backbone repvit_m2_3 \
    --epochs 5 --batch-size 2 --lr 1e-3
```
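The training objective described above can be sketched in a few lines. The level weights come from the recipe; numpy stands in for the actual tensor code, and tiny feature maps stand in for the real 288/144/72 FPN levels:

```python
import numpy as np

def multiscale_fpn_mse(student_feats, teacher_feats, weights=(0.15, 0.20, 0.65)):
    """Weighted sum of per-level MSE between student and teacher FPN features."""
    return sum(w * float(np.mean((s - t) ** 2))
               for w, s, t in zip(weights, student_feats, teacher_feats))

teacher = [np.zeros((1, 256, s, s)) for s in (8, 4, 2)]  # stand-ins for 288/144/72
student = [t + 1.0 for t in teacher]  # off by 1 everywhere at every level
# multiscale_fpn_mse(student, teacher) -> 1.0 (weights sum to 1, per-level MSE is 1)
```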
### Supported backbones

| Config name | timm model | Stages |
|---|---|---|
| efficientvit_l1 | efficientvit_l1.r224_in1k | (0, 1, 2) |
| efficientvit_l2 | efficientvit_l2.r384_in1k | (0, 1, 2) |
| repvit_m2_3 | repvit_m2_3.dist_450e_in1k | (0, 1, 2) |
| tiny_vit_21m | tiny_vit_21m_224.dist_in22k_ft_in1k | (0, 1, 2) |
| vit_base | vit_base_patch16_224.augreg2_in21k_ft_in1k | (0, 1, 2) |
| vit_base_dinov3 | vit_base_patch16_dinov3.lvd1689m | (0, 1, 2) |
## Checkpoint format

Each `.pt` file contains:

```python
{
    "epoch": 5,
    "loss": float,
    "adapter_state_dict": { ... },  # FPN adapter weights only
    "student_state_dict": { ... },  # Full student backbone + adapter state
}
```
## TRT export

```bash
python scripts/export_student_trt.py --models repvit_m2_3 --imgsz 1008
```

This produces ONNX and TRT FP16 engine files. The encoder-decoder is exported separately (split-engine design) to preserve open-vocabulary flexibility.
## Requirements

- PyTorch >= 2.7.0
- timm
- SAM3 checkpoint (`sam3.pt`)
- TensorRT >= 10.9 (for TRT deployment)
## Citation

```bibtex
@article{dart2026,
  title={Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection},
  author={Turkcan, Mehmet Kerem},
  journal={arXiv preprint},
  year={2026}
}
```
## License
The student adapter weights are released under the same license as SAM3. The underlying backbone weights (RepViT, TinyViT, EfficientViT) retain their original licenses from timm.