TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
- Paper: arXiv:2603.08096
- Project Page: cwru-aism.github.io/triangulang
- Code: github.com/bryceag11/triangulang
- Training Data & Caches: huggingface.co/datasets/bag100/triangulang-scannetpp-cache
Bryce Grant, Aryeh Rothenberg, Atri Banerjee, Peng Wang

Case Western Reserve University
Overview
TrianguLang is a feed-forward, pose-free method for language-guided 3D localization from multi-view images. Given unposed images and a text query, it produces per-view segmentation masks and camera-relative 3D locations at ~18 FPS for 5 classes.
Checkpoints
| Checkpoint | Description |
|---|---|
| `checkpoints/ma_v10_config_245s_100ep/best.pt` | Single-object (v10), ScanNet++ in-domain (62.4 mIoU) |
| `checkpoints/mo_v11_text_spatial_245s_8v_100ep/best.pt` | Multi-object (text + spatial) |
| `checkpoints/gasa_generalist/best.pt` | Generalist for zero-shot open-vocab benchmarks (uCO3D / LERF-OVS / 3D-OVS / Mip-NeRF360) |
| `checkpoints/gasa_E_box_camframe_230s_100ep_bs8/best.pt` | Strongest ScanNet++ (74.3 mIoU), camera-frame |
Each checkpoint directory also contains `last.pt` (for resuming training) and `config.json`.
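As a minimal sketch of the directory layout described above, the helper below builds the three expected file paths for a checkpoint directory and reads `config.json` as plain JSON. The function names `checkpoint_files` and `load_config` are hypothetical and not part of the TrianguLang code; nothing is assumed about the keys inside `config.json` or about how the `.pt` files are loaded.

```python
import json
from pathlib import Path


def checkpoint_files(run_dir):
    """Return the files this README says each checkpoint directory contains."""
    run_dir = Path(run_dir)
    return {
        "best": run_dir / "best.pt",        # weights listed in the table
        "last": run_dir / "last.pt",        # for resuming training
        "config": run_dir / "config.json",  # run configuration
    }


def load_config(run_dir):
    """Read config.json as a dict; raises FileNotFoundError if it is missing."""
    with open(checkpoint_files(run_dir)["config"], "r", encoding="utf-8") as f:
        return json.load(f)


files = checkpoint_files("checkpoints/gasa_generalist")
print(files["best"].as_posix())  # checkpoints/gasa_generalist/best.pt
```

Checking `files["best"].exists()` before loading is a cheap way to catch a partially downloaded checkpoint directory.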
Architecture
- Frozen: SAM3 (841M) + DA3-NESTED-GIANT-LARGE (1.69B) = ~2.5B params
- Trainable: GASA Decoder (~13.5M params)
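
The split above implies that only a small fraction of the full model is trained. The arithmetic can be sketched as follows, using only the counts quoted in the list; the decoder size is given as approximate, so the resulting share is approximate too.

```python
# Parameter counts as quoted in the Architecture list.
FROZEN = {"SAM3": 841e6, "DA3-NESTED-GIANT-LARGE": 1.69e9}
TRAINABLE = {"GASA Decoder": 13.5e6}  # quoted as "~13.5M", so approximate

frozen_total = sum(FROZEN.values())
trainable_total = sum(TRAINABLE.values())
trainable_share = trainable_total / (frozen_total + trainable_total)

print(f"frozen: {frozen_total / 1e9:.2f}B")        # frozen: 2.53B
print(f"trainable share: {trainable_share:.2%}")   # trainable share: 0.53%
```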
Results
Single-Object (text-only)
| Benchmark | Setting | mIoU | mAcc / Loc. Acc. |
|---|---|---|---|
| ScanNet++ | In-domain | 62.4% | 77.4% mAcc |
| uCO3D | In-domain | 94.6% | 98.3% mAcc |
| uCO3D | Cross-domain (ScanNet++ → uCO3D) | 75.7% | 79.6% mAcc |
| LERF-OVS | Zero-shot (no LERF training) | 59.2% | 89.1% Loc. Acc. |
| NVOS | Zero-shot | 93.5% | — |
| SPIn-NeRF | Zero-shot | 91.4% | — |
Multi-Object (text-only, ScanNet++)
| Setting | mIoU | mAcc |
|---|---|---|
| Text-only (multi-object) | 65.2% | 79.1% |
LERF-OVS Per-Scene (zero-shot)
| Method | Ramen | Teatime | Kitchen | Figurines | Overall mIoU | Overall Loc. Acc. |
|---|---|---|---|---|---|---|
| LERF | 28.2 | 45.0 | 37.9 | 38.6 | 37.4 | 73.6 |
| LangSplat | 51.2 | 65.1 | 44.5 | 44.7 | 51.4 | 84.3 |
| LangSplat-V2 | 51.8 | 72.2 | 59.1 | 56.4 | 59.9 | 84.1 |
| TrianguLang | 51.1 | 58.9 | 62.4 | 62.1 | 59.2 | 89.1 |
Note: Per-scene methods (LERF, LangSplat) require calibrated poses and 10-45 min of per-scene optimization. TrianguLang runs feed-forward in ~58 ms.
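
To put the two timings in the note on a common scale, the sketch below divides the quoted per-scene optimization range by the quoted feed-forward time. This is a rough back-of-the-envelope ratio only: it compares one-time optimization against a single forward pass and ignores any per-query cost of the per-scene methods, which the note does not report.

```python
FEED_FORWARD_S = 0.058      # "~58 ms" from the note
PER_SCENE_MIN = (10, 45)    # "10-45 min" optimization range from the note

# Ratio of per-scene optimization time to one feed-forward pass.
ratios = tuple(minutes * 60 / FEED_FORWARD_S for minutes in PER_SCENE_MIN)
print(f"{ratios[0]:,.0f}x to {ratios[1]:,.0f}x")
```

With the quoted figures the ratio falls roughly between 10,000x and 47,000x, about four orders of magnitude.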
Citation
@article{grant2026triangulang,
title={TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization},
author={Grant, Bryce and Rothenberg, Aryeh and Banerjee, Atri and Wang, Peng},
journal={arXiv preprint arXiv:2603.08096},
year={2026}
}