Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
Pretrained weights of Uni-NaVid (RSS 2025).
Paper: https://arxiv.org/pdf/2412.06224
The model is trained on samples collected from the training splits of VLN-CE R2R and RxR, EVT-Bench, ObjectNav, and EQA.
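As a minimal sketch, the weights can be fetched with `huggingface_hub`; the `repo_id` below is a placeholder, so substitute the actual repository name of this model card:

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id -- replace with this model card's actual repository name.
local_path = snapshot_download(
    repo_id="ORGANIZATION/Uni-NaVid",
    local_dir="./uni-navid-weights",  # where the checkpoint files are placed
)
print(f"Pretrained weights downloaded to: {local_path}")
```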
| Evaluation Benchmark | TL | NE | OS | SR | SPL |
|---|---|---|---|---|---|
| VLN-CE R2R val | 9.22 | 4.96 | 57.4 | 51.8 | 47.7 |
| VLN-CE RxR val | 18.4 | 5.67 | 64.4 | 66.4 | 44.5 |

TL: trajectory length (m); NE: navigation error (m); OS: oracle success rate (%); SR: success rate (%); SPL: success weighted by path length (%).
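Since SPL weights success by path efficiency, a short sketch of the standard formulation (Anderson et al., 2018) may help interpret the numbers; the function and argument names here are illustrative, not part of the released code:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    successes: per-episode 0/1 success indicators
    shortest_lengths: geodesic shortest-path distance to the goal (m)
    path_lengths: length of the path the agent actually took (m)
    """
    n = len(successes)
    return 100.0 * sum(
        s * (l / max(p, l))  # efficiency term penalizes paths longer than optimal
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ) / n
```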
The related inference code can be found here.