Vero-Qwen3T-8B
Vero is a family of open models trained with reinforcement learning (RL) for general visual reasoning. The release includes models, data, evaluation, and training code for broad multimodal reasoning across charts, STEM, spatial reasoning, knowledge, grounding, counting, and instruction following.
Models
| Model | HF repo | Base model | Params |
|---|---|---|---|
| Vero-Qwen3I-8B | gsarch/Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B |
| Vero-Qwen3T-8B | gsarch/Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B |
| Vero-MiMo-7B | gsarch/Vero-MiMo-7B | MiMo-VL-7B-SFT-2508 | 7B |
| Vero-Qwen25-7B | gsarch/Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B |
Highlights
- Fully open release of models, training code, evaluation, and the `Vero-600K` dataset.
- 600K curated RL samples from 59 datasets across 6 visual reasoning categories.
- Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
- SOTA 8B on `VeroEval`, a 30-benchmark suite for general visual reasoning.
- Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.
Usage
Example for gsarch/Vero-Qwen3T-8B:
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "gsarch/Vero-Qwen3T-8B"

# Vero-Qwen3T-8B is built on Qwen3-VL-8B-Thinking, so load it through the
# Auto class, which picks the matching architecture from the checkpoint
# config, instead of the Qwen2.5-VL model class.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

# Render the chat template to a prompt string and load the referenced images.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)

# Drop the prompt tokens so that only the newly generated text is decoded.
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)
```
Vero models generate a reasoning trace in <think> tags followed by a final answer in <answer> tags. For downstream use, parse the final response from <answer>.
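Parsing the final response can be done with a small regular expression. The sketch below is one possible approach, not part of the Vero codebase: the helper name `extract_answer` and the choice to return the last `<answer>` block (and `None` when no block is found, for example if generation was cut off by the token limit) are assumptions made for illustration.

```python
import re
from typing import Optional

# Non-greedy match across newlines, since answers may span several lines.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> block, or None.

    Hypothetical helper: the last block is used on the assumption that a
    model that revises itself puts its final answer last.
    """
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


sample = "<think>The tallest bar is at 2020.</think><answer>2020</answer>"
print(extract_answer(sample))  # 2020
```

Returning `None` instead of raising lets the caller decide how to handle truncated outputs, such as retrying with a larger `max_new_tokens`.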
Recommended sampling parameters, following the Qwen3.5 defaults:
- Thinking mode for general tasks: `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`, `max_new_tokens=16384`.
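As a rough illustration of applying the recommended values above, the sketch below builds a chat-completions request body for an OpenAI-compatible server such as vLLM. Several things here are assumptions rather than part of the Vero release: the helper name `build_chat_request`, the placeholder image URL, the mapping of `max_new_tokens` to the `max_tokens` field, and the expectation that the server accepts `top_k`, `min_p`, and `repetition_penalty` as top-level request fields. Check the documentation of the serving stack in use before relying on it.

```python
import json

# Sampling values copied from the recommendation above.
RECOMMENDED_SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
}


def build_chat_request(model, image_url, question, max_tokens=16384):
    """Build a JSON-serializable chat-completions body (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        # Assumed mapping: max_new_tokens corresponds to max_tokens here.
        "max_tokens": max_tokens,
    }
    payload.update(RECOMMENDED_SAMPLING)
    return payload


request = build_chat_request(
    "gsarch/Vero-Qwen3T-8B",
    "https://example.com/chart.png",  # placeholder URL
    "What is the x axis value with the largest population?",
)
print(json.dumps(request, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to the server's `/v1/chat/completions` endpoint.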
Citation
@article{sarch2026vero,
title = {Vero: An Open RL Recipe for General Visual Reasoning},
author = {Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
year = {2026},
journal = {arXiv preprint arXiv:2604.04917},
}
License
Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.