Instructions for using VisuLogic/qwen2_5vl_7b_rloo_80steps_hf with libraries and local apps.

Libraries

How to use VisuLogic/qwen2_5vl_7b_rloo_80steps_hf with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="VisuLogic/qwen2_5vl_7b_rloo_80steps_hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# The pipeline returns a list of results; print it to see the generated text
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("VisuLogic/qwen2_5vl_7b_rloo_80steps_hf")
model = AutoModelForImageTextToText.from_pretrained("VisuLogic/qwen2_5vl_7b_rloo_80steps_hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
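
The slice in the final `print` drops the prompt tokens: for decoder-only models, `generate` returns the prompt token IDs followed by the newly generated ones, so indexing from the prompt length keeps only the reply. Below is a minimal sketch of that indexing with plain Python lists. The token IDs and the `strip_prompt` helper are made up for illustration, and no model is loaded.

```python
# Illustrative token IDs only; real IDs come from the processor and the model.
prompt_ids = [101, 7592, 2088]          # stands in for inputs["input_ids"][0]
generated_ids = prompt_ids + [42, 43]   # stands in for outputs[0]: prompt + new tokens


def strip_prompt(output_ids, prompt_len):
    """Return only the tokens produced after the prompt (hypothetical helper)."""
    return output_ids[prompt_len:]


new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)
```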

Local Apps

vLLM

How to use VisuLogic/qwen2_5vl_7b_rloo_80steps_hf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
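
The same request can also be sent from Python. The sketch below uses only the standard library and assumes a server is already running at the address shown in the curl example. The helper names `build_payload` and `chat` are hypothetical, and the reply is read from `choices[0].message.content`, the usual shape of an OpenAI-compatible chat-completions response.

```python
import json
import urllib.request

MODEL_ID = "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf"


def build_payload(question, image_url, model=MODEL_ID):
    """Build the same chat-completions body as the curl example."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(question, image_url, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(question, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Describe this image in one sentence.", image_url)` should return the model's reply as a string. Pass a different `base_url` (for example `"http://localhost:30000"`) if the server listens on another port.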

Use Docker

docker model run hf.co/VisuLogic/qwen2_5vl_7b_rloo_80steps_hf

SGLang

How to use VisuLogic/qwen2_5vl_7b_rloo_80steps_hf with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VisuLogic/qwen2_5vl_7b_rloo_80steps_hf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use VisuLogic/qwen2_5vl_7b_rloo_80steps_hf with Docker Model Runner:
```
docker model run hf.co/VisuLogic/qwen2_5vl_7b_rloo_80steps_hf
```

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

A Challenging Visual-centric Benchmark for Evaluating Multimodal Reasoning in MLLMs!

This is the Qwen2.5-VL-7B-Instruct-RL model of VisuLogic.

For more details, please refer to the project page with dataset exploration and visualization tools: https://visulogic-benchmark.github.io/VisuLogic/.

VisuLogic Resources

🌐 Homepage | 🏆 Leaderboard | 📖 Paper | 🤗 Benchmark | 🤗 Train Data

💻 Eval Code | 💻 Train Code | 🤗 Checkpoint (7B) | 🤗 Checkpoint (38B)

🔔News

🔥[2025-04-26] VisuLogic has been merged into VLMEvalKit, so you can evaluate your model on VisuLogic with it. See VLMEvalKit for usage. 🚀
🔥[2025-04-22] Released the paper, training data, and training code! 🚀
🔥[2025-04-08] Released the benchmark and the code! 🚀

✅ To-do

Release the benchmark dataset and eval code
Release training code
Release the paper
Release the training dataset
Release model ckpts

📖 Introduction

VisuLogic is a benchmark for evaluating the visual reasoning capabilities of Multi-modal Large Language Models (MLLMs) independently of textual reasoning. It features carefully constructed visual reasoning tasks spanning multiple categories, divided into six types according to the reasoning skill required (e.g., Quantitative Reasoning, which involves understanding and deducing changes in the quantity of elements in images). Unlike existing benchmarks, its problems are inherently difficult to articulate in language, which provides a more rigorous evaluation of the visual reasoning capabilities of MLLMs. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning.
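
To put those accuracy figures in context, here is a small worked calculation. It assumes four answer options per question (implied by the 25% random baseline) and uses the benchmark size of 1,000 questions stated in this card. The binomial spread is a standard estimate for pure guessing, not a figure from the paper, and `margin_over_random` is a hypothetical helper.

```python
import math

NUM_QUESTIONS = 1000   # benchmark size stated in this card
NUM_OPTIONS = 4        # assumption: implied by the 25% random baseline

p_random = 1 / NUM_OPTIONS
expected_correct = NUM_QUESTIONS * p_random
# Standard deviation of the accuracy of pure guessing (binomial model).
std_accuracy = math.sqrt(p_random * (1 - p_random) / NUM_QUESTIONS)


def margin_over_random(accuracy):
    """Accuracy gain over random guessing, in percentage points."""
    return (accuracy - p_random) * 100


print(f"random baseline:  {p_random:.1%} ({expected_correct:.0f} correct)")
print(f"guessing std dev: {std_accuracy:.2%}")
print(f"30% model margin: {margin_over_random(0.30):.1f} points")
print(f"human margin:     {margin_over_random(0.514):.1f} points")
```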

🌟 Key Features

🚀 Visuo-Logical Challenge
The first benchmark to integrate visual perception with logical reasoning, enabling authentic multimodal evaluation.
🛠️ Rigorous Design
Includes 1,000 meticulously curated questions, spanning 6 domains and 24 subcategories, for comprehensive performance evaluation.
📝 Anti-Linguistic Shortcut
Designed to avoid linguistic reasoning, ensuring tasks rely on genuine visual reasoning rather than shortcuts.
💡 RL Exploration
We identify reinforcement learning (RL) as a promising direction for improving the visual reasoning capabilities of MLLMs. With RL training, models reach state-of-the-art (SOTA) results on VisuLogic!
✅ Fully Open-source
We open-source all the evaluation code, training scripts, and datasets associated with this work to promote further research and innovation.
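
Because the questions are organized into domains and subcategories, results are naturally reported per category as well as overall. The following is a minimal sketch of such a breakdown, not the VisuLogic-Eval implementation. The record field names (`category`, `answer`, `prediction`), the sample records, and the "Example Category" label are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Map each category to its fraction of correctly answered questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Made-up records for illustration only; field names are hypothetical.
records = [
    {"category": "Quantitative Reasoning", "answer": "A", "prediction": "A"},
    {"category": "Quantitative Reasoning", "answer": "C", "prediction": "B"},
    {"category": "Example Category", "answer": "D", "prediction": "D"},
]

per_category = accuracy_by_category(records)
print(per_category)
```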

🖼️ Examples of VisuLogic

📊 Eval

Please refer to VisuLogic-Eval for eval code.

📦 Training

Please refer to VisuLogic-Train for training code.

📩 Contact

Weiye Xu: ustcxwy0271@mail.ustc.edu.cn
Jiahao Wang: wjhwdscience@stu.xjtu.edu.cn

📜 Citation

BibTeX:

@article{xu2025visulogic,
  title={VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models},
  author={Xu, Weiye and Wang, Jiahao and Wang, Weiyun and Chen, Zhe and Zhou, Wengang and Yang, Aijun and Lu, Lewei and Li, Houqiang and Wang, Xiaohua and Zhu, Xizhou and Wang, Wenhai and Dai, Jifeng and Zhu, Jinguo},
  journal={arXiv preprint arXiv:2504.15279},
  year={2025},
  url={https://arxiv.org/abs/2504.15279}
}

🎉 Thank you for your interest in VisuLogic! We hope this benchmark helps drive advancements in multimodal visual reasoning! 🚀