How to use Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
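The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM, and the response layout (`choices[0].message.content`) is assumed to follow the OpenAI chat-completions format that the server advertises.

```python
import json
import urllib.request

MODEL_ID = "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url, payload, timeout=120):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the server answers in the OpenAI chat-completions layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", build_chat_payload("Describe this image in one sentence.", image_url))` should return the model's reply as a string, where `image_url` is any image URL the server can reach.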
How to use Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
# Alternatively, start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
How to use Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ with Docker Model Runner:
docker model run hf.co/Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ
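The curl calls above pass a public image URL. For a local file, a common approach with OpenAI-compatible servers is to embed the image as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs; the helper names are illustrative.

```python
import base64
import mimetypes


def image_file_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_content_part(path):
    """Build the image_url content entry used in the chat requests."""
    return {"type": "image_url", "image_url": {"url": image_file_to_data_url(path)}}
```

The dictionary returned by `image_content_part` can replace the `image_url` entry in the `content` list of the request body.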
This is a 4-bit quantized version of the Qwen/Qwen3-VL-2B-Instruct vision-language model.
It was quantized with Intel's AutoRound algorithm, which tunes the weight rounding for 800 iterations to minimize quantization loss. The vision tower is kept in its original FP16 precision, so the visual encoder itself is not quantized; this is intended to preserve visual capabilities such as OCR, spatial reasoning, and chart analysis.
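To make the idea of 4-bit weights with 16-bit activations concrete, here is a minimal sketch of weight-only quantization on a single group of weights. It uses plain symmetric round-to-nearest, which is not AutoRound: AutoRound additionally tunes how each weight is rounded. The function names are illustrative, and Python floats stand in for the 16-bit activations.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes in [-8, 7]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per group
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def w4a16_dot(activations, codes, scale):
    """Dot product with quantized weights and full-precision activations."""
    weights = dequantize_group(codes, scale)
    return sum(a * w for a, w in zip(activations, weights))
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), and only the integer codes plus one scale per group need to be stored. Rounding schemes such as AutoRound aim to reduce the effect of that error on the layer's output rather than on each weight individually.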
Quantization: W4A16 (4-bit weights, 16-bit activations)
To use this model in its native AutoRound format, you need the auto-round library.
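# Install the dependencies (qwen-vl-utils provides process_vision_info):
pip install auto-round transformers torch qwen-vl-utils

from transformers import AutoModelForImageTextToText, AutoProcessor
from auto_round import AutoRoundConfig  # unused directly; kept so the auto-round format is available when loading
from qwen_vl_utils import process_vision_info

model_id = "Vishva007/Qwen3-VL-2B-Instruct-W4A16-AutoRound-GPTQ"

# Load the model and processor. Qwen3-VL is an image-text-to-text model,
# so it is loaded with AutoModelForImageTextToText rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Prepare the input: one user turn with an image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated text is decoded
new_token_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_token_ids, skip_special_tokens=True)[0])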
@article{cheng2023optimize,
title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao},
journal={arXiv preprint arXiv:2309.05516},
year={2023}
}
Base model
Qwen/Qwen3-VL-2B-Instruct