Instructions to use QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ")
model = AutoModelForImageTextToText.from_pretrained("QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ

SGLang

How to use QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ with Docker Model Runner:
```
docker model run hf.co/QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
```

Output on 2nd Request

by ryanstout - opened Sep 30, 2025

Discussion

ryanstout

Sep 30, 2025

Posting this here (in addition to the Thinking-AWQ model) since I see it on both.

First, thanks so much for uploading an AWQ version of Qwen3-VL (what a great model, too!).

This is a weird one, and I'm guessing it's a bug in vLLM, but I'm posting here in case anyone else runs into it. The first time I make a request, I get a valid response. From the second request onward (once I change the prompt so it's not hitting the cache), I get what looks like random tokens.

I'm running with pipeline parallelism, which may be related:

uv run vllm serve \
    QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ \
    --served-model-name My_Model \
    --enable-expert-parallel \
    --swap-space 16 \
    --max-num-seqs 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --pipeline-parallel-size 7 \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000

Here's an example of a broken response (the prompt was "Describe this image"):

{"id":"chatcmpl-7d3ffdece5914c5db967af956950b1ea","object":"chat.completion","created":1759244626,"model":"My_Model","choices":[{"index":0,"message":{"role":"assistant","content":"```沙滩```","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":151643,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":2770,"total_tokens":2774,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

The response content is:
沙滩
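
The fields that matter in that response can be pulled out with a few lines of Python. This is only a sketch: the `sample` dict is abridged from the response above (null fields dropped), and `summarize` is a hypothetical helper name. It shows that generation ended after 4 completion tokens on a 2770-token prompt, with `finish_reason` "stop" and `stop_reason` 151643. That id is likely `<|endoftext|>` in Qwen tokenizers, but treat that as an assumption to check against the model's tokenizer config.

```python
import json

FENCE = "`" * 3  # three backticks, built here so they do not clash with this code block

# Abridged from the response above: only the fields inspected below are kept.
sample = {
    "model": "My_Model",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": FENCE + "沙滩" + FENCE},
            "finish_reason": "stop",
            "stop_reason": 151643,
        }
    ],
    "usage": {"prompt_tokens": 2770, "total_tokens": 2774, "completion_tokens": 4},
}


def summarize(response: dict) -> dict:
    """Pull out the fields that show how the generation ended."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "stop_reason": choice["stop_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


summary = summarize(sample)
print(json.dumps(summary, ensure_ascii=False, indent=2))
```

Logging this summary for every request makes it easy to spot the point at which replies collapse to a handful of tokens.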

If I restart vLLM and run the request again, it works until I send a new prompt.

Any ideas on this one? Thanks!
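
For anyone trying to reproduce this, the sequence of requests can be scripted. The sketch below uses only the standard library and assumes the server from the command above (port 8000, served model name `My_Model`). The image URL is a placeholder and the prompts are illustrative; `build_payload`, `http_send`, `run_sequence`, and `main` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000/v1/chat/completions"  # matches --port 8000 above
MODEL = "My_Model"  # matches --served-model-name above
# Placeholder image (assumption); substitute the image that triggers the problem.
IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"


def build_payload(prompt: str, image_url: str = IMAGE_URL) -> dict:
    """Build an OpenAI-style chat request with one image part and one text part."""
    return {
        "model": MODEL,
        "max_tokens": 128,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def http_send(payload: dict) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read().decode("utf-8"))


def run_sequence(prompts: list, send: Callable[[dict], dict]) -> list:
    """Send each prompt in order; return (prompt, completion_tokens, content) tuples."""
    results = []
    for prompt in prompts:
        reply = send(build_payload(prompt))
        content = reply["choices"][0]["message"]["content"]
        results.append((prompt, reply["usage"]["completion_tokens"], content))
    return results


def main() -> None:
    # Distinct prompts, so later requests do not simply hit the cache.
    prompts = [
        "Describe this image",
        "List the objects in this image",
        "What colors dominate this image?",
    ]
    for prompt, n_tokens, content in run_sequence(prompts, http_send):
        print(f"{n_tokens:5d} tokens | {prompt!r} -> {content[:80]!r}")
```

Calling `main()` against a freshly started server should show whether the first reply is normal and the later ones degrade. Passing a different `send` function to `run_sequence` allows the same loop to be exercised without a server.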

tclf90

QuantTrio org Sep 30, 2025

Hi ryan, I wasn’t able to reproduce this behavior on my side. It might be related to some subtle issues in the vLLM installation on your machine. If I were in your place, I would:

Try creating a fresh Python virtual environment and reinstall vLLM on a different day (some nightly builds work, while others may not).
Try installing and running in a clean Python venv without uv. While uv usually works, it’s not a silver bullet, and in rare cases it can fail. For vLLM, you can use:

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Hope this gives you a direction to troubleshoot further. Good luck!
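
When comparing a fresh environment against the failing one, it can help to record exactly which builds are installed in each. A minimal sketch using the standard library follows; the package list is an assumption about what is relevant here, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Packages whose versions are likely to matter for this setup (assumption); extend as needed.
PACKAGES = ("vllm", "torch", "transformers")


def installed_versions(packages=PACKAGES) -> dict:
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in both environments and diffing the output gives a concrete starting point, since nightly builds differ from day to day.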

maleal

Oct 1, 2025

Thanks a lot for this quantization! It works fine for me on Blackwell with the following:

uv pip install --upgrade --no-cache-dir torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu130
uv pip install pynccl nvidia-ml-py accelerate qwen-vl-utils git+https://github.com/huggingface/transformers accelerate
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

I do have an issue with tool calling, but I'm not sure whether the problem lies in vLLM or in the Hermes plugin. It does not seem to work when I set tool calling to required. If I don't, and instead ask the model to use a tool, it emits the tool call in the message content, wrapped in <tool_call> XML tags (as per the template, from what I can see).

I believe Hermes should take care of this, but it seems it doesn't.
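
Until the server-side parsing is sorted out, the tool call can be recovered on the client from the message content. The sketch below assumes the Hermes-style convention of a JSON object between `<tool_call>` and `</tool_call>`; `extract_tool_calls` is a hypothetical helper, and `get_weather` is a made-up tool name used only for illustration.

```python
import json
import re

# Assumes Hermes-style output: a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> list:
    """Return every tool call found in the text as a parsed dict."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content):
        try:
            calls.append(json.loads(match.group(1).strip()))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON rather than failing the whole reply.
            continue
    return calls


# Illustrative message content; the tool name and arguments are invented.
example = (
    "Let me check that.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(example)
print(calls)
```

This is a workaround rather than a fix: it leaves `tool_calls` in the API response empty and moves the parsing into application code.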

crystech

Oct 2, 2025

I tried with vLLM and got stuck at:
Loading safetensors checkpoint shards: 100% Completed | 42/42 [01:36<00:00, 2.29s/it]
(Worker_PP0 pid=1855)
(Worker_PP4 pid=1859) INFO 10-02 18:06:06 [gpu_model_runner.py:2573] Model loading took 25.3483 GiB and 107.931375 seconds
(Worker_PP0 pid=1855) INFO 10-02 18:06:06 [default_loader.py:268] Loading weights took 107.25 seconds
(Worker_PP2 pid=1857) INFO 10-02 18:06:07 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 108.916489 seconds
(Worker_PP1 pid=1856) INFO 10-02 18:06:08 [default_loader.py:268] Loading weights took 109.07 seconds
(Worker_PP3 pid=1858) INFO 10-02 18:06:12 [default_loader.py:268] Loading weights took 113.16 seconds
(Worker_PP0 pid=1855) INFO 10-02 18:06:14 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 116.291431 seconds
(Worker_PP1 pid=1856) INFO 10-02 18:06:16 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 117.711829 seconds
(Worker_PP3 pid=1858) INFO 10-02 18:06:20 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 122.089756 seconds
(Worker_PP2 pid=1857) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP1 pid=1856) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP4 pid=1859) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP3 pid=1858) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP0 pid=1855) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.

It won't get past "Encoder cache will be initialized with a budget of 16384 tokens". This is a 5-GPU setup.
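
One way to narrow down a hang like this is to check the last message each pipeline-parallel worker logged. The sketch below does that for a few lines copied from the log above (abridged); `last_message_per_worker` is a hypothetical helper, and the regular expression assumes the log line layout shown in this post. In this excerpt every worker, PP0 through PP4, reaches the encoder cache line, so the stall appears to come after that step on all ranks rather than on a single one.

```python
import re

# Lines copied from the log above (abridged to two loading lines plus the last five).
LOG_EXCERPT = """\
(Worker_PP4 pid=1859) INFO 10-02 18:06:06 [gpu_model_runner.py:2573] Model loading took 25.3483 GiB and 107.931375 seconds
(Worker_PP0 pid=1855) INFO 10-02 18:06:14 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 116.291431 seconds
(Worker_PP2 pid=1857) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP1 pid=1856) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP4 pid=1859) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP3 pid=1858) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP0 pid=1855) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
"""

# Captures: worker name, time of day, and the message after the [file:line] tag.
WORKER_RE = re.compile(r"\(Worker_(PP\d+) pid=\d+\) INFO \S+ (\S+) \[[^\]]+\] (.*)")


def last_message_per_worker(log_text: str) -> dict:
    """Map each worker to the (time, message) of its final log line."""
    last = {}
    for line in log_text.splitlines():
        match = WORKER_RE.match(line)
        if match:
            worker, timestamp, message = match.groups()
            last[worker] = (timestamp, message)  # later lines overwrite earlier ones
    return last


result = last_message_per_worker(LOG_EXCERPT)
for worker in sorted(result):
    timestamp, message = result[worker]
    print(f"{worker} {timestamp} {message[:60]}")
```

If one worker's last message differed from the rest, that rank would be the first place to look.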

maleal

Dec 4, 2025

This comment has been hidden (marked as Resolved)
