Output on 2nd Request
Posting this here (in addition to the Thinking-AWQ model) since I see it on both.
First, thanks so much for uploading an AWQ version of Qwen3-VL (what a great model, too!).
So this is a weird one, and I'm guessing it's a bug in vLLM, but I'm posting here in case anyone else runs into it. The first time I make a request, I get a valid response; from the 2nd request onward I get what looks like random tokens (once I change the prompt so it's not hitting the cache).
I'm running with pipeline parallelism (which may be related):
uv run vllm serve \
QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ \
--served-model-name My_Model \
--enable-expert-parallel \
--swap-space 16 \
--max-num-seqs 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--pipeline-parallel-size 7 \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
Here's an example of a broken response (the prompt was "Describe this image"):
{"id":"chatcmpl-7d3ffdece5914c5db967af956950b1ea","object":"chat.completion","created":1759244626,"model":"My_Model","choices":[{"index":0,"message":{"role":"assistant","content":"```沙滩```","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":151643,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":2770,"total_tokens":2774,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
The response content is: 沙滩 (Chinese for "beach")
If I restart vLLM and run the request again, it works until I send a new prompt.
Any ideas on this one? Thanks!
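For anyone trying to confirm the same pattern (first request fine, later uncached prompts garbled), a small client along these lines may help. It is a minimal sketch under stated assumptions: the URL and model name are taken from the serve command above, while the helper names, the text-only prompts (the original request included an image), and the "fewer than 8 completion tokens" threshold are hypothetical choices made for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # port from the serve command
MODEL = "My_Model"  # matches --served-model-name


def build_payload(prompt, model=MODEL):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def summarize(response):
    """Pull out the fields that matter when comparing responses."""
    choice = response["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "completion_tokens": response["usage"]["completion_tokens"],
    }


def looks_truncated(summary, min_tokens=8):
    """Hypothetical heuristic: a 'stop' after very few tokens is suspicious."""
    return (
        summary["finish_reason"] == "stop"
        and summary["completion_tokens"] < min_tokens
    )


def send(prompt, url=BASE_URL):
    """POST one prompt to the server and return the parsed JSON reply."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as reply:
        return json.loads(reply.read().decode("utf-8"))


def run_check(count=3):
    """Send distinct prompts so later requests do not hit the prefix cache."""
    for index in range(1, count + 1):
        summary = summarize(send(f"Write one sentence about the number {index}."))
        flag = "SUSPICIOUS" if looks_truncated(summary) else "ok"
        print(f"request {index}: {flag} {summary}")
```

Calling `run_check()` against a running server prints one line per request. Applied to the broken response shown above (4 completion tokens, finish reason "stop"), the heuristic would flag it.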
Hi ryan, I wasn’t able to reproduce this behavior on my side. It might be related to some subtle issues in the vLLM installation on your machine. If I were in your place, I would:
Try creating a fresh Python virtual environment and reinstall vLLM on a different day (some nightly builds work, while others may not).
Try installing and running in a clean Python venv without uv. While uv usually works, it’s not a silver bullet, and in rare cases it can fail. For vLLM, you can use:
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Hope this gives you a direction to troubleshoot further. Good luck!
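Since some nightly builds reportedly work while others do not, it can help to record exactly which versions ended up in each environment before and after reinstalling. The snippet below is only a sketch for that purpose; the package list and function names are assumptions, not something taken from this thread.

```python
from importlib import metadata
import platform
import sys

# Packages worth reporting; this list is an assumption for illustration.
PACKAGES = ["vllm", "torch", "transformers"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the interpreter and package versions as plain text."""
    lines = [f"python {platform.python_version()} ({sys.executable})"]
    for name, version in versions.items():
        lines.append(f"{name} {version or 'not installed'}")
    return "\n".join(lines)


print(format_report(installed_versions()))
```

Running it inside both the uv-managed and the plain venv gives two reports that can be compared line by line or pasted into a bug report.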
Thanks a lot for this quantization! It works fine for me on Blackwell with the following:
uv pip install --upgrade --no-cache-dir torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu130
uv pip install pynccl nvidia-ml-py accelerate qwen-vl-utils git+https://github.com/huggingface/transformers
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
I do have an issue with tool calling, but I'm not sure whether the problem is in vLLM or the Hermes plugin. It doesn't seem to work when I set the tool choice to required; if I don't, and instead ask the model to use a tool, it emits the tool call in the message content, wrapped in <tool_call> XML tags (as per the chat template, from what I can see).
I believe Hermes should take care of this, but it seems it doesn't.
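As a client-side stopgap while the server-side parsing is sorted out, the tool call can be pulled out of the message content directly. This is a rough sketch that assumes the text between the <tool_call> tags is a JSON object with "name" and "arguments" keys (the usual Hermes-style layout, not confirmed in this thread); the function name and the example tool are hypothetical. In vLLM, server-side parsing is normally switched on with --enable-auto-tool-choice and --tool-call-parser hermes; whether those flags were set here is not stated.

```python
import json
import re

# Matches the body between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(content):
    """Return a list of {"name", "arguments"} dicts found in the content."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(content):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                }
            )
    return calls


# Hypothetical assistant message with one embedded tool call.
example = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(example))
```

Malformed blocks are skipped rather than raised, so a garbled reply yields an empty list instead of crashing the client.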
I tried with vLLM and got stuck at:
Loading safetensors checkpoint shards: 100% Completed | 42/42 [01:36<00:00, 2.29s/it]
(Worker_PP0 pid=1855)
(Worker_PP4 pid=1859) INFO 10-02 18:06:06 [gpu_model_runner.py:2573] Model loading took 25.3483 GiB and 107.931375 seconds
(Worker_PP0 pid=1855) INFO 10-02 18:06:06 [default_loader.py:268] Loading weights took 107.25 seconds
(Worker_PP2 pid=1857) INFO 10-02 18:06:07 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 108.916489 seconds
(Worker_PP1 pid=1856) INFO 10-02 18:06:08 [default_loader.py:268] Loading weights took 109.07 seconds
(Worker_PP3 pid=1858) INFO 10-02 18:06:12 [default_loader.py:268] Loading weights took 113.16 seconds
(Worker_PP0 pid=1855) INFO 10-02 18:06:14 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 116.291431 seconds
(Worker_PP1 pid=1856) INFO 10-02 18:06:16 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 117.711829 seconds
(Worker_PP3 pid=1858) INFO 10-02 18:06:20 [gpu_model_runner.py:2573] Model loading took 26.5518 GiB and 122.089756 seconds
(Worker_PP2 pid=1857) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP1 pid=1856) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP4 pid=1859) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP3 pid=1858) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_PP0 pid=1855) INFO 10-02 18:06:21 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
It won't get past "Encoder cache will be initialized with a budget of 16384 tokens". This is a 5-GPU setup.