# Phi-4-multimodal-instruct W4A16 GPTQ
GPTQ W4A16 quantization of microsoft/Phi-4-multimodal-instruct, a 5.6B-parameter multimodal model by Microsoft supporting text, vision (images), and audio inputs.

Quantized with llm-compressor on an RTX 5090. Weights are stored in the compressed-tensors format, which vLLM loads natively.

License: MIT (© Microsoft Corporation). This quantization carries the same MIT license as the original model.
## Why this quantization?
| | bf16 safetensors | GGUF (Q4_K_M + mmproj) | This model (W4A16 GPTQ) |
|---|---|---|---|
| Size | ~14 GB | 2.37 GB + 825 MB | ~5–6 GB |
| Text | ✅ | ✅ | ✅ |
| Vision (images) | ✅ | ✅ | ✅ |
| Audio / Speech | ✅ | ❌ | ✅ |
| Serves with | vLLM | llama.cpp / LM Studio | vLLM |
| Quantization method | none | GGML int4 | GPTQ int4 (W4A16) |
The GGUF files in Swicked86/phi4-mm-gguf are smaller but lack audio. This model is the sweet spot: all three modalities at roughly a third of the bf16 size.
## Available Files
| File | Size | Notes |
|---|---|---|
| `model-00001-of-00002.safetensors` | ~3 GB | Quantized weight shard 1 |
| `model-00002-of-00002.safetensors` | ~2 GB | Quantized weight shard 2 |
| `config.json` | – | Includes `quantization_config`; vLLM auto-detects |
| `tokenizer.model` / `tokenizer.json` | – | Tokenizer |
| `preprocessor_config.json` | – | Vision + audio processor config (bf16 encoders) |
The SigLIP-400M vision encoder and the conformer-based speech encoder are stored at full bfloat16 precision; only the Phi3 text transformer weights (32 decoder layers) are quantized to int4.
## VRAM Requirements
Model weights occupy 6.11 GiB (measured). The remaining VRAM is used by the KV cache: vLLM pre-allocates the full KV cache pool at startup based on `--gpu-memory-utilization` and `--max-model-len`. Total VRAM allocated = weights + pre-allocated KV cache, regardless of how many requests are active.
### By GPU tier
| GPU | VRAM | Recommended `--max-model-len` | `--gpu-memory-utilization` | Notes |
|---|---|---|---|---|
| RTX 3070 / 2080 Super | 8 GB | – | – | ⚠️ Not recommended. Weights alone are 6.1 GB; insufficient headroom for KV cache. |
| RTX 3080 10 GB / 2080 Ti | 10 GB | 16,384 | 0.85 | Minimum viable. Tight; use the lowest context only. |
| RTX 3080 12 GB / 4070 | 12 GB | 16,384–32,768 | 0.85 | Comfortable at 16K; 32K fits with care. |
| RTX 3080 Ti / 4070 Ti / 4080 | 16 GB | 32,768–65,520 | 0.85–0.90 | Good balance of context and headroom. |
| RTX 3090 / 4090 / 4080 Super | 24 GB | 65,520 | 0.85–0.90 | Recommended. Full tested context, comfortable. |
| RTX 5090 / A6000 / A100 40 GB | 32+ GB | 65,520–131,072 | 0.45–0.90 | Plenty of headroom; lower utilization keeps VRAM free for other tasks. |
### By context length
| `--max-model-len` | Weights | KV cache (est.) | Total (est.) | Min GPU VRAM |
|---|---|---|---|---|
| 16,384 | ~6.1 GB | ~1.5 GB | ~8 GB | 10 GB |
| 32,768 | ~6.1 GB | ~3.0 GB | ~10 GB | 12 GB |
| 65,520 | ~6.1 GB | ~6.0 GB | ~13 GB | 16 GB |
| 131,072 (max, untested) | ~6.1 GB | ~12.0 GB | ~19 GB | 24 GB |
KV cache estimates use the Phi-4-Mini architecture (32 layers, 8 KV heads, head_dim 96, bf16 activations ≈ 96 KB/token). Add ~1–2 GB for framework overhead. Weights were measured on an RTX 5090 with vLLM.
**Why does vLLM show higher usage than the "total est." above?** vLLM pre-allocates the entire KV cache pool at startup. On a large GPU (e.g. 32 GB at `--gpu-memory-utilization 0.45`), it reserves 0.45 × 32 GB ≈ 14 GB for KV cache even if no requests are active. The table above shows the minimum needed, not what vLLM will allocate when given more headroom.
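The per-token figure and the KV-cache column above can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch using the architecture numbers quoted above; it ignores vLLM's block rounding and framework overhead:

```python
# Back-of-envelope KV-cache sizing for Phi-4-multimodal's text decoder.
# Assumed architecture (from the estimate above): 32 layers, 8 KV heads,
# head_dim 96, bf16 (2 bytes/value), K and V both cached.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES, K_AND_V = 32, 8, 96, 2, 2

bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * K_AND_V
print(f"{bytes_per_token / 1024:.0f} KB/token")  # 96 KB/token

for ctx in (16_384, 32_768, 65_520, 131_072):
    gib = ctx * bytes_per_token / 1024**3
    print(f"{ctx:>7} tokens -> {gib:4.1f} GiB KV cache")
```

The resulting 1.5 / 3.0 / 6.0 / 12.0 GiB figures are exactly the "KV cache (est.)" column in the context-length table.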
## Usage
### Step 1 – Install vLLM

Requirements: Python 3.10+, a CUDA GPU with ≥ 8 GB VRAM, vLLM 0.9.0+

```bash
pip install vllm
```

Do **not** install `auto-gptq` or pass `--quantization gptq`. This model uses the `compressed-tensors` format, which vLLM handles automatically from `config.json`.
### Step 2 – Download the model

```bash
huggingface-cli download Swicked86/phi4-mm-gptq --local-dir ./phi4-mm-gptq
```

Or let vLLM download on first run by passing the repo ID directly (see Step 3).
### Step 3 – Launch the vLLM server

Minimum working command (text + vision + audio, 8–10 GB GPU):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm
```
Extended context command (16 GB+ GPU, 65K context):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm
```
Flag reference:

| Flag | Value | Why |
|---|---|---|
| `--model` | path or `Swicked86/phi4-mm-gptq` | Local dir or HF repo ID |
| `--dtype bfloat16` | required | Model-native dtype; do not use float16 |
| `--trust-remote-code` | required | phi4-mm uses custom modeling code |
| `--max-model-len` | 16384–131072 | See the VRAM tables above. 65,520 = 16 × 4,095 (vLLM block-aligned). Beyond 65,520 is untested with this quantization; proceed at your own risk |
| `--gpu-memory-utilization` | 0.45–0.90 | Fraction of GPU VRAM to reserve for weights + KV cache |
| `--enable-lora` | required for vision/audio | Activates the rank-320 LoRA adapters |
| `--max-lora-rank 320` | required | phi4-mm LoRAs are rank 320 (unusually large) |
| `--lora-modules` | `speech=...` `vision=...` | Points to the adapter subdirectories; enables those modalities |
| `--limit-mm-per-prompt` | `{"image":3,"audio":3}` | Max attachments per message |
| `--tool-call-parser phi4_mini_json` | optional | phi4-mm emits the `functools[...]` format; this parses it |
| `--served-model-name phi4-mm` | optional | Alias so clients can use `"model": "phi4-mm"` |

No `--quantization` flag is needed: vLLM reads `quantization_config` from `config.json` and activates the `compressed-tensors` int4 kernels automatically.
Wait for "Application startup complete", then:

```bash
curl http://localhost:8080/health   # → {"status":"ok"}
```
### Text (Python – openai SDK)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
### Vision – image understanding (Python)

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
curl:

```bash
IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"
```
### Audio – speech transcription / understanding (Python)

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
phi4-mm uses a custom conformer-based audio encoder (24 conformer blocks) with a rank-320 speech LoRA applied to the language decoder; no separate ASR model is needed. Supported formats: `wav`, `mp3`, `ogg`, `flac`.
curl:

```bash
AUDIO_B64=$(base64 -w0 audio.wav)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"${AUDIO_B64}\", \"format\": \"wav\"}},
        {\"type\": \"text\", \"text\": \"Transcribe and summarise.\"}
      ]
    }],
    \"max_tokens\": 512
  }"
```
### Combined – image + audio in one prompt

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Answer the spoken question about the image."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
## Tool calling

phi4-mm emits tool calls as `functools[{"name":"...","arguments":{...}}]`. The `--tool-call-parser phi4_mini_json` flag (vLLM 0.7+) handles this automatically.

For a complete chat template that injects tools into phi4-mm's native `<|tool|>...<|/tool|>` block, see `deploy/wsl-vllm/phi4-mm-tool-template.jinja` in the companion repo.
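To see what the parser has to deal with, here is an illustrative client-side sketch of extracting tool calls from the raw output string. This is not vLLM's actual implementation, and `parse_functools` is a hypothetical helper; it just shows the `functools[...]` shape described above:

```python
import json


def parse_functools(text):
    """Extract tool calls from a phi4-mm style string of the form
    functools[{"name": "...", "arguments": {...}}, ...].
    Returns a list of {"name", "arguments"} dicts, or [] if the
    text is not a tool call."""
    prefix = "functools"
    stripped = text.strip()
    if not stripped.startswith(prefix):
        return []
    try:
        # Everything after the "functools" prefix is a JSON array.
        calls = json.loads(stripped[len(prefix):])
    except json.JSONDecodeError:
        return []
    return calls if isinstance(calls, list) else []


calls = parse_functools('functools[{"name": "get_weather", "arguments": {"city": "Paris"}}]')
print(calls[0]["name"])       # get_weather
print(calls[0]["arguments"])  # {'city': 'Paris'}
```

With `--tool-call-parser phi4_mini_json` enabled, vLLM performs this extraction server-side and returns structured `tool_calls` in the OpenAI response instead.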
## Load locally with Transformers

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "Swicked86/phi4-mm-gptq"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
Requires `pip install llmcompressor` (or `pip install compressed-tensors`) to load the `quantization_config` from the checkpoint.
## Quality

### Inference Tests

All tests were run via vLLM on an RTX 5090 with `--gpu-memory-utilization 0.45`.
**Text – factual recall (< 0.5 s):**
Prompt: `"What is the capital of France?"` → Response: `"The capital of France is Paris."` ✅

**Text – math reasoning (0.748 s):**
Prompt: `"Solve step by step: If a train travels 120 miles in 2 hours, what is its speed in km/h?"` → Response: step-by-step solution ending in `96.54 km/h` ✅

**Text – code generation (1.068 s):**
Prompt: `"Write a Python function that checks if a string is a palindrome."` → Response: a correct `is_palindrome()` with docstring + example calls ✅

**Vision – real image (3000×4000 JPEG):**
Prompt: `"Describe what you see in this image in detail."` → Response: correctly identified an anime figure on a TV screen and described the room, entertainment setup, and animation style ✅

**Audio – real voice message (Discord OGG Opus, converted to 16 kHz WAV):**
Input: an ~11 s Discord voice message discussing software development. Response: "The speaker is describing the process of transforming the rag function into a function that uses a local database rather than writing to and from files." ✅

Audio note: Discord voice messages are OGG Opus at 48 kHz. Convert to 16 kHz mono WAV before sending for best results, and pass `"format": "wav"` in the request.
**vLLM LoRA note:** vLLM currently applies LoRA only to the language-model layers. Vision encoder LoRA layers (SigLIP) are silently skipped; this is a vLLM limitation. The speech LoRA (language decoder, rank 320) loaded and applied correctly.
### Perplexity (wikitext-2-raw, context 512)
| Model | PPL | vs bf16 |
|---|---|---|
| bf16 (baseline) | 14.9338 ± 0.107 | – |
| W4A16 GPTQ (this model) | pending | pending |
Benchmark will be added after upload.
## Quantization Details
| Item | Value |
|---|---|
| Quantizer | llm-compressor (`GPTQModifier`) |
| Scheme | W4A16 (int4 weights, bfloat16 activations) |
| Group size | 128 |
| Sequential targets | `Phi3DecoderLayer` (32 × Phi3 text transformer blocks) |
| Excluded (kept bf16) | `lm_head`, `model.embed_tokens_extend.*` (covers the SigLIP-400M vision encoder + conformer-based audio encoder) |
| Calibration | 512 samples, wikitext-2 |
| Source model | `~/phi4-mm-hf` (safetensors, downloaded from HF) |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |
| Script | `scripts/quantize_phi4mm.py` |
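For reference, a recipe matching the settings above might look like the following llm-compressor YAML sketch. This is illustrative only: field names follow `GPTQModifier`'s recipe schema, but the actual `scripts/quantize_phi4mm.py` may differ in details.

```yaml
# Hypothetical llm-compressor recipe approximating the table above.
quant_stage:
  quant_modifiers:
    GPTQModifier:
      sequential_targets: ["Phi3DecoderLayer"]
      ignore: ["lm_head", "re:model.embed_tokens_extend.*"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
```

The `ignore` patterns are what keep the vision and audio encoders (everything under `model.embed_tokens_extend.*`) at bf16 while the decoder's Linear layers are quantized to int4.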
## Architecture
| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech (Conformer + rank-320 LoRA) |
| Text decoder | 32 × `Phi3DecoderLayer`, quantized to int4 |
| Vision encoder | SigLIP2 (`embed_tokens_extend.image_embed`), bf16 |
| Audio encoder | Conformer-based (24 blocks, 460 M) + rank-320 speech LoRA, bf16 |
## Related

- Swicked86/phi4-mm-gguf – GGUF quantizations (text + vision, no audio)
- microsoft/Phi-4-multimodal-instruct – original bf16 model
- llm-compressor – quantization tool
- vLLM – inference engine