Phi-4-multimodal-instruct W4A16 GPTQ

GPTQ W4A16 quantization of microsoft/Phi-4-multimodal-instruct β€” a 5.6 B parameter multimodal model by Microsoft supporting text, vision (images), and audio inputs.

Quantized with llm-compressor on RTX 5090. Weights stored in compressed-tensors format β€” natively loaded by vLLM.

License: MIT β€” Β© Microsoft Corporation. This quantization carries the same MIT license as the original model.


Why this quantization?

|  | bf16 safetensors | GGUF (Q4_K_M + mmproj) | This model (W4A16 GPTQ) |
|---|---|---|---|
| Size | ~14 GB | 2.37 GB + 825 MB | ~5–6 GB |
| Text | ✅ | ✅ | ✅ |
| Vision (images) | ✅ | ✅ | ✅ |
| Audio / Speech | ✅ | ❌ | ✅ |
| Serves with | vLLM | llama.cpp / LM Studio | vLLM |
| Quantization method | none | GGML int4 | GPTQ int4 (W4A16) |

The GGUF files in Swicked86/phi4-mm-gguf are smaller but lack audio. This model is the sweet spot: all three modalities at roughly β…“ the size of bf16.


Available Files

| File | Size | Notes |
|---|---|---|
| model-00001-of-00002.safetensors | ~3 GB | Quantized weight shard 1 |
| model-00002-of-00002.safetensors | ~2 GB | Quantized weight shard 2 |
| config.json | — | Includes quantization_config — vLLM auto-detects |
| tokenizer.model / tokenizer.json | — | Tokenizer |
| preprocessor_config.json | — | Vision + audio processor config (bf16 encoders) |

The SigLIP-400M vision encoder and conformer-based speech encoder are stored at full bfloat16 precision β€” only the Phi3 text transformer weights (32 decoder layers) are quantized to int4.


VRAM Requirements

Model weights occupy 6.11 GiB (measured). The remaining VRAM is used by the KV cache β€” vLLM pre-allocates the full KV cache pool at startup based on --gpu-memory-utilization and --max-model-len. Total VRAM allocated = weights + pre-allocated KV cache, regardless of how many requests are active.

By GPU tier

| GPU | VRAM | Recommended --max-model-len | --gpu-memory-utilization | Notes |
|---|---|---|---|---|
| RTX 3070 / 2080 Super | 8 GB | — | — | ⚠️ Not recommended. Weights alone are 6.1 GB; insufficient headroom for KV cache. |
| RTX 3080 10 GB / 2080 Ti | 10 GB | 16,384 | 0.85 | Minimum viable. Tight — use lowest context only. |
| RTX 3080 12 GB / 4070 | 12 GB | 16,384–32,768 | 0.85 | Comfortable at 16K; 32K fits with care. |
| RTX 3080 Ti / 4070 Ti / 4080 | 16 GB | 32,768–65,520 | 0.85–0.90 | Good balance of context and headroom. |
| RTX 3090 / 4090 / 4080 Super | 24 GB | 65,520 | 0.85–0.90 | Recommended. Full tested context, comfortable. |
| RTX 5090 / A6000 / A100 40 GB | 32+ GB | 65,520–131,072 | 0.45–0.90 | Plenty of headroom; lower utilization keeps VRAM free for other tasks. |

By context length

| --max-model-len | Weights | KV cache (est.) | Total (est.) | Min GPU VRAM |
|---|---|---|---|---|
| 16,384 | ~6.1 GB | ~1.5 GB | ~8 GB | 10 GB |
| 32,768 | ~6.1 GB | ~3.0 GB | ~10 GB | 12 GB |
| 65,520 | ~6.1 GB | ~6.0 GB | ~13 GB | 16 GB |
| 131,072 (max, untested) | ~6.1 GB | ~12.0 GB | ~19 GB | 24 GB |

KV cache estimates use Phi-4-Mini architecture (32 layers, 8 KV heads, head_dim 96, bf16 activations β‰ˆ 96 KB/token). Add ~1–2 GB for framework overhead. Weights measured on RTX 5090 with vLLM.
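The per-token figure above can be reproduced with a few lines of arithmetic; a minimal sketch using the Phi-4-Mini decoder constants (32 layers, 8 KV heads, head_dim 96, bf16 = 2 bytes per value):

```python
# Back-of-envelope KV-cache sizing for the Phi-4-Mini decoder.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_BF16 = 32, 8, 96, 2

def kv_cache_bytes_per_token() -> int:
    # K and V each store kv_heads * head_dim values per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16

def kv_cache_gb(max_model_len: int) -> float:
    return kv_cache_bytes_per_token() * max_model_len / 1024**3

print(kv_cache_bytes_per_token() / 1024)  # → 96.0  (KB per token)
print(round(kv_cache_gb(65_520), 2))      # → 6.0   (GB at 65,520 tokens)
```

This matches the "KV cache (est.)" column; real usage adds the framework overhead noted above.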

Why does vLLM show higher usage than "total est." above? vLLM pre-allocates the entire KV cache pool at startup. On a large GPU (e.g. 32 GB at --gpu-memory-utilization 0.45), it reserves 0.45 Γ— 32 GB = ~14 GB for KV cache even if no requests are active. The table above shows the minimum needed, not what vLLM will allocate when given more headroom.


Usage

Step 1 β€” Install vLLM

Requirements: Python 3.10+, CUDA GPU with β‰₯ 8 GB VRAM, vLLM 0.9.0+

pip install vllm

Do not install auto-gptq or pass --quantization gptq. This model uses compressed-tensors format, which vLLM handles automatically from config.json.


Step 2 β€” Download the model

huggingface-cli download Swicked86/phi4-mm-gptq --local-dir ./phi4-mm-gptq

Or let vLLM download on first run by passing the repo ID directly (see Step 3).


Step 3 β€” Launch the vLLM server

Minimum working command (text + vision + audio, 8–10 GB GPU):

python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

Extended context command (16 GB+ GPU, 65K context):

python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

Flag reference:

| Flag | Value | Why |
|---|---|---|
| --model | path or Swicked86/phi4-mm-gptq | Local dir or HF repo ID |
| --dtype | bfloat16 (required) | Model native dtype — do not use float16 |
| --trust-remote-code | required | phi4-mm uses custom modeling code |
| --max-model-len | 16384–131072 | See VRAM table above. 65,520 = 16 × 4095 (vLLM block-aligned). Beyond 65,520 is untested with this quantization — proceed at your own risk |
| --gpu-memory-utilization | 0.45–0.90 | Fraction of GPU VRAM to reserve for weights + KV cache |
| --enable-lora | required for vision/audio | Activates the rank-320 LoRA adapters |
| --max-lora-rank | 320 (required) | phi4-mm LoRAs are rank 320 (unusually large) |
| --lora-modules | speech=... vision=... | Points to the adapter subdirs — enables those modalities |
| --limit-mm-per-prompt | {"image": 3, "audio": 3} | Max attachments per message |
| --tool-call-parser | phi4_mini_json (optional) | phi4-mm emits functools[...] format — this parses it |
| --served-model-name | phi4-mm (optional) | Alias so clients use "model": "phi4-mm" |

No --quantization flag needed. vLLM reads quantization_config from config.json and activates the compressed-tensors int4 kernels automatically.

Wait for "Application startup complete":

curl http://localhost:8080/health   # β†’ {"status":"ok"}
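Weight loading can take a minute or two, so scripts that launch the server should poll rather than race it. A small stdlib-only sketch (the URL matches the launch commands above):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll the vLLM /health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet
        time.sleep(interval)
    return False

# wait_for_health("http://localhost:8080/health")  # call after launching the server
```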

Text (Python β€” openai SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Vision β€” image understanding (Python)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

curl:

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Audio β€” speech transcription / understanding (Python)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

phi4-mm uses a custom conformer-based audio encoder (24 conformer blocks) with a rank-320 speech LoRA applied to the language decoder β€” no separate ASR model needed. Supported formats: wav, mp3, ogg, flac.

curl:

AUDIO_B64=$(base64 -w0 audio.wav)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"${AUDIO_B64}\", \"format\": \"wav\"}},
        {\"type\": \"text\", \"text\": \"Transcribe and summarise.\"}
      ]
    }],
    \"max_tokens\": 512
  }"

Combined β€” image + audio in one prompt

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",  "image_url":  {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Answer the spoken question about the image."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

Tool calling

phi4-mm emits tool calls as functools[{"name":"...","arguments":{...}}]. The --tool-call-parser phi4_mini_json flag (vLLM 0.7+) handles this automatically. For a complete chat template that injects tools into phi4-mm's native <|tool|>...<|/tool|> block, see deploy/wsl-vllm/phi4-mm-tool-template.jinja in the companion repo.
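On the client side, the standard OpenAI tools parameter works against this endpoint once the parser flag is set. A hedged sketch — the get_weather tool below is a hypothetical example, not part of this repo:

```python
# Hypothetical tool schema for demonstration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def ask_with_tools(prompt: str):
    # Import here so the schema above can be inspected without the SDK installed.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
    response = client.chat.completions.create(
        model="phi4-mm",
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
        tool_choice="auto",
        max_tokens=256,
    )
    # With --tool-call-parser phi4_mini_json, parsed calls land here.
    return response.choices[0].message.tool_calls

# ask_with_tools("What's the weather in Paris right now?")
```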


Load locally with Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "Swicked86/phi4-mm-gptq"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Requires pip install llmcompressor (or pip install compressed-tensors) to load the quantization_config from the checkpoint.
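A minimal text-only generation sketch on top of the snippet above. The <|user|>...<|end|><|assistant|> prompt format and the processor call shape follow the upstream Phi-4-multimodal-instruct model card; treat them as assumptions here, not guarantees:

```python
# Hedged generation sketch for the model/processor loaded above.

def build_prompt(user_text: str) -> str:
    # Chat format per the upstream model card (assumption).
    return f"<|user|>{user_text}<|end|><|assistant|>"

def generate_text(model, processor, user_text: str, max_new_tokens: int = 128) -> str:
    inputs = processor(build_prompt(user_text), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the assistant reply is decoded.
    reply_ids = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(reply_ids, skip_special_tokens=True)[0]

# generate_text(model, processor, "What is the capital of France?")
```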


Quality

Inference Tests

All tests run via vLLM on RTX 5090, --gpu-memory-utilization 0.45.

Text β€” factual recall (< 0.5s):

Prompt: "What is the capital of France?" Response: "The capital of France is Paris." βœ…

Text β€” math reasoning (0.748s):

Prompt: "Solve step by step: If a train travels 120 miles in 2 hours, what is its speed in km/h?" Response: step-by-step solution β†’ 96.54 km/h βœ…

Text β€” code generation (1.068s):

Prompt: "Write a Python function that checks if a string is a palindrome." Response: correct is_palindrome() with docstring + example calls βœ…

Vision β€” real image (3000Γ—4000 JPEG):

Prompt: "Describe what you see in this image in detail." Response: correctly identified anime figure on a TV screen, described the room, entertainment setup, and animation style βœ…

Audio β€” real voice message (Discord OGG Opus, converted to 16kHz WAV):

Input: Discord voice message (~11s) discussing software development Response: "The speaker is describing the process of transforming the rag function into a function that uses a local database rather than writing to and from files." βœ…

Audio note: Discord voice messages are OGG Opus at 48 kHz. Convert to 16 kHz mono WAV before sending for best results. Pass "format": "wav" in the request.
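The conversion is a one-liner, assuming ffmpeg is installed (filenames are placeholders):

```shell
# Decode OGG Opus and resample to 16 kHz mono 16-bit PCM WAV.
ffmpeg -i voice-message.ogg -ar 16000 -ac 1 -c:a pcm_s16le voice-message.wav
```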

vLLM LoRA note: vLLM currently only applies LoRA to the language model layers. Vision encoder LoRA layers (SigLIP) are silently skipped β€” this is a vLLM limitation. The speech LoRA (language decoder, rank-320) loaded and applied correctly.


Perplexity (wikitext-2-raw, context 512)

| Model | PPL | vs bf16 |
|---|---|---|
| bf16 (baseline) | 14.9338 ± 0.107 | — |
| W4A16 GPTQ (this model) | pending | pending |

Benchmark will be added after upload.


Quantization Details

| Item | Value |
|---|---|
| Quantizer | llm-compressor (GPTQModifier) |
| Scheme | W4A16 (int4 weights, bfloat16 activations) |
| Group size | 128 |
| Sequential targets | Phi3DecoderLayer (32 × Phi3 text transformer blocks) |
| Excluded (kept bf16) | lm_head, model.embed_tokens_extend.* — covers the SigLIP-400M vision encoder + conformer-based audio encoder |
| Calibration | 512 samples, wikitext-2 |
| Source model | ~/phi4-mm-hf (safetensors, downloaded from HF) |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |
| Script | scripts/quantize_phi4mm.py |
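The recipe above can be sketched in llm-compressor terms as follows. This is a hedged approximation of what scripts/quantize_phi4mm.py does — the llm-compressor API surface shifts between releases, so treat the argument names as assumptions rather than a drop-in script:

```python
SEQUENTIAL_TARGETS = ["Phi3DecoderLayer"]   # only the 32 text decoder blocks
NUM_CALIBRATION_SAMPLES = 512               # wikitext-2 calibration set

def build_recipe():
    # Imported inside the function so the sketch reads without the package installed.
    from llmcompressor.modifiers.quantization import GPTQModifier
    return GPTQModifier(
        targets="Linear",
        scheme="W4A16",  # int4 weights / bf16 activations, group size 128
        # Keep lm_head and the vision/audio encoders at bf16 (regex is an assumption).
        ignore=["lm_head", "re:.*embed_tokens_extend.*"],
    )

def quantize(model_path: str, save_dir: str) -> None:
    from llmcompressor import oneshot  # assumed entry point; versions vary
    oneshot(
        model=model_path,
        dataset="wikitext",
        recipe=build_recipe(),
        sequential_targets=SEQUENTIAL_TARGETS,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        output_dir=save_dir,
    )
```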

Architecture

| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech (Conformer + rank-320 LoRA) |
| Text decoder | 32 × Phi3DecoderLayer — quantized to int4 |
| Vision encoder | SigLIP2 (embed_tokens_extend.image_embed) — bf16 |
| Audio encoder | Conformer-based audio encoder (24-block, 460M) + speech LoRA rank-320 — bf16 |
