Phi-4-multimodal-instruct W4A16 GPTQ

GPTQ W4A16 quantization of microsoft/Phi-4-multimodal-instruct β€” a 5.6 B parameter multimodal model by Microsoft supporting text, vision (images), and audio inputs.

Quantized with llm-compressor on RTX 5090. Weights stored in compressed-tensors format β€” natively loaded by vLLM.

License: MIT β€” Β© Microsoft Corporation. This quantization carries the same MIT license as the original model.


Why this quantization?

|  | bf16 safetensors | GGUF (Q4_K_M + mmproj) | This model (W4A16 GPTQ) |
|---|---|---|---|
| Size | ~14 GB | 2.37 GB + 825 MB | ~5–6 GB |
| Text | ✅ | ✅ | ✅ |
| Vision (images) | ✅ | ✅ | ✅ |
| Audio / Speech | ✅ | ❌ | ✅ |
| Serves with | vLLM | llama.cpp / LM Studio | vLLM |
| Quantization method | none | GGML int4 | GPTQ int4 (W4A16) |

The GGUF files in Swicked86/phi4-mm-gguf are smaller but lack audio. This model is the sweet spot: all three modalities at roughly β…“ the size of bf16.


Available Files

| File | Size | Notes |
|---|---|---|
| model-00001-of-00002.safetensors | ~3 GB | Quantized weight shard 1 |
| model-00002-of-00002.safetensors | ~2 GB | Quantized weight shard 2 |
| config.json | — | Includes quantization_config — vLLM auto-detects |
| tokenizer.model / tokenizer.json | — | Tokenizer |
| preprocessor_config.json | — | Vision + audio processor config (bf16 encoders) |

The SigLIP-400M vision encoder and conformer-based speech encoder are stored at full bfloat16 precision β€” only the Phi3 text transformer weights (32 decoder layers) are quantized to int4.


VRAM Requirements

Model weights occupy 6.11 GiB (measured). The remaining VRAM is used by the KV cache β€” vLLM pre-allocates the full KV cache pool at startup based on --gpu-memory-utilization and --max-model-len. Total VRAM allocated = weights + pre-allocated KV cache, regardless of how many requests are active.

By GPU tier

| GPU | VRAM | Recommended --max-model-len | --gpu-memory-utilization | Notes |
|---|---|---|---|---|
| RTX 3070 / 2080 Super | 8 GB | — | — | ⚠️ Not recommended. Weights alone are 6.1 GB; insufficient headroom for KV cache. |
| RTX 3080 10 GB / 2080 Ti | 10 GB | 16,384 | 0.85 | Minimum viable. Tight — use lowest context only. |
| RTX 3080 12 GB / 4070 | 12 GB | 16,384–32,768 | 0.85 | Comfortable at 16K; 32K fits with care. |
| RTX 3080 Ti / 4070 Ti / 4080 | 16 GB | 32,768–65,520 | 0.85–0.90 | Good balance of context and headroom. |
| RTX 3090 / 4090 / 4080 Super | 24 GB | 65,520 | 0.85–0.90 | Recommended. Full tested context, comfortable. |
| RTX 5090 / A6000 / A100 40 GB | 32+ GB | 65,520–131,072 | 0.45–0.90 | Plenty of headroom; lower utilization keeps VRAM free for other tasks. |

By context length

| --max-model-len | Weights | KV cache (est.) | Total (est.) | Min GPU VRAM |
|---|---|---|---|---|
| 16,384 | ~6.1 GB | ~1.5 GB | ~8 GB | 10 GB |
| 32,768 | ~6.1 GB | ~3.0 GB | ~10 GB | 12 GB |
| 65,520 | ~6.1 GB | ~6.0 GB | ~13 GB | 16 GB |
| 131,072 (max, untested) | ~6.1 GB | ~12.0 GB | ~19 GB | 24 GB |

KV cache estimates use Phi-4-Mini architecture (32 layers, 8 KV heads, head_dim 96, bf16 activations β‰ˆ 96 KB/token). Add ~1–2 GB for framework overhead. Weights measured on RTX 5090 with vLLM.
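The per-token figure above can be reproduced with a few lines of arithmetic; a minimal sketch using the Phi-4-Mini decoder constants (32 layers, 8 KV heads, head_dim 96, bf16 = 2 bytes per value):

```python
# Back-of-envelope KV-cache sizing for the Phi-4-Mini decoder.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_BF16 = 32, 8, 96, 2

def kv_cache_bytes_per_token() -> int:
    # K and V each store kv_heads * head_dim values per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16

def kv_cache_gb(max_model_len: int) -> float:
    return kv_cache_bytes_per_token() * max_model_len / 1024**3

print(kv_cache_bytes_per_token() / 1024)  # → 96.0  (KB per token)
print(round(kv_cache_gb(65_520), 2))      # → 6.0   (GB at 65,520 tokens)
```

This matches the "KV cache (est.)" column; real usage adds the framework overhead noted above.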

Why does vLLM show higher usage than "total est." above? vLLM pre-allocates the entire KV cache pool at startup. On a large GPU (e.g. 32 GB at --gpu-memory-utilization 0.45), it reserves 0.45 Γ— 32 GB = ~14 GB for KV cache even if no requests are active. The table above shows the minimum needed, not what vLLM will allocate when given more headroom.


Usage

Step 1 β€” Install vLLM

Requirements: Python 3.10+, CUDA GPU with β‰₯ 8 GB VRAM, vLLM 0.9.0+

pip install vllm

Do not install auto-gptq or pass --quantization gptq. This model uses compressed-tensors format, which vLLM handles automatically from config.json.


Step 2 β€” Download the model

huggingface-cli download Swicked86/phi4-mm-gptq --local-dir ./phi4-mm-gptq

Or let vLLM download on first run by passing the repo ID directly (see Step 3).


Step 3 β€” Launch the vLLM server

Minimum working command (text + vision + audio, 8–10 GB GPU):

python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

Extended context command (16 GB+ GPU, 65K context):

python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

Flag reference:

| Flag | Value | Why |
|---|---|---|
| --model | path or Swicked86/phi4-mm-gptq | Local dir or HF repo ID |
| --dtype | bfloat16 (required) | Model native dtype — do not use float16 |
| --trust-remote-code | required | phi4-mm uses custom modeling code |
| --max-model-len | 16384–131072 | See VRAM table above. 65,520 = 16 × 4095 (vLLM block-aligned). Beyond 65,520 is untested with this quantization — proceed at your own risk |
| --gpu-memory-utilization | 0.45–0.90 | Fraction of GPU VRAM to reserve for weights + KV cache |
| --enable-lora | required for vision/audio | Activates the rank-320 LoRA adapters |
| --max-lora-rank | 320 (required) | phi4-mm LoRAs are rank 320 (unusually large) |
| --lora-modules | speech=... vision=... | Points to the adapter subdirs — enables those modalities |
| --limit-mm-per-prompt | {"image": 3, "audio": 3} | Max attachments per message |
| --tool-call-parser | phi4_mini_json (optional) | phi4-mm emits functools[...] format — this parses it |
| --served-model-name | phi4-mm (optional) | Alias so clients use "model": "phi4-mm" |

No --quantization flag needed. vLLM reads quantization_config from config.json and activates the compressed-tensors int4 kernels automatically.

Wait for "Application startup complete":

curl http://localhost:8080/health   # β†’ {"status":"ok"}
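Weight loading can take a minute or two, so scripts that launch the server should poll rather than race it. A small stdlib-only sketch (the URL matches the launch commands above):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll the vLLM /health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet
        time.sleep(interval)
    return False

# wait_for_health("http://localhost:8080/health")  # call after launching the server
```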

Text (Python β€” openai SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Vision β€” image understanding (Python)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

curl:

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Audio β€” speech transcription / understanding (Python)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

phi4-mm uses a custom conformer-based audio encoder (24 conformer blocks) with a rank-320 speech LoRA applied to the language decoder β€” no separate ASR model needed. Supported formats: wav, mp3, ogg, flac.

curl:

AUDIO_B64=$(base64 -w0 audio.wav)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"${AUDIO_B64}\", \"format\": \"wav\"}},
        {\"type\": \"text\", \"text\": \"Transcribe and summarise.\"}
      ]
    }],
    \"max_tokens\": 512
  }"

Combined β€” image + audio in one prompt

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",  "image_url":  {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Answer the spoken question about the image."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

Tool calling

phi4-mm emits tool calls as functools[{"name":"...","arguments":{...}}]. The --tool-call-parser phi4_mini_json flag (vLLM 0.7+) handles this automatically. For a complete chat template that injects tools into phi4-mm's native <|tool|>...<|/tool|> block, see deploy/wsl-vllm/phi4-mm-tool-template.jinja in the companion repo.
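On the client side, the standard OpenAI tools parameter works against this endpoint once the parser flag is set. A hedged sketch — the get_weather tool below is a hypothetical example, not part of this repo:

```python
# Hypothetical tool schema for demonstration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def ask_with_tools(prompt: str):
    # Import here so the schema above can be inspected without the SDK installed.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
    response = client.chat.completions.create(
        model="phi4-mm",
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
        tool_choice="auto",
        max_tokens=256,
    )
    # With --tool-call-parser phi4_mini_json, parsed calls land here.
    return response.choices[0].message.tool_calls

# ask_with_tools("What's the weather in Paris right now?")
```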


Load locally with Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "Swicked86/phi4-mm-gptq"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Requires pip install llmcompressor (or pip install compressed-tensors) to load the quantization_config from the checkpoint.
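A minimal text-only generation sketch on top of the snippet above. The <|user|>...<|end|><|assistant|> prompt format and the processor call shape follow the upstream Phi-4-multimodal-instruct model card; treat them as assumptions here, not guarantees:

```python
# Hedged generation sketch for the model/processor loaded above.

def build_prompt(user_text: str) -> str:
    # Chat format per the upstream model card (assumption).
    return f"<|user|>{user_text}<|end|><|assistant|>"

def generate_text(model, processor, user_text: str, max_new_tokens: int = 128) -> str:
    inputs = processor(build_prompt(user_text), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the assistant reply is decoded.
    reply_ids = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(reply_ids, skip_special_tokens=True)[0]

# generate_text(model, processor, "What is the capital of France?")
```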


Quality

Inference Tests

All tests run via vLLM on RTX 5090, --gpu-memory-utilization 0.45.

Text β€” factual recall (< 0.5s):

Prompt: "What is the capital of France?" Response: "The capital of France is Paris." βœ…

Text β€” math reasoning (0.748s):

Prompt: "Solve step by step: If a train travels 120 miles in 2 hours, what is its speed in km/h?" Response: step-by-step solution β†’ 96.54 km/h βœ…

Text β€” code generation (1.068s):

Prompt: "Write a Python function that checks if a string is a palindrome." Response: correct is_palindrome() with docstring + example calls βœ…

Vision β€” real image (3000Γ—4000 JPEG):

Prompt: "Describe what you see in this image in detail." Response: correctly identified anime figure on a TV screen, described the room, entertainment setup, and animation style βœ…

Audio β€” real voice message (Discord OGG Opus, converted to 16kHz WAV):

Input: Discord voice message (~11s) discussing software development Response: "The speaker is describing the process of transforming the rag function into a function that uses a local database rather than writing to and from files." βœ…

Audio note: Discord voice messages are OGG Opus at 48 kHz. Convert to 16 kHz mono WAV before sending for best results. Pass "format": "wav" in the request.
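The conversion is a one-liner, assuming ffmpeg is installed (filenames are placeholders):

```shell
# Decode OGG Opus and resample to 16 kHz mono 16-bit PCM WAV.
ffmpeg -i voice-message.ogg -ar 16000 -ac 1 -c:a pcm_s16le voice-message.wav
```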

vLLM LoRA note: vLLM currently only applies LoRA to the language model layers. Vision encoder LoRA layers (SigLIP) are silently skipped β€” this is a vLLM limitation. The speech LoRA (language decoder, rank-320) loaded and applied correctly.


Perplexity (wikitext-2-raw, context 512)

| Model | PPL | vs bf16 |
|---|---|---|
| bf16 (baseline) | 14.9338 ± 0.107 | — |
| W4A16 GPTQ (this model) | pending | pending |

Benchmark will be added after upload.


Quantization Details

| Item | Value |
|---|---|
| Quantizer | llm-compressor (GPTQModifier) |
| Scheme | W4A16 (int4 weights, bfloat16 activations) |
| Group size | 128 |
| Sequential targets | Phi3DecoderLayer (32 × Phi3 text transformer blocks) |
| Excluded (kept bf16) | lm_head, model.embed_tokens_extend.* — covers the SigLIP-400M vision encoder + conformer-based audio encoder |
| Calibration | 512 samples, wikitext-2 |
| Source model | ~/phi4-mm-hf (safetensors, downloaded from HF) |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |
| Script | scripts/quantize_phi4mm.py |
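The recipe above can be sketched in llm-compressor terms as follows. This is a hedged approximation of what scripts/quantize_phi4mm.py does — the llm-compressor API surface shifts between releases, so treat the argument names as assumptions rather than a drop-in script:

```python
SEQUENTIAL_TARGETS = ["Phi3DecoderLayer"]   # only the 32 text decoder blocks
NUM_CALIBRATION_SAMPLES = 512               # wikitext-2 calibration set

def build_recipe():
    # Imported inside the function so the sketch reads without the package installed.
    from llmcompressor.modifiers.quantization import GPTQModifier
    return GPTQModifier(
        targets="Linear",
        scheme="W4A16",  # int4 weights / bf16 activations, group size 128
        # Keep lm_head and the vision/audio encoders at bf16 (regex is an assumption).
        ignore=["lm_head", "re:.*embed_tokens_extend.*"],
    )

def quantize(model_path: str, save_dir: str) -> None:
    from llmcompressor import oneshot  # assumed entry point; versions vary
    oneshot(
        model=model_path,
        dataset="wikitext",
        recipe=build_recipe(),
        sequential_targets=SEQUENTIAL_TARGETS,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        output_dir=save_dir,
    )
```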

Architecture

| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech (Conformer + rank-320 LoRA) |
| Text decoder | 32 × Phi3DecoderLayer — quantized to int4 |
| Vision encoder | SigLIP2 (embed_tokens_extend.image_embed) — bf16 |
| Audio encoder | Conformer-based audio encoder (24-block, 460M) + speech LoRA rank-320 — bf16 |
