# Phi-4-multimodal-instruct GGUF Quantizations
GGUF quantizations of microsoft/Phi-4-multimodal-instruct, a 5.6 B-parameter multimodal model by Microsoft supporting text, vision (image), and audio inputs.

Produced with llama.cpp build b8347 on an RTX 5090.

License: MIT, © Microsoft Corporation. These quantizations carry the same MIT license as the original model.
## Available Files
| File | Quant | Size | BPW | Best for |
|---|---|---|---|---|
| `phi4-mm-f16.gguf` | F16 | 7.17 GB | 16.0 | Re-quantization base, maximum quality |
| `phi4-mm-Q8_0.gguf` | Q8_0 | 3.90 GB | 8.0 | High-end GPU, near-lossless |
| `phi4-mm-Q4_K_M.gguf` | Q4_K_M | 2.37 GB | 5.18 | CPU / constrained VRAM, good quality |
| `mmproj-phi4-mm-f16.gguf` | F16 | 825 MB | 16.0 | Vision encoder, required for image input |
**One mmproj for all:** `mmproj-phi4-mm-f16.gguf` works with every text GGUF above. It cannot be quantized further: the CLIP feed-forward layer dimension (4304) is not divisible by 32.
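The divisibility constraint is easy to check directly. A minimal sketch (the block size of 32 reflects the quantization block granularity stated above):

```python
# llama.cpp quant formats pack weights in fixed-size blocks, so every
# tensor row length must be a multiple of the block size (32 here).
BLOCK = 32

def quantizable(dim: int, block: int = BLOCK) -> bool:
    """True if a row of `dim` weights splits into whole quant blocks."""
    return dim % block == 0

print(4304 % BLOCK)        # 16 -- the CLIP FF dimension leaves a remainder
print(quantizable(4304))   # False
print(quantizable(4096))   # True (a typical LLM hidden size)
```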
## VRAM Requirements

Full GPU offload (`-ngl 99`):
| Configuration | VRAM |
|---|---|
| F16 + mmproj-F16 | ~10,000 MiB |
| Q8_0 + mmproj-F16 | ~5,400 MiB |
| Q4_K_M + mmproj-F16 | ~3,500 MiB |
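As a rough rule of thumb, full-offload VRAM tracks the sum of the model file, the mmproj, and runtime buffers. A back-of-envelope sketch (the ~300 MiB overhead constant is an assumption, not a measurement, and the KV cache grows with context length):

```python
def estimate_vram_mib(model_file_gb: float, mmproj_mb: float = 825,
                      overhead_mib: float = 300) -> float:
    """Approximate full-offload VRAM: weights + mmproj + runtime overhead."""
    return model_file_gb * 1024 + mmproj_mb + overhead_mib

print(round(estimate_vram_mib(2.37)))  # Q4_K_M: in the ballpark of ~3,500 MiB
```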
## Quality Metrics

### Perplexity (wikitext-2-raw test set, context 512)
| Model | PPL | vs F16 |
|---|---|---|
| F16 (baseline) | 14.9338 ± 0.107 | baseline |
| Q8_0 | 14.9107 ± 0.106 | −0.15% (effectively lossless) |
| Q4_K_M | 16.3183 ± 0.121 | +9.3% |
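The "vs F16" column is just each quant's relative perplexity change against the baseline:

```python
f16, q8_0, q4_k_m = 14.9338, 14.9107, 16.3183  # PPL values from the table

def ppl_delta_pct(quant: float, base: float) -> float:
    """Relative perplexity change vs the F16 baseline, in percent."""
    return (quant - base) / base * 100

print(f"Q8_0:   {ppl_delta_pct(q8_0, f16):+.2f}%")    # -0.15%
print(f"Q4_K_M: {ppl_delta_pct(q4_k_m, f16):+.2f}%")  # +9.27%
```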
### Throughput (llama-bench, RTX 5090, pp512 / tg128)
| Model | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| Q8_0 | 21,352 | 247 |
| Q4_K_M | 19,904 | 324 |
### Multimodal Benchmarks (lmms-eval)

VQA evaluation is in progress; results will be added when complete. Suite: MMStar, OCRBench, AI2D, MathVista, HallusionBench.
## Usage

### LM Studio (Recommended for Desktop)

LM Studio has native GGUF support, including multimodal vision. No command line needed.
**Text-only:**

1. Open LM Studio and search for `Swicked86/phi4-mm-gguf`
2. Download `phi4-mm-Q8_0.gguf` (GPU) or `phi4-mm-Q4_K_M.gguf` (CPU / low VRAM)
3. Load the model and open a chat
**With vision (image input):**

1. Download both `phi4-mm-Q8_0.gguf` and `mmproj-phi4-mm-f16.gguf`
2. Load the main model in LM Studio
3. In Model Settings → Multimodal → Vision Model (mmproj), browse to `mmproj-phi4-mm-f16.gguf`
4. In Chat, click the image icon to attach a photo and ask questions about it

The mmproj file is the vision encoder; without it the model runs text-only. `mmproj-phi4-mm-f16.gguf` is compatible with all three text GGUFs.
### llama.cpp CLI

#### Step 1: Download the files

```bash
# Install huggingface-cli if needed
pip install huggingface_hub

# Text + vision (recommended)
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q8_0.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

# CPU / low-VRAM variant
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q4_K_M.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm
```
Build llama.cpp if you haven't already:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release -j$(nproc)
```
#### Step 2: Interactive multimodal chat (images + text)

`llama-mtmd-cli` launches an interactive session. Type your prompt, or load an image first with `/image <path>`:

```bash
./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 --ctx-size 8192
```
Inside the session:

```text
> /image photo.jpg
Image loaded.
> What is in this image?
[model describes the image]
> /image chart.png
Image loaded.
> Summarise the trend shown in this chart.
[model analyses the chart]
> Explain the previous image again but in French.
[responds without re-loading the image]
```
**Single-shot (non-interactive):**

```bash
./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "Describe this image in detail." \
  -ngl 99 --threads 16 --no-display-prompt
```
**CPU (no GPU):**

```bash
./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q4_K_M.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "What objects are in this photo?" \
  --threads 8
```
#### Text-only (no image)

```bash
./build/bin/llama-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --ctx-size 65536 --flash-attn on --kv-offload -ngl 99 --threads 16 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>"
```
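For scripting, the raw prompt above can be assembled from a message list. A minimal sketch of the chat format as used in that command (inferred from the prompt string, not from the official tokenizer config):

```python
def build_phi4_prompt(messages: list[dict]) -> str:
    """Render messages as <|role|>content<|end|> and append the assistant
    tag for the model to complete."""
    rendered = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    return rendered + "<|assistant|>"

print(build_phi4_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]))
# <|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>
```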
### llama-server: OpenAI-compatible API

Serves both text and vision via `/v1/chat/completions`. Useful for integrations (Open WebUI, SillyTavern, Continue.dev, etc.):

```bash
./build/bin/llama-server \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 \
  --ctx-size 8192 --parallel 4 \
  --port 8080 --host 127.0.0.1
```
**Text query (curl):**

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4-mm",
    "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
    "max_tokens": 300
  }'
```
**Image query (curl, base64):**

```bash
IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"
```
**Image query (Python, openai SDK):**

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```
### Modality support summary for GGUF (llama.cpp / LM Studio)

| Modality | Supported | Notes |
|---|---|---|
| Text | ✅ | All three GGUFs |
| Vision (images) | ✅ | Requires `mmproj-phi4-mm-f16.gguf` + `--mmproj` flag |
| Audio / Speech | ❌ | Not available; see below |
Audio is not supported in the GGUF files. phi4-mm's speech capability uses a custom conformer-based audio encoder (24 conformer blocks, initialized from a proprietary AED ASR model) plus a rank-320 speech LoRA applied to the language decoder. The GGUF conversion pipeline (`convert_hf_to_gguf.py`) only exports the text transformer and the SigLIP vision encoder (mmproj); the audio-encoder tensors are not extracted. There is currently no `audioproj` equivalent in llama.cpp for phi4-mm. For audio/speech transcription, use the vLLM path below with the original bf16 safetensors model.
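One way to verify this locally is to dump the tensor names from a text GGUF (for example with the `gguf` Python package's `GGUFReader`) and scan them for audio-encoder markers. A sketch with a pure helper; the marker strings and sample tensor names are illustrative assumptions:

```python
def has_audio_tensors(tensor_names: list[str]) -> bool:
    """True if any tensor name looks like part of an audio encoder.

    The marker strings are guesses at plausible naming conventions; the
    point is that a phi4-mm text GGUF holds only text-transformer tensors.
    """
    markers = ("audio", "conformer", "speech")
    return any(m in name.lower() for name in tensor_names for m in markers)

# Representative tensor names from a llama.cpp 'phi3'-arch text GGUF:
text_gguf = ["token_embd.weight", "blk.0.attn_qkv.weight",
             "blk.0.ffn_up.weight", "output_norm.weight"]
print(has_audio_tensors(text_gguf))  # False: no audio encoder was exported
```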
## vLLM: Full Multimodal (Image + Audio, bf16 Safetensors)

Use vLLM when you want maximum quality and full multimodal support (images + audio) from the original bf16 safetensors weights. For constrained hardware, use the GGUF options above instead.

**Requirements:** Python 3.10+, a CUDA GPU with ~16 GB VRAM, vLLM 0.7.0+
### 1. Install vLLM and download the model

```bash
python3 -m venv ~/.vllm-env
source ~/.vllm-env/bin/activate
pip install --upgrade pip
pip install vllm

# Download the original safetensors model (~14 GB, 3 shards)
huggingface-cli login   # paste your HF token if the model is gated
huggingface-cli download microsoft/Phi-4-multimodal-instruct \
  --local-dir ~/phi4-mm-hf
```
### 2. Launch the vLLM server

```bash
source ~/.vllm-env/bin/activate
python -m vllm.entrypoints.openai.api_server \
  --model ~/phi4-mm-hf \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --kv-cache-memory-bytes 8G \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm
```
| Flag | Value | Notes |
|---|---|---|
| `--dtype` | `bfloat16` | Native dtype; do not change to float16 |
| `--max-model-len` | `65520` | Stable context ceiling for phi4-mm (131 K nominal) |
| `--kv-cache-memory-bytes` | `8G` | Tune down to `4G` on 12–16 GB GPUs |
| `--limit-mm-per-prompt` | `{"image": 3, "audio": 3}` | Max attachments per request |
| `--tool-call-parser` | `phi4_mini_json` | phi4-mm emits `functools[...]`, not Hermes; required for tool calling |
| `--trust-remote-code` | (flag) | Required for phi4-mm's custom modelling code |
**Official vLLM LoRA flags:** Microsoft's published vLLM command includes explicit LoRA adapter flags to activate the rank-320 vision and speech adapters stored in separate subfolders of the model directory:

```bash
--enable-lora \
--max-lora-rank 320 \
--lora-extra-vocab-size 0 \
--max-loras 2 \
--lora-modules speech=~/phi4-mm-hf/speech-lora vision=~/phi4-mm-hf/vision-lora
```

If you experience degraded vision or audio quality, add these flags to the launch command above.
Wait for the server to finish loading (~60 s):

```bash
curl http://localhost:8080/health   # → {"status":"ok"}
```
### 3. Send requests (OpenAI-compatible API)

**Text (Python):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
**Image (Python):**

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
**Image (curl, base64):**

```bash
IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"
```
**Audio: speech transcription / understanding (Python):**

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

phi4-mm uses a custom conformer-based audio encoder with a rank-320 speech LoRA; no separate ASR model is needed. Supported formats: `wav`, `mp3`, `ogg`, `flac`.
### Tool calling

phi4-mm emits tool calls as `functools[{"name": "...", "arguments": {...}}]`. The `--tool-call-parser phi4_mini_json` flag (vLLM 0.7+) handles this format automatically. For a complete chat template that injects tools into phi4-mm's native `<|tool|>...<|/tool|>` block, see `deploy/wsl-vllm/phi4-mm-tool-template.jinja` in the companion repo.
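If you hit the server without the parser flag (or on an older vLLM), the raw format can be handled client-side. A minimal sketch, assuming the reply begins with the literal `functools` prefix followed by a JSON array, as described above:

```python
import json

def parse_functools_calls(text: str) -> list[dict]:
    """Extract tool calls from a raw `functools[...]` reply; return []
    for plain-text replies with no tool call."""
    stripped = text.strip()
    prefix = "functools"
    if not stripped.startswith(prefix + "["):
        return []
    return json.loads(stripped[len(prefix):])

raw = 'functools[{"name": "get_weather", "arguments": {"city": "Paris"}}]'
print(parse_functools_calls(raw)[0]["name"])   # get_weather
print(parse_functools_calls("Plain answer."))  # []
```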
## Ollama (CPU / NUC / edge)

See the `deploy/` folder for a complete Modelfile, NUC install script, and OpenClaw integration config.

```
FROM ./phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7
```
## Architecture
| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| GGUF arch | phi3 |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech |
The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a SigLIP-400M encoder finetuned with LLM2CLIP, with a 2-layer MLP projector. Audio/speech is not embedded in the GGUF; see the audio limitation callout above.
## Conversion Details
| Item | Value |
|---|---|
| Converter | llama.cpp `convert_hf_to_gguf.py` (text) + custom mmproj converter |
| llama.cpp build | b8347 / fc350fdf9 |
| Source | microsoft/Phi-4-multimodal-instruct |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |
Quantization commands:

```bash
# Q8_0
llama-quantize phi4-mm-f16.gguf phi4-mm-Q8_0.gguf Q8_0

# Q4_K_M
llama-quantize phi4-mm-f16.gguf phi4-mm-Q4_K_M.gguf Q4_K_M
```
## NUC / Edge Deployment

The Q4_K_M + `mmproj-phi4-mm-f16.gguf` combination (~3,500 MiB VRAM) fits on:
- Intel NUC 13/14 Pro (Intel Arc iGPU, 4β8 GB shared VRAM)
- Systems with 8 GB unified memory (Apple Silicon M-series, etc.)
See `deploy/` for install scripts and configuration.
## Related

- microsoft/Phi-4-multimodal-instruct: original model weights
- llama.cpp: inference engine
- lmms-eval: multimodal evaluation harness