Phi-4-multimodal-instruct GGUF Quantizations

GGUF quantizations of microsoft/Phi-4-multimodal-instruct, a 5.6 B-parameter multimodal model by Microsoft supporting text, vision (images), and audio inputs.

Produced with llama.cpp build b8347 on an RTX 5090.

License: MIT, © Microsoft Corporation. These quantizations carry the same MIT license as the original model.


Available Files

| File | Quant | Size | BPW | Best for |
|---|---|---|---|---|
| phi4-mm-f16.gguf | F16 | 7.17 GB | 16.0 | Re-quantization base, maximum quality |
| phi4-mm-Q8_0.gguf | Q8_0 | 3.90 GB | 8.0 | High-end GPU; near-lossless |
| phi4-mm-Q4_K_M.gguf | Q4_K_M | 2.37 GB | 5.18 | CPU / constrained VRAM; good quality |
| mmproj-phi4-mm-f16.gguf | F16 | 825 MB | 16.0 | Vision encoder; required for image input |

One mmproj for all: mmproj-phi4-mm-f16.gguf works with every text GGUF above. It cannot be quantized further: the vision encoder's feed-forward layer dimension (4304) is not divisible by 32, the block size GGUF quantization requires.
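The divisibility constraint is easy to verify: GGUF block-quantization formats pack weights in fixed-size blocks (32 elements for Q8_0), so every tensor row length must be a multiple of the block size. A quick sanity check:

```python
# Sanity check: GGUF block quantization packs weights in fixed-size blocks
# (32 elements for Q8_0), so every tensor row length must divide evenly.
BLOCK_SIZE = 32
ffn_dim = 4304  # feed-forward dimension of this mmproj's vision encoder

remainder = ffn_dim % BLOCK_SIZE
print(f"{ffn_dim} % {BLOCK_SIZE} = {remainder}")  # 16 -> leftover elements
print("quantizable:", remainder == 0)             # False
```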


VRAM Requirements

Full GPU offload (-ngl 99):

| Configuration | VRAM |
|---|---|
| F16 + mmproj-F16 | ~10,000 MiB |
| Q8_0 + mmproj-F16 | ~5,400 MiB |
| Q4_K_M + mmproj-F16 | ~3,500 MiB |

Quality Metrics

Perplexity (wikitext-2-raw test set, context 512)

| Model | PPL | vs F16 |
|---|---|---|
| F16 (baseline) | 14.9338 ± 0.107 | – |
| Q8_0 | 14.9107 ± 0.106 | −0.15% ✅ lossless |
| Q4_K_M | 16.3183 ± 0.121 | +9.3% |
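The "vs F16" column is the relative change in perplexity against the F16 baseline, reproducible directly from the table values:

```python
# Reproduce the "vs F16" column: relative perplexity change vs. the baseline.
baseline = 14.9338  # F16 PPL
quants = {"Q8_0": 14.9107, "Q4_K_M": 16.3183}

for name, ppl in quants.items():
    delta = (ppl - baseline) / baseline * 100
    print(f"{name}: {delta:+.2f}%")
# Q8_0: -0.15%, Q4_K_M: +9.27% (rounded to +9.3% in the table)
```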

Throughput (llama-bench, RTX 5090, pp512 / tg128)

| Model | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| Q8_0 | 21,352 | 247 |
| Q4_K_M | 19,904 | 324 |
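As a back-of-envelope exercise, these rates translate into wall-clock latency via time = tokens ÷ throughput. A sketch using the table values and an illustrative request size:

```python
# Convert llama-bench throughput into approximate request latency:
# latency = prompt_tokens / prefill_rate + gen_tokens / decode_rate.
rates = {"Q8_0": (21352, 247), "Q4_K_M": (19904, 324)}  # (pp t/s, tg t/s)
prompt_tokens, gen_tokens = 2048, 256  # an illustrative request

for name, (pp, tg) in rates.items():
    total = prompt_tokens / pp + gen_tokens / tg
    print(f"{name}: ~{total:.2f} s "
          f"(prefill {prompt_tokens / pp * 1000:.0f} ms, "
          f"decode {gen_tokens / tg:.2f} s)")
```

Note that decode speed dominates end-to-end latency here, which is why Q4_K_M (faster generation, slower prefill) wins on total time for generation-heavy workloads.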

Multimodal Benchmarks (lmms-eval)

VQA evaluation is in progress; results will be added when complete. Suite: MMStar, OCRBench, AI2D, MathVista, HallusionBench.


Usage

LM Studio (Recommended for Desktop)

LM Studio has native GGUF support including multimodal vision. No command-line needed.

Text-only:

  1. Open LM Studio → search for Swicked86/phi4-mm-gguf
  2. Download phi4-mm-Q8_0.gguf (GPU) or phi4-mm-Q4_K_M.gguf (CPU / low VRAM)
  3. Load the model → Chat

With vision (image input):

  1. Download both phi4-mm-Q8_0.gguf and mmproj-phi4-mm-f16.gguf
  2. Load the main model in LM Studio
  3. In Model Settings → Multimodal → Vision Model (mmproj), browse to mmproj-phi4-mm-f16.gguf
  4. In Chat, click the image icon to attach a photo and ask questions about it

The mmproj file is the vision encoder. Without it the model runs text-only.
mmproj-phi4-mm-f16.gguf is compatible with all three text GGUFs.


llama.cpp CLI

Step 1 β€” Download the files

# Install huggingface-cli if needed
pip install huggingface_hub

# Text + vision (recommended)
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q8_0.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

# CPU / low-VRAM variant
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q4_K_M.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

Build llama.cpp if you haven't already:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j$(nproc)

Step 2 β€” Interactive multimodal chat (images + text)

llama-mtmd-cli launches an interactive session. Type your prompt, or prefix it with an image path using /image:

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 --ctx-size 8192

Inside the session:

> /image photo.jpg
Image loaded.
> What is in this image?
[model describes the image]

> /image chart.png
Image loaded.
> Summarise the trend shown in this chart.
[model analyses the chart]

> Explain the previous image again but in French.
[responds without re-loading the image]

Single-shot (non-interactive):

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "Describe this image in detail." \
  -ngl 99 --threads 16 --no-display-prompt

CPU (no GPU):

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q4_K_M.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "What objects are in this photo?" \
  --threads 8

Text-only (no image)

./build/bin/llama-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --ctx-size 65536 --flash-attn on -ngl 99 --threads 16 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>"

llama-server β€” OpenAI-compatible API

Serves both text and vision via /v1/chat/completions. Useful for integrations (Open WebUI, SillyTavern, Continue.dev, etc.):

./build/bin/llama-server \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 \
  --ctx-size 8192 --parallel 4 \
  --port 8080 --host 127.0.0.1

Text query (curl):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4-mm",
    "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
    "max_tokens": 300
  }'

Image query (curl, base64):

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Image query (Python, openai SDK):

import base64, httpx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)

Modality support summary for GGUF (llama.cpp / LM Studio)

| Modality | Supported | Notes |
|---|---|---|
| Text | ✅ | All three GGUFs |
| Vision (images) | ✅ | Requires mmproj-phi4-mm-f16.gguf + --mmproj flag |
| Audio / Speech | ❌ | Not available; see below |

Audio is not supported in the GGUF files. phi4-mm's speech capability uses a custom conformer-based audio encoder (24 conformer blocks, initialized from a proprietary AED ASR model) plus a rank-320 speech LoRA applied to the language decoder. The GGUF conversion pipeline (convert_hf_to_gguf.py) exports only the text transformer and the SigLIP vision encoder (mmproj); the audio encoder tensors are not extracted. There is currently no audioproj equivalent in llama.cpp for phi4-mm.
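One way to confirm this yourself is to list the tensor names inside a GGUF (for example with the gguf Python package's GGUFReader) and bucket them by prefix. The prefixes below follow llama.cpp's usual naming conventions, and the sample list is illustrative rather than read from the actual file:

```python
# Bucket GGUF tensor names by modality. Prefixes follow llama.cpp naming
# conventions: "blk.N.*" / "token_embd*" / "output*" for the text model,
# "v.*" / "mm.*" for the vision mmproj. The sample names are illustrative.
def modalities(tensor_names):
    found = set()
    for name in tensor_names:
        if name.startswith(("v.", "mm.")):
            found.add("vision")
        elif name.startswith(("a.", "audio.")):
            found.add("audio")
        else:
            found.add("text")
    return found

# Sample names in the style of the text GGUF; a real list would come from
# iterating over gguf.GGUFReader(path).tensors.
sample = ["token_embd.weight", "blk.0.attn_qkv.weight", "output_norm.weight"]
print(modalities(sample))  # {'text'} -- no audio encoder tensors present
```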

To use audio/speech transcription, use the vLLM path below with the original bf16 safetensors model.


vLLM β€” Full Multimodal (Image + Audio, bf16 Safetensors)

Use vLLM when you want maximum quality and full multimodal support (images + audio) from the original bf16 safetensors weights. For constrained hardware, use the GGUF options above instead.

Requirements: Python 3.10+, CUDA GPU with ~16 GB VRAM, vLLM 0.7.0+

1 β€” Install vLLM and download the model

python3 -m venv ~/.vllm-env
source ~/.vllm-env/bin/activate
pip install --upgrade pip
pip install vllm

# Download the original safetensors model (~14 GB, 3 shards)
huggingface-cli login          # paste your HF token if the model is gated
huggingface-cli download microsoft/Phi-4-multimodal-instruct \
    --local-dir ~/phi4-mm-hf

2 β€” Launch the vLLM server

source ~/.vllm-env/bin/activate
python -m vllm.entrypoints.openai.api_server \
  --model ~/phi4-mm-hf \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --kv-cache-memory-bytes 8G \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

| Flag | Value | Notes |
|---|---|---|
| --dtype | bfloat16 | Native dtype; do not change to float16 |
| --max-model-len | 65520 | Stable context ceiling for phi4-mm (131 K nominal) |
| --kv-cache-memory-bytes | 8G | Tune down to 4G on 12-16 GB GPUs |
| --limit-mm-per-prompt | {"image":3,"audio":3} | Max attachments per request |
| --tool-call-parser | phi4_mini_json | phi4-mm emits functools[...], not Hermes format; required for tool calling |
| --trust-remote-code | – | Required for phi4-mm's custom modelling code |

Official vLLM LoRA flags: Microsoft's published vLLM command includes explicit LoRA adapter flags to activate the rank-320 vision and speech adapters stored in separate subfolders of the model directory:

--enable-lora \
--max-lora-rank 320 \
--lora-extra-vocab-size 0 \
--max-loras 2 \
--lora-modules speech=~/phi4-mm-hf/speech-lora vision=~/phi4-mm-hf/vision-lora

If you experience degraded vision or audio quality, add these flags to the launch command above.

Wait for the server to finish loading (~60 s):

curl http://localhost:8080/health   # → {"status":"ok"}

3 β€” Send requests (OpenAI-compatible API)

Text (Python):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Image (Python):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

Image (curl, base64):

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Audio β€” speech transcription / understanding (Python):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

phi4-mm uses a custom conformer-based audio encoder with a rank-320 speech LoRA; no separate ASR model is needed. Supported formats: wav, mp3, ogg, flac.

Tool calling

phi4-mm emits tool calls as functools[{"name":"...","arguments":{...}}]. The --tool-call-parser phi4_mini_json flag (vLLM 0.7+) handles this format automatically. For a complete chat template that injects tools into phi4-mm's native <|tool|>...<|/tool|> block, see deploy/wsl-vllm/phi4-mm-tool-template.jinja in the companion repo.
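If you post-process raw completions yourself instead of relying on the parser flag, the functools[...] format can be unpacked with a few lines of Python. This is a minimal sketch, not vLLM's own parser:

```python
import json

def parse_phi4_tool_calls(text: str):
    """Parse phi4-mm raw output of the form
    functools[{"name": ..., "arguments": {...}}, ...]
    into a list of call dicts; returns [] for plain-text replies."""
    s = text.strip()
    if not s.startswith("functools"):
        return []  # ordinary text reply, not a tool call
    try:
        calls = json.loads(s[len("functools"):])
    except json.JSONDecodeError:
        return []  # malformed payload
    return calls if isinstance(calls, list) else []

raw = 'functools[{"name": "get_weather", "arguments": {"city": "Paris"}}]'
print(parse_phi4_tool_calls(raw))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
print(parse_phi4_tool_calls("The capital of France is Paris."))  # []
```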


Ollama (CPU / NUC / edge)

See the deploy/ folder for a complete Modelfile, NUC install script, and OpenClaw integration config.

FROM ./phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7

Architecture

| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| GGUF arch | phi3 |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech |

The vision encoder (mmproj-phi4-mm-f16.gguf) is a SigLIP-400M encoder finetuned with LLM2CLIP, with a 2-layer MLP projector. Audio/speech is not embedded in the GGUF; see the audio limitation callout above.


Conversion Details

| Item | Value |
|---|---|
| Converter | llama.cpp convert_hf_to_gguf.py (text) + custom mmproj converter |
| llama.cpp build | b8347 / fc350fdf9 |
| Source | microsoft/Phi-4-multimodal-instruct |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |

Quantization commands:

# Q8_0
llama-quantize phi4-mm-f16.gguf phi4-mm-Q8_0.gguf Q8_0

# Q4_K_M
llama-quantize phi4-mm-f16.gguf phi4-mm-Q4_K_M.gguf Q4_K_M

NUC / Edge Deployment

The Q4_K_M + mmproj-phi4-mm-f16.gguf combination (~3,500 MiB VRAM) fits on:

  • Intel NUC 13/14 Pro (Intel Arc iGPU, 4-8 GB shared VRAM)
  • Systems with 8 GB unified memory (Apple Silicon M-series, etc.)

See deploy/ for install scripts and configuration.

