Phi-4-multimodal-instruct GGUF Quantizations

GGUF quantizations of microsoft/Phi-4-multimodal-instruct, a 5.6 B-parameter multimodal model by Microsoft supporting text, vision (images), and audio inputs.

Produced with llama.cpp build b8347 on an RTX 5090.

License: MIT, © Microsoft Corporation. These quantizations carry the same MIT license as the original model.


Available Files

| File | Quant | Size | BPW | Best for |
|---|---|---|---|---|
| phi4-mm-f16.gguf | F16 | 7.17 GB | 16.0 | Re-quantization base, maximum quality |
| phi4-mm-Q8_0.gguf | Q8_0 | 3.90 GB | 8.0 | High-end GPU; near-lossless |
| phi4-mm-Q4_K_M.gguf | Q4_K_M | 2.37 GB | 5.18 | CPU / constrained VRAM; good quality |
| mmproj-phi4-mm-f16.gguf | F16 | 825 MB | 16.0 | Vision encoder; required for image input |

One mmproj for all: mmproj-phi4-mm-f16.gguf works with every text GGUF above. It cannot be quantized further: the vision encoder's feed-forward layer dimension (4304) is not divisible by 32, the block size GGUF quantization requires.
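The divisibility constraint is easy to verify: GGUF block-quantization formats pack weights in fixed-size blocks (32 elements for Q8_0), so every tensor row length must be a multiple of the block size. A quick sanity check:

```python
# Sanity check: GGUF block quantization packs weights in fixed-size blocks
# (32 elements for Q8_0), so every tensor row length must divide evenly.
BLOCK_SIZE = 32
ffn_dim = 4304  # feed-forward dimension of this mmproj's vision encoder

remainder = ffn_dim % BLOCK_SIZE
print(f"{ffn_dim} % {BLOCK_SIZE} = {remainder}")  # 16 -> leftover elements
print("quantizable:", remainder == 0)             # False
```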


VRAM Requirements

Full GPU offload (-ngl 99):

| Configuration | VRAM |
|---|---|
| F16 + mmproj-F16 | ~10,000 MiB |
| Q8_0 + mmproj-F16 | ~5,400 MiB |
| Q4_K_M + mmproj-F16 | ~3,500 MiB |

Quality Metrics

Perplexity (wikitext-2-raw test set, context 512)

| Model | PPL | vs F16 |
|---|---|---|
| F16 (baseline) | 14.9338 ± 0.107 | – |
| Q8_0 | 14.9107 ± 0.106 | −0.15% ✅ lossless |
| Q4_K_M | 16.3183 ± 0.121 | +9.3% |
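The "vs F16" column is the relative change in perplexity against the F16 baseline, reproducible directly from the table values:

```python
# Reproduce the "vs F16" column: relative perplexity change vs. the baseline.
baseline = 14.9338  # F16 PPL
quants = {"Q8_0": 14.9107, "Q4_K_M": 16.3183}

for name, ppl in quants.items():
    delta = (ppl - baseline) / baseline * 100
    print(f"{name}: {delta:+.2f}%")
# Q8_0: -0.15%, Q4_K_M: +9.27% (rounded to +9.3% in the table)
```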

Throughput (llama-bench, RTX 5090, pp512 / tg128)

| Model | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| Q8_0 | 21,352 | 247 |
| Q4_K_M | 19,904 | 324 |
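As a back-of-envelope exercise, these rates translate into wall-clock latency via time = tokens ÷ throughput. A sketch using the table values and an illustrative request size:

```python
# Convert llama-bench throughput into approximate request latency:
# latency = prompt_tokens / prefill_rate + gen_tokens / decode_rate.
rates = {"Q8_0": (21352, 247), "Q4_K_M": (19904, 324)}  # (pp t/s, tg t/s)
prompt_tokens, gen_tokens = 2048, 256  # an illustrative request

for name, (pp, tg) in rates.items():
    total = prompt_tokens / pp + gen_tokens / tg
    print(f"{name}: ~{total:.2f} s "
          f"(prefill {prompt_tokens / pp * 1000:.0f} ms, "
          f"decode {gen_tokens / tg:.2f} s)")
```

Note that decode speed dominates end-to-end latency here, which is why Q4_K_M (faster generation, slower prefill) wins on total time for generation-heavy workloads.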

Multimodal Benchmarks (lmms-eval)

VQA evaluation is in progress; results will be added when complete. Suite: MMStar, OCRBench, AI2D, MathVista, HallusionBench.


Usage

LM Studio (Recommended for Desktop)

LM Studio has native GGUF support including multimodal vision. No command-line needed.

Text-only:

  1. Open LM Studio → search for Swicked86/phi4-mm-gguf
  2. Download phi4-mm-Q8_0.gguf (GPU) or phi4-mm-Q4_K_M.gguf (CPU / low VRAM)
  3. Load the model → Chat

With vision (image input):

  1. Download both phi4-mm-Q8_0.gguf and mmproj-phi4-mm-f16.gguf
  2. Load the main model in LM Studio
  3. In Model Settings → Multimodal → Vision Model (mmproj), browse to mmproj-phi4-mm-f16.gguf
  4. In Chat, click the image icon to attach a photo and ask questions about it

The mmproj file is the vision encoder. Without it the model runs text-only.
mmproj-phi4-mm-f16.gguf is compatible with all three text GGUFs.


llama.cpp CLI

Step 1 β€” Download the files

# Install huggingface-cli if needed
pip install huggingface_hub

# Text + vision (recommended)
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q8_0.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

# CPU / low-VRAM variant
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q4_K_M.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

Build llama.cpp if you haven't already:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j$(nproc)

Step 2 β€” Interactive multimodal chat (images + text)

llama-mtmd-cli launches an interactive session. Type your prompt, or prefix it with an image path using /image:

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 --ctx-size 8192

Inside the session:

> /image photo.jpg
Image loaded.
> What is in this image?
[model describes the image]

> /image chart.png
Image loaded.
> Summarise the trend shown in this chart.
[model analyses the chart]

> Explain the previous image again but in French.
[responds without re-loading the image]

Single-shot (non-interactive):

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "Describe this image in detail." \
  -ngl 99 --threads 16 --no-display-prompt

CPU (no GPU):

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q4_K_M.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "What objects are in this photo?" \
  --threads 8

Text-only (no image)

./build/bin/llama-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --ctx-size 65536 --flash-attn on -ngl 99 --threads 16 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>"

llama-server β€” OpenAI-compatible API

Serves both text and vision via /v1/chat/completions. Useful for integrations (Open WebUI, SillyTavern, Continue.dev, etc.):

./build/bin/llama-server \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 \
  --ctx-size 8192 --parallel 4 \
  --port 8080 --host 127.0.0.1

Text query (curl):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4-mm",
    "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
    "max_tokens": 300
  }'

Image query (curl, base64):

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Image query (Python, openai SDK):

import base64, httpx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)

Modality support summary for GGUF (llama.cpp / LM Studio)

| Modality | Supported | Notes |
|---|---|---|
| Text | ✅ | All three GGUFs |
| Vision (images) | ✅ | Requires mmproj-phi4-mm-f16.gguf + --mmproj flag |
| Audio / Speech | ❌ | Not available; see below |

Audio is not supported in the GGUF files. phi4-mm's speech capability uses a custom conformer-based audio encoder (24 conformer blocks, initialized from a proprietary AED ASR model) plus a rank-320 speech LoRA applied to the language decoder. The GGUF conversion pipeline (convert_hf_to_gguf.py) exports only the text transformer and the SigLIP vision encoder (mmproj); the audio encoder tensors are not extracted. There is currently no audioproj equivalent in llama.cpp for phi4-mm.
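One way to confirm this yourself is to list the tensor names inside a GGUF (for example with the gguf Python package's GGUFReader) and bucket them by prefix. The prefixes below follow llama.cpp's usual naming conventions, and the sample list is illustrative rather than read from the actual file:

```python
# Bucket GGUF tensor names by modality. Prefixes follow llama.cpp naming
# conventions: "blk.N.*" / "token_embd*" / "output*" for the text model,
# "v.*" / "mm.*" for the vision mmproj. The sample names are illustrative.
def modalities(tensor_names):
    found = set()
    for name in tensor_names:
        if name.startswith(("v.", "mm.")):
            found.add("vision")
        elif name.startswith(("a.", "audio.")):
            found.add("audio")
        else:
            found.add("text")
    return found

# Sample names in the style of the text GGUF; a real list would come from
# iterating over gguf.GGUFReader(path).tensors.
sample = ["token_embd.weight", "blk.0.attn_qkv.weight", "output_norm.weight"]
print(modalities(sample))  # {'text'} -- no audio encoder tensors present
```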

To use audio/speech transcription, use the vLLM path below with the original bf16 safetensors model.


vLLM β€” Full Multimodal (Image + Audio, bf16 Safetensors)

Use vLLM when you want maximum quality and full multimodal support (images + audio) from the original bf16 safetensors weights. For constrained hardware, use the GGUF options above instead.

Requirements: Python 3.10+, CUDA GPU with ~16 GB VRAM, vLLM 0.7.0+

1 β€” Install vLLM and download the model

python3 -m venv ~/.vllm-env
source ~/.vllm-env/bin/activate
pip install --upgrade pip
pip install vllm

# Download the original safetensors model (~14 GB, 3 shards)
huggingface-cli login          # paste your HF token if the model is gated
huggingface-cli download microsoft/Phi-4-multimodal-instruct \
    --local-dir ~/phi4-mm-hf

2 β€” Launch the vLLM server

source ~/.vllm-env/bin/activate
python -m vllm.entrypoints.openai.api_server \
  --model ~/phi4-mm-hf \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --kv-cache-memory-bytes 8G \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

| Flag | Value | Notes |
|---|---|---|
| --dtype | bfloat16 | Native dtype; do not change to float16 |
| --max-model-len | 65520 | Stable context ceiling for phi4-mm (131 K nominal) |
| --kv-cache-memory-bytes | 8G | Tune down to 4G on 12-16 GB GPUs |
| --limit-mm-per-prompt | {"image":3,"audio":3} | Max attachments per request |
| --tool-call-parser | phi4_mini_json | phi4-mm emits functools[...], not Hermes format; required for tool calling |
| --trust-remote-code | – | Required for phi4-mm's custom modelling code |

Official vLLM LoRA flags: Microsoft's published vLLM command includes explicit LoRA adapter flags to activate the rank-320 vision and speech adapters stored in separate subfolders of the model directory:

--enable-lora \
--max-lora-rank 320 \
--lora-extra-vocab-size 0 \
--max-loras 2 \
--lora-modules speech=~/phi4-mm-hf/speech-lora vision=~/phi4-mm-hf/vision-lora

If you experience degraded vision or audio quality, add these flags to the launch command above.

Wait for the server to finish loading (~60 s):

curl http://localhost:8080/health   # → {"status":"ok"}

3 β€” Send requests (OpenAI-compatible API)

Text (Python):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Image (Python):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

Image (curl, base64):

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Audio β€” speech transcription / understanding (Python):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

phi4-mm uses a custom conformer-based audio encoder with a rank-320 speech LoRA; no separate ASR model is needed. Supported formats: wav, mp3, ogg, flac.

Tool calling

phi4-mm emits tool calls as functools[{"name":"...","arguments":{...}}]. The --tool-call-parser phi4_mini_json flag (vLLM 0.7+) handles this format automatically. For a complete chat template that injects tools into phi4-mm's native <|tool|>...<|/tool|> block, see deploy/wsl-vllm/phi4-mm-tool-template.jinja in the companion repo.
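If you post-process raw completions yourself instead of relying on the parser flag, the functools[...] format can be unpacked with a few lines of Python. This is a minimal sketch, not vLLM's own parser:

```python
import json

def parse_phi4_tool_calls(text: str):
    """Parse phi4-mm raw output of the form
    functools[{"name": ..., "arguments": {...}}, ...]
    into a list of call dicts; returns [] for plain-text replies."""
    s = text.strip()
    if not s.startswith("functools"):
        return []  # ordinary text reply, not a tool call
    try:
        calls = json.loads(s[len("functools"):])
    except json.JSONDecodeError:
        return []  # malformed payload
    return calls if isinstance(calls, list) else []

raw = 'functools[{"name": "get_weather", "arguments": {"city": "Paris"}}]'
print(parse_phi4_tool_calls(raw))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
print(parse_phi4_tool_calls("The capital of France is Paris."))  # []
```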


Ollama (CPU / NUC / edge)

See the deploy/ folder for a complete Modelfile, NUC install script, and OpenClaw integration config.

FROM ./phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7

Architecture

| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| GGUF arch | phi3 |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech |

The vision encoder (mmproj-phi4-mm-f16.gguf) is a SigLIP-400M encoder finetuned with LLM2CLIP, with a 2-layer MLP projector. Audio/speech is not embedded in the GGUF; see the audio limitation callout above.


Conversion Details

| Item | Value |
|---|---|
| Converter | llama.cpp convert_hf_to_gguf.py (text) + custom mmproj converter |
| llama.cpp build | b8347 / fc350fdf9 |
| Source | microsoft/Phi-4-multimodal-instruct |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |

Quantization commands:

# Q8_0
llama-quantize phi4-mm-f16.gguf phi4-mm-Q8_0.gguf Q8_0

# Q4_K_M
llama-quantize phi4-mm-f16.gguf phi4-mm-Q4_K_M.gguf Q4_K_M

NUC / Edge Deployment

The Q4_K_M + mmproj-phi4-mm-f16.gguf combination (~3,500 MiB VRAM) fits on:

  • Intel NUC 13/14 Pro (Intel Arc iGPU, 4-8 GB shared VRAM)
  • Systems with 8 GB unified memory (Apple Silicon M-series, etc.)

See deploy/ for install scripts and configuration.

