mmproj file?

by dolphinfan - opened Dec 9, 2025

Discussion

dolphinfan

Dec 9, 2025

Will there be mmproj files with these GLM 4.6V releases? Thanks

bartowski

Owner Dec 9, 2025

I'll try again, but no, it didn't manage to generate for some reason.

MrDevolver

Dec 9, 2025

A vision model without vision support is like a pirate without a parrot.

bartowski

Owner Dec 9, 2025

Yeah I was surprised to see it convert at all but then not having vision, I assume there was something complicated, I'll dig around if I remember later

jacek2024

Dec 9, 2025

This model is currently supported as text-only.
Details: https://github.com/ggml-org/llama.cpp/pull/14823
Pull request for vision: https://github.com/ggml-org/llama.cpp/pull/16600

dolphinfan

Dec 9, 2025

> Yeah I was surprised to see it convert at all but then not having vision, I assume there was something complicated, I'll dig around if I remember later

Hey thanks for trying. Hopefully they fulfill the pull request for vision.

dolphinfan

Dec 16, 2025

Thanks for jumping on this so fast! I take it that you'll be doing the 4.6V large model, too? You're the best!

dolphinfan

Dec 16, 2025

Also, anyone else having trouble getting this to load with mmproj files? I can load the bare model without the mmproj files in TextGenWebUI/Oobabooga. But I've tried the bartowski BF16 mmproj as well as the ggml Q8 mmproj, and neither would allow the model to load.

???

knarp

Dec 17, 2025

> Also, anyone else having trouble getting this to load with mmproj files? I can load the bare model without the mmproj files in TextGenWebUI/Oobabooga. But I've tried the bartowski BF16 mmproj as well as the ggml Q8 mmproj, and neither would allow the model to load.
>
> ???

Your LLM client needs at least this backend build: llama.cpp b7429 https://github.com/ggml-org/llama.cpp/releases/tag/b7429
I don't know if you can manually update llama.cpp in Oobabooga; otherwise, you'll need to wait until it's updated.
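
A minimal sketch of the minimum-build check described above, assuming release tags have the form `b` followed by an integer (as in the `b7429` tag linked above). The helper names `parse_build` and `build_at_least` are hypothetical and not part of llama.cpp or Oobabooga; how you obtain your client's build tag is left out because it depends on the client.

```python
import re

# Minimum build taken from the release tag linked above (b7429).
MIN_BUILD = 7429


def parse_build(tag: str) -> int:
    """Extract the integer from a tag such as 'b7429' (assumed format)."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized llama.cpp build tag: {tag!r}")
    return int(match.group(1))


def build_at_least(tag: str, minimum: int = MIN_BUILD) -> bool:
    """Return True when the tagged build is at or above the minimum."""
    return parse_build(tag) >= minimum


if __name__ == "__main__":
    # Illustrative tags only; substitute the tag of your own backend build.
    for tag in ("b7400", "b7429", "b7500"):
        print(tag, build_at_least(tag))
```

Comparing the parsed integers rather than the raw strings avoids mistakes when build numbers gain a digit.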

dolphinfan

Dec 17, 2025 (edited Dec 17, 2025)

> > Also, anyone else having trouble getting this to load with mmproj files? I can load the bare model without the mmproj files in TextGenWebUI/Oobabooga. But I've tried the bartowski BF16 mmproj as well as the ggml Q8 mmproj, and neither would allow the model to load.
> >
> > ???
>
> Your LLM client needs at least this backend build: llama.cpp b7429 https://github.com/ggml-org/llama.cpp/releases/tag/b7429
> I don't know if you can manually update llama.cpp in Oobabooga; otherwise, you'll need to wait until it's updated.

Thanks! The latest Oobabooga was updated two days ago. So it didn't include these most recent changes.

So I went to the link you posted and downloaded 'llama-b7429-bin-win-cuda-12.4-x64.zip' and 'cudart-llama-bin-win-cuda-12.4-x64.zip'. Then I extracted their contents and copied all the files into the Oobabooga llama.cpp bin folder, which manually updates the llama.cpp build that Oobabooga uses. And yes, the model instantly loaded with the mmproj file and vision capability 😎

Thanks for your help!
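
The manual update described above (extract both archives, copy everything into the llama.cpp bin folder) could be scripted roughly as follows. This is only a sketch: the function name `extract_archives` is hypothetical, the thread does not give the location of the Oobabooga llama.cpp bin folder so the target path in the usage comment is a placeholder, and it assumes the archives keep their files at the top level. Back up the folder first, since existing files are overwritten.

```python
import zipfile
from pathlib import Path


def extract_archives(archives, target_dir):
    """Extract every member of each zip into target_dir.

    Existing files with the same names are overwritten.
    Returns the list of member names that were extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(zf.namelist())
    return extracted


# Example usage (archive names from the post above; the target path is a
# placeholder that you must replace with your own installation's folder):
#
# extract_archives(
#     [
#         "llama-b7429-bin-win-cuda-12.4-x64.zip",
#         "cudart-llama-bin-win-cuda-12.4-x64.zip",
#     ],
#     r"C:\path\to\oobabooga\llama.cpp\bin",
# )
```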

bartowski

Owner Dec 17, 2025

If anyone was using the bf16 mmproj, there was apparently a bug with it, resolved on master here:

https://github.com/ggml-org/llama.cpp/pull/18124

I pushed the new mmproj files here and to GLM-4.6V.

dolphinfan

Dec 18, 2025

Yeah I learned that the hard way. The BF16 mmproj wouldn't load for me. But I already had their Q8 mmproj, and I also grabbed your F16 mmproj just to see if there was a difference. I think yours was a little "smarter," but theirs was a little more human-like. Idk, maybe it's just me tho? Anyway, it works! Thanks again for being so on top of this. It's a great model.
