smol-IQ2_KL bench, GPUs

by curiouspp8 - opened Apr 22

Discussion

curiouspp8

Apr 22

First impressions: great quant. As a smoke test, I gave it an annoying bug to fix. The quality and number of iterations were very sane, and it fixed the bug in one shot. Warrants more testing against GLM 5.1.

Full VRAM offloading:

| Prefilled | PP@4096 (tok/s) | TG@512 (tok/s) |
| --------- | --------------- | -------------- |
|         0 |          1845.3 |          44.42 |
|        4K |          1654.5 |          41.81 |
|       16K |          1466.2 |          38.68 |
|       32K |          1183.7 |          34.76 |
|       64K |           866.6 |          28.14 |

| Prefilled | TTFR (ms) |
| --------- | --------- |
|         0 |      2266 |
|        4K |      5006 |
|       16K |     14211 |
|       32K |     31714 |
|       64K |     81801 |


## TG Peak (burst throughput)

| Prefilled | TG peak |
| --------- | ------- |
|         0 |   48.00 |
|        4K |   45.00 |
|       16K |   42.00 |
|       32K |   37.00 |
|       64K |   31.00 |
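
A rough consistency check on the numbers above, as a minimal sketch. It assumes (the post does not say so) that PP is in tokens/s, that TTFR is in milliseconds, and that TTFR covers the prefilled depth plus the 4096-token prompt processed at the PP speed measured for that depth.

```python
# Assumptions, not stated in the post: PP is tokens/s, TTFR is milliseconds,
# and TTFR ~ (prefilled depth + 4096-token prompt) / PP speed.

PROMPT_TOKENS = 4096  # taken from the "PP@4096" column header

# prefilled depth (tokens) -> (PP speed, reported TTFR), copied from the table
rows = {
    0: (1845.3, 2266),
    4096: (1654.5, 5006),
    16384: (1466.2, 14211),
    32768: (1183.7, 31714),
    65536: (866.6, 81801),
}


def estimated_ttfr_ms(depth, pp_tok_s):
    """Estimate time to first response from prompt-processing speed."""
    return (depth + PROMPT_TOKENS) / pp_tok_s * 1000.0


for depth, (pp, reported) in rows.items():
    est = estimated_ttfr_ms(depth, pp)
    print(f"depth={depth:>6}  est={est:8.0f} ms  reported={reported:6d}  "
          f"ratio={reported / est:.3f}")
```

Under those assumptions the reported TTFR values land within a few percent of the estimate at every depth, which suggests the two halves of the table are measuring the same thing from different angles.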

ubergarm

Owner Apr 22

Yes, I've been keeping Kimi-K2.6 loaded up now instead of GLM-5.1 for the "heavy lifter".

Though I def wanna check out Qwen3.6 for the "small fast one" so to speak hah..

curiouspp8

Apr 22

I know you really like the smaller ~100B Qwen MoE, but once I experienced MiniMax and higher, I just couldn't make any of the Qwen models work for me, no matter the size. And of course MiniMax feels stupid after GLM/Kimi. But it's very fast stupid :) Kimi K2.6 looks very promising. Downloading higher quants now to see if there's any difference for real work. Will be able to run it with proper TP in a week or so.

I heard an opinion that native INT4 models lose much more after quantization than traditional bf16 models do going to q8/q4. Curious what's your opinion. Based on the tests you do, what quant of INT4 seems to be the sweet spot?

ubergarm

Owner Apr 22

> And of course MiniMax feels stupid after GLM/Kimi. But it's very fast stupid :)

haha, yeah I feel that way about MiniMax-M2.7 even at large quantization size, but to be nice it is only A10B so indeed very fast.

> I heard an opinion that native INT4 models lose much more after quantization than traditional bf16 models do going to q8/q4. Curious what's your opinion.

I've discussed it some already on various posts e.g.

Basically the original model is released using compressed-tensors style int4 for the routed experts and bf16 for the rest: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/config.json#L79-L111

They don't release a "full bf16" version, only this pre-quantized QAT version with symmetric int4, based on previous discussions with them and jukofyork.

So juk created a patch, now built into ik_llama.cpp, which is only needed by quantizers. Users don't need the patch. Here is some recent info on that, in addition to the link in the model card:

https://github.com/ikawrakow/ik_llama.cpp/pull/1677

I believe that my Q4_X and @AesSedai's Q4_X, which are basically identical, are the best way to go for "local" inference. While yes, you could leave those extra 10GB in bf16, keep in mind that for an A32B model that is a lot more active weights dragging your TG speed way down on every token.
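
To put a rough number on that TG argument, here is a back-of-the-envelope sketch. It assumes token generation is memory-bandwidth bound, so the ceiling on tokens/s is bandwidth divided by the bytes of active weights read per token. The bandwidth figure and the 18 GB baseline (about 32B active parameters at about 4.5 bits each) are hypothetical placeholders, and treating the full extra 10GB as read on every token follows the post's framing rather than a measurement.

```python
def tg_upper_bound(bandwidth_gb_s, active_gb):
    """Ceiling on tokens/s if every active weight byte is read once per token."""
    return bandwidth_gb_s / active_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical aggregate memory bandwidth
BASE_ACTIVE_GB = 18.0    # hypothetical: ~32e9 active params * 4.5 bits / 8
EXTRA_BF16_GB = 10.0     # the "extra 10GB" from the post, assumed read per token

base = tg_upper_bound(BANDWIDTH_GB_S, BASE_ACTIVE_GB)
heavier = tg_upper_bound(BANDWIDTH_GB_S, BASE_ACTIVE_GB + EXTRA_BF16_GB)

print(f"baseline ceiling:      {base:.1f} tok/s")
print(f"with extra bf16 bytes: {heavier:.1f} tok/s")
print(f"relative TG speed:     {heavier / base:.2f}x")
```

The absolute numbers depend entirely on the placeholders. The ratio is the useful part: it depends only on how the extra bytes compare with the baseline active bytes, not on the bandwidth.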

> Based on the tests you do, what quant of INT4 seems to be the sweet spot?

If you can fit the Q4_X you don't need anything bigger imo unless you really want to wait a long time for TG.

If you can't fit Q4_X, get the next largest quant you can fit, e.g. my IQ3_K, which preserves int4 for the ffn_down_exps and otherwise quantizes as little as possible, as can be seen in the "secret recipe".

curiouspp8

Apr 22

Thank you for the detailed reply. With fp16 it feels like anything around Q5-Q8 is basically identical to the original, and we save an insane amount of resources running those. That is the context for the INT4 question. E.g. Q4_X seems to offer no savings and IQ3_K about ~15%, while with traditional models we get ~50-70%. Just reflecting out loud.
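
The "~50-70%" figure for traditional models can be reproduced with a small sketch. The bits-per-weight values are for the plain Q8_0 and Q4_0 block formats (32 weights plus one fp16 scale per block); real mixed recipes differ, so treat these as illustrative assumptions rather than the sizes of any specific quant in this repo.

```python
def savings(baseline_bpw, quant_bpw):
    """Fraction of size saved when going from baseline to quantized weights."""
    return 1.0 - quant_bpw / baseline_bpw


BF16_BPW = 16.0
Q8_0_BPW = 8.5  # 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
Q4_0_BPW = 4.5  # 32 4-bit weights + one fp16 scale = 18 bytes per 32 weights

for name, bpw in [("Q8_0", Q8_0_BPW), ("Q4_0", Q4_0_BPW)]:
    print(f"bf16 -> {name}: {savings(BF16_BPW, bpw):.1%} smaller")
```

The same function accepts file sizes in place of bits per weight, so the native-INT4 comparison can be checked by passing the original release size and the quant size. Those sizes are not given in this thread, so they are left out here.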

curiouspp8

Apr 22

>> And of course minimax feels stupid after GLM/Kimi. But it's very fast stupid :)
>
> haha, yeah I feel that way about MiniMax-M2.7 even at large quantization size, but to be nice it is only A10B so indeed very fast.

Interestingly, with full GPU offload via ik_llama, Kimi K2.6 actually is "almost" a fast model. I'd imagine it would totally crush it with TP. ~30-40 tps with 1k pp is a very usable setup.
