smol-IQ2_KL bench, GPUs
First impressions: great quant. As a smoke test, I gave it an annoying bug to fix. The quality and number of iterations were very sane, and it fixed the bug in one shot. This warrants more testing against GLM 5.1.
Full VRAM offloading:
| Prefilled | PP@4096 (t/s) | TG@512 (t/s) |
| --------- | ------------- | ------------ |
| 0 | 1845.3 | 44.42 |
| 4K | 1654.5 | 41.81 |
| 16K | 1466.2 | 38.68 |
| 32K | 1183.7 | 34.76 |
| 64K | 866.6 | 28.14 |

| Prefilled | TTFR (ms) |
| --------- | --------- |
| 0 | 2266 |
| 4K | 5006 |
| 16K | 14211 |
| 32K | 31714 |
| 64K | 81801 |
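As a rough sanity check on the numbers above, TTFR can be related to PP speed. The sketch below assumes (this is not stated in the post) that TTFR is in milliseconds and is roughly the time to process the prefilled context plus a 4096-token prompt at the reported PP rate, and that "4K" means 4096 tokens, and so on. The function name `estimate_ttfr_ms` is made up for illustration.

```python
# Assumption (not stated in the post): TTFR is in milliseconds and covers
# processing the prefilled context plus a 4096-token prompt at the PP rate.
PROMPT_TOKENS = 4096


def estimate_ttfr_ms(prefilled_tokens: int, pp_tokens_per_s: float) -> float:
    """Estimate time to first response in ms from prompt-processing speed."""
    return (prefilled_tokens + PROMPT_TOKENS) / pp_tokens_per_s * 1000.0


# (prefilled tokens, PP@4096 in tokens/s, reported TTFR), copied from the table.
# "4K" is assumed to mean 4096 tokens, "16K" 16384, and so on.
MEASUREMENTS = [
    (0, 1845.3, 2266),
    (4096, 1654.5, 5006),
    (16384, 1466.2, 14211),
    (32768, 1183.7, 31714),
    (65536, 866.6, 81801),
]

for prefilled, pp, reported in MEASUREMENTS:
    estimate = estimate_ttfr_ms(prefilled, pp)
    error_pct = (estimate - reported) / reported * 100.0
    print(f"{prefilled:>6} prefilled: ~{estimate:.0f} ms estimated, "
          f"{reported} ms reported ({error_pct:+.1f}%)")
```

Under these assumptions the estimates land within a few percent of the reported TTFR values, which suggests the TTFR rows are mostly prompt-processing time.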
## TG Peak (burst throughput)
48.00 45.00 42.00 37.00 31.00
Yes, I've been keeping Kimi-K2.6 loaded up now instead of GLM-5.1 for the "heavy lifter".
Though I def wanna check out Qwen3.6 for the "small fast one" so to speak hah..
I know you really like the smaller ~100b Qwen MoE, but once I experienced MiniMax and higher, I just couldn't make any of the Qwen models work for me, no matter the size. And of course MiniMax feels stupid after GLM/Kimi. But it's very fast stupid :) Kimi K2.6 looks very promising. Downloading higher quants now to see if there is any difference for real work. I will be able to run it with proper TP in a week or so. I heard an opinion that native INT4 models lose much more after quantization than traditional bf16 models do when going to q8/q4. Curious what your opinion is. Based on the tests you do, what quant of INT4 seems to be the sweet spot?
> And of course minimax feels stupid after GLM/Kimi. But it's very fast stupid :)
haha, yeah I feel that way about MiniMax-M2.7 even at large quantization size, but to be nice it is only A10B so indeed very fast.
> I heard an opinion that the native INT4 models lose much more after quantization than when traditional b16 to q8/q4. Curious what's your opinion.
I've discussed it some already on various posts e.g.
- https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/6#69e8f97efe45c6c6aa847275
- https://huggingface.co/unsloth/Kimi-K2.6-GGUF/discussions/2#69e8f778f65ab0602fd41816
Basically, the original model is released using compressed-tensors style int4 for the routed experts and bf16 for the rest: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/config.json#L79-L111
They don't release a "full bf16" version, only this pre-quantized QAT version with symmetric int4, going by previous discussions with them and jukofyork.
So juk created a patch, now built into ik_llama.cpp, which is only needed by the quantizers. Users don't need the patch. Here is some recent info on that, as well as the link in the model card.
I believe that my Q4_X and @AesSedai's Q4_X, which are basically identical, are the best way to go for "local" inference. While yes, you could leave those extra 10GB in bf16, keep in mind that for an A32B model that is a lot more active weights dragging your TG speed way down on every token.
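To put a rough number on that TG argument, here is a minimal back-of-the-envelope sketch. It assumes token generation is memory-bandwidth bound, so every active weight is read once per token, and that the extra 10GB of bf16 weights are all read on every token. The 4.5 bits-per-weight average and the 1000 GB/s bandwidth are illustrative placeholders, not measurements from this thread.

```python
def tg_upper_bound(active_gb_per_token: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on tokens/s: all active weights read once per token."""
    return bandwidth_gb_s / active_gb_per_token


ACTIVE_PARAMS_B = 32.0   # "A32B" from the post: ~32B active parameters
BASE_BPW = 4.5           # assumed average bits per weight for a ~4-bit quant
EXTRA_BF16_GB = 10.0     # the "extra 10GB", assumed to be read on every token
BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, not a measured value

# Gigabytes of weights touched per generated token in each configuration.
base_gb = ACTIVE_PARAMS_B * BASE_BPW / 8
with_bf16_gb = base_gb + EXTRA_BF16_GB

base_tps = tg_upper_bound(base_gb, BANDWIDTH_GB_S)
bf16_tps = tg_upper_bound(with_bf16_gb, BANDWIDTH_GB_S)
print(f"~4-bit everywhere: {base_gb:.1f} GB/token, ceiling {base_tps:.1f} t/s")
print(f"extra bf16 kept:   {with_bf16_gb:.1f} GB/token, ceiling {bf16_tps:.1f} t/s")
print(f"relative TG speed: {bf16_tps / base_tps:.0%}")
```

Real TG speed will be lower than these ceilings, but the ratio between the two cases is the point: extra always-active bytes cost speed on every single token.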
> Based on the tests you do, what quant of INT4 seems to be the sweet spot?
If you can fit the Q4_X you don't need anything bigger imo unless you really want to wait a long time for TG.
- If you can't fit Q4_X, get the next largest quant you can fit, e.g. my `IQ3_K`, which preserves `int4` for the `ffn_down_exps` and otherwise quantizes as little as possible, as can be seen in the "secret recipe".
Thank you for the detailed reply. With fp16 models it feels like anything around Q5-Q8 is basically identical to the original, and we save an insane amount of resources running them. This is the context for the INT4 question. E.g. Q4_X seems to offer no savings and IQ3_K offers only about 15%, while with traditional models we get ~50-70%. Just reflecting out loud.
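The savings comparison can be sketched with bits per weight (bpw). This is a simplification that treats the whole model as one uniform quant type. Q8_0 at 8.5 bpw and Q4_0 at 4.5 bpw follow from their block layout (32 weights plus one fp16 scale). The 4.5 bpw figure for the native int4 source and the 3.8 bpw mixed recipe are assumptions picked for illustration, not numbers from the model card.

```python
def size_savings(source_bpw: float, target_bpw: float) -> float:
    """Fraction of size saved when going from source to target bits per weight."""
    return 1.0 - target_bpw / source_bpw


BF16_BPW = 16.0
Q8_0_BPW = 8.5              # 32 int8 weights + one fp16 scale per block
Q4_0_BPW = 4.5              # 32 4-bit weights + one fp16 scale per block
INT4_SOURCE_BPW = 4.5       # assumption: native int4 including its scales
HYPOTHETICAL_MIX_BPW = 3.8  # assumption: average bpw of a mixed ~3-bit recipe

cases = [
    ("bf16 -> Q8_0", BF16_BPW, Q8_0_BPW),
    ("bf16 -> Q4_0", BF16_BPW, Q4_0_BPW),
    ("int4 -> 4-bit quant", INT4_SOURCE_BPW, Q4_0_BPW),
    ("int4 -> mixed ~3-bit", INT4_SOURCE_BPW, HYPOTHETICAL_MIX_BPW),
]
for label, source, target in cases:
    print(f"{label:<22} saves {size_savings(source, target):.0%}")
```

Seen this way, the small savings are not a weakness of the quants: a native int4 release has already taken most of the reduction that a bf16 release leaves on the table.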
> > And of course minimax feels stupid after GLM/Kimi. But it's very fast stupid :)
>
> haha, yeah I feel that way about MiniMax-M2.7 even at large quantization size, but to be nice it is only A10B so indeed very fast.
Interestingly, with full GPU offload via ik_llama.cpp, Kimi-K2.6 actually is "almost" a fast model. I'd imagine it would totally crush it with TP. ~30-40 t/s TG with ~1k t/s PP is a very usable setup.