smol-IQ2_KL bench, GPUs

#5
by curiouspp8 - opened

First impressions: great quant. As a smoke test, I gave it an annoying bug to fix. The quality and number of iterations were very sane, and it fixed the bug in one shot. This warrants more testing against GLM 5.1.

Full VRAM offloading:

| Prefilled | PP@4096 | TG@512 |
| --------- | ------- | ------ |
|         0 |  1845.3 |  44.42 |
|        4K |  1654.5 |  41.81 |
|       16K |  1466.2 |  38.68 |
|       32K |  1183.7 |  34.76 |
|       64K |   866.6 |  28.14 |
|    TTFR 0 |    2266 |      - |
|   TTFR 4K |    5006 |      - |
|  TTFR 16K |   14211 |      - |
|  TTFR 32K |   31714 |      - |
|  TTFR 64K |   81801 |      - |
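
To put the slowdown in relative terms, here is a small sketch that computes how much PP and TG throughput is retained at each prefill depth compared to an empty context. The numbers are copied from the table above; the helper name `retention` is only for illustration.

```python
# Throughput figures copied from the table above (PP@4096 and TG@512 columns).
DEPTHS = ["0", "4K", "16K", "32K", "64K"]
PP = [1845.3, 1654.5, 1466.2, 1183.7, 866.6]
TG = [44.42, 41.81, 38.68, 34.76, 28.14]


def retention(values):
    """Return each value as a percentage of the first (empty-context) value."""
    base = values[0]
    return [round(100.0 * v / base, 1) for v in values]


pp_ret = retention(PP)
tg_ret = retention(TG)

for depth, p, t in zip(DEPTHS, pp_ret, tg_ret):
    print(f"{depth:>4}: PP {p:5.1f}%  TG {t:5.1f}%")
```

At 64K depth this works out to roughly 47% of the empty-context PP rate and roughly 63% of the TG rate, so prompt processing degrades noticeably faster than generation.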


## TG Peak (burst throughput)

| Prefilled | TG peak |
| --------- | ------- |
|         0 |   48.00 |
|        4K |   45.00 |
|       16K |   42.00 |
|       32K |   37.00 |
|       64K |   31.00 |

Yes, I've been keeping Kimi-K2.6 loaded up now instead of GLM-5.1 for the "heavy lifter".

Though I def wanna check out Qwen3.6 for the "small fast one" so to speak hah..

I know you really like the smaller ~100B Qwen MoE, but once I experienced MiniMax and higher, I just couldn't make any of the Qwen models work for me, no matter the size. And of course MiniMax feels stupid after GLM/Kimi. But it's very fast stupid :) Kimi 2.6 looks very promising. I'm downloading higher quants now to see if there is any difference for real work. I'll be able to run it with proper TP in a week or so. I heard an opinion that native INT4 models lose much more after quantization than the traditional bf16 to q8/q4 route. Curious what your opinion is. Based on the tests you do, what quant of INT4 seems to be the sweet spot?

> And of course MiniMax feels stupid after GLM/Kimi. But it's very fast stupid :)

haha, yeah I feel that way about MiniMax-M2.7 even at large quantization size, but to be nice it is only A10B so indeed very fast.

> I heard an opinion that native INT4 models lose much more after quantization than the traditional bf16 to q8/q4 route. Curious what your opinion is.

I've discussed it some already on various posts e.g.

Basically the original model is released using llm-tensor style int4 for the routed experts and bf16 for the rest: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/config.json#L79-L111

They don't release a "full bf16" version, only this pre-quantized QAT version with symmetric int4, based on previous discussions with them and jukofyork.
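
For anyone unfamiliar with the term, here is a minimal sketch of what symmetric group-wise int4 quantization means in general: each group of weights shares one scale, there is no zero-point, and values are rounded to integers in a signed 4-bit range. This is an illustration only, not the actual Kimi or ik_llama.cpp code path; the group size, the [-7, 7] range, and the function names are assumptions.

```python
def quantize_symmetric_int4(weights, group_size=32):
    """Quantize a flat list of floats to signed 4-bit codes, one scale per group.

    Symmetric means no zero-point: dequantized value = code * scale.
    group_size=32 is an assumed value for illustration.
    """
    quants, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7 so every code fits in [-7, 7].
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quants.extend(max(-7, min(7, round(w / scale))) for w in group)
    return quants, scales


def dequantize_symmetric_int4(quants, scales, group_size=32):
    """Reconstruct approximate floats from int4 codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(quants)]


# Tiny demo on made-up weights, using groups of 4 to keep it readable.
weights = [0.7, -0.33, 0.1, 0.0, -0.7, 0.5, 0.2, -0.04]
q, s = quantize_symmetric_int4(weights, group_size=4)
restored = dequantize_symmetric_int4(q, s, group_size=4)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, s, max_error)
```

The rounding error per weight is bounded by half the group's scale, which is why re-quantizing weights that were already trained against this grid to a different grid can cost more than one might expect.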

So juk created a patch, now built into ik_llama.cpp, which is only needed by quantizers. Users don't need the patch. Here is some recent info on that, as well as the link in the model card.

I believe that my Q4_X and @AesSedai's Q4_X, which are basically identical, are the best way to go for "local" inference. While yes, you could leave those extra 10GB in bf16, keep in mind that for an A32B model that is a lot more active weights dragging your TG speed way down on every token.

> Based on the tests you do, what quant of INT4 seems to be the sweet spot?

If you can fit the Q4_X you don't need anything bigger imo unless you really want to wait a long time for TG.

  • If you can't fit Q4_X, get the next largest quant you can fit, e.g. my IQ3_K, which preserves int4 for the ffn_down_exps and quantizes as little as possible, as can be seen in the "secret recipe".

Thank you for the detailed reply. With fp16, it feels like anything around Q5-Q8 is basically identical to the original, and we save an insane amount of resources running them. That is the context for the INT4 question. E.g. Q4_X seems to offer no savings and IQ3_K offers about 15%, while with traditional models we get ~50-70%. Just reflecting out loud.
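
The size comparison above can be made concrete with a rough bits-per-weight (bpw) calculation. The sketch below uses the standard llama.cpp block layouts for Q8_0 (8.5 bpw) and Q4_0 (4.5 bpw) against a 16-bit original. The 4.5 bpw starting point for native int4 tensors is an assumption for illustration, and the calculation ignores the tensors that stay in bf16.

```python
def size_savings(source_bpw, target_bpw):
    """Fraction of storage saved when going from source to target bits per weight."""
    return 1.0 - target_bpw / source_bpw


BF16_BPW = 16.0
Q8_0_BPW = 8.5   # 8-bit values plus one 16-bit scale per block of 32
Q4_0_BPW = 4.5   # 4-bit values plus one 16-bit scale per block of 32

bf16_to_q8 = size_savings(BF16_BPW, Q8_0_BPW)
bf16_to_q4 = size_savings(BF16_BPW, Q4_0_BPW)

# Assumption: a native int4 checkpoint already sits near 4.5 bpw for its
# quantized tensors, so there is far less left to remove.
NATIVE_INT4_BPW = 4.5
implied_target_bpw = NATIVE_INT4_BPW * (1.0 - 0.15)  # bpw implied by a ~15% saving

print(f"bf16 -> Q8_0: {bf16_to_q8:.0%}")
print(f"bf16 -> Q4_0: {bf16_to_q4:.0%}")
print(f"~15% off native int4 implies about {implied_target_bpw:.2f} bpw")
```

The first two results land close to the ~50-70% range mentioned above, while a model that starts at 4 bits simply has much less headroom before quality starts to suffer.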

> > And of course MiniMax feels stupid after GLM/Kimi. But it's very fast stupid :)
>
> haha, yeah I feel that way about MiniMax-M2.7 even at large quantization size, but to be nice it is only A10B so indeed very fast.

Interestingly, with full GPU offload via ik_llama, Kimi K2.6 actually is "almost" a fast model. I'd imagine it would totally crush it with TP. ~30-40 tps with 1k PP is a very usable setup.
