You're very much on to something here, and this is why I think it matters whether this behavior is intentional or latent.
If they've taught it to recognize benchmarks specifically, that's benchmaxxing, and it's not going to help real-world performance when your real tasks don't trigger the maxxed paths. This is a genuine concern.
If they've taught it to "reach beyond the prompt" in the general sense, to understand the context and user intent behind the query, that's a genuinely useful capability and would explain why this model feels a little different.
Some stats: some version of this reasoning path showed up in 39 of 1,070 test configurations (roughly 3.6%), across 4 of my 12 tasks. In the most common case, responsible for 30 of the 39 hits, the model recognized the task as coming from BigBenchHard specifically and used its knowledge of the BBH category sets, which unfortunately suggests benchmaxxing.
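
To make the counting concrete, here's roughly the kind of scan this boils down to. It's a minimal sketch, not my actual harness: the `traces/` layout, file naming, and marker strings are placeholders, and the real classification involved reading the traces rather than string matching.

```python
from pathlib import Path

# Hypothetical layout: one reasoning trace per test configuration,
# stored as plain text under traces/<task>/<config_id>.txt
TRACE_DIR = Path("traces")

# Placeholder markers for "the model named the benchmark itself"
BBH_MARKERS = ("bigbench", "big-bench hard", "bbh")

hits = []
for trace_path in TRACE_DIR.glob("*/*.txt"):
    text = trace_path.read_text().lower()
    if any(marker in text for marker in BBH_MARKERS):
        hits.append(trace_path)

# Group hits by task (the parent directory name in this layout)
tasks_affected = {p.parent.name for p in hits}
print(f"{len(hits)} configurations mention the benchmark by name, "
      f"across {len(tasks_affected)} tasks")
```

A keyword scan like this only flags candidates; the interesting part is whether the trace then leans on the benchmark's category structure, which is what separates "recognized the source" from "actually used it to answer".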

