You're very much on to something here, and this is why I think it matters whether this behavior is intentional or latent.
If they've taught it to recognize benchmarks specifically, that's benchmaxxing, and it's not going to help real-world performance when your real tasks don't trigger the maxxed paths. This is a genuine concern.
If they've taught it to "reach beyond the prompt" in the general sense, to understand the context and user intent behind the query, that's a genuinely useful capability and would explain why this model feels a little different.
Some stats: some version of this reasoning path showed up in 39 of 1,070 test configurations (roughly 3.6%), across 4 of my 12 tasks. In the most common case, responsible for 30 of the 39 hits, the model recognized the task as coming from BigBenchHard specifically and used its knowledge of the BBH category sets, which unfortunately suggests benchmaxxing.
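
To make the counting concrete, here's roughly the kind of scan this boils down to. It's a minimal sketch, not my actual harness: the `traces/` layout, file naming, and marker strings are placeholders, and the real classification involved reading the traces rather than string matching.

```python
from pathlib import Path

# Hypothetical layout: one reasoning trace per test configuration,
# stored as plain text under traces/<task>/<config_id>.txt
TRACE_DIR = Path("traces")

# Placeholder markers for "the model named the benchmark itself"
BBH_MARKERS = ("bigbench", "big-bench hard", "bbh")

hits = []
for trace_path in TRACE_DIR.glob("*/*.txt"):
    text = trace_path.read_text().lower()
    if any(marker in text for marker in BBH_MARKERS):
        hits.append(trace_path)

# Group hits by task (the parent directory name in this layout)
tasks_affected = {p.parent.name for p in hits}
print(f"{len(hits)} configurations mention the benchmark by name, "
      f"across {len(tasks_affected)} tasks")
```

A keyword scan like this only flags candidates; the interesting part is whether the trace then leans on the benchmark's category structure, which is what separates "recognized the source" from "actually used it to answer".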

