We probably all remember that VW shipped cars with special software that detected testing conditions (after decades of incidents where manufacturers did the same with actual hardware devices; see for example this list). It is a popular but sneaky way to make inferior products look much better on paper and to deceive consumers.
Today we see the same thing happening in another novel, hypercompetitive industry with trillions at stake: LLMs. For example, the long-awaited open-source model release from OpenAI shows clear signs of having been tuned to score really well on benchmarks while otherwise not performing well at all compared to other open-source models, likely due to training shortcuts. I experienced this personally when this self-proclaimed "highest performing open source model" failed miserably at medium-complexity coding tasks that a five-month-old, non-fancy Mistral model zooms through.
This is not unique to OpenAI; there have been similar accusations of benchwashing towards Meta (specifically llama4) and DeepSeek. A benchmark list like #1063531 becomes useless if the models are tuned to the benchmarks:

Model | MMLU-Pro | GPQA Diamond | LiveCodeBench | SciCode | IFBench | AIME 2025 |
---|---|---|---|---|---|---|
Qwen3 235B 2507 | 84% | 79% | 79% | 42% | 51% | 91% |
DeepSeek R1 0528 | 85% | 81% | 77% | 40% | 40% | 76% |
gpt-oss-120B | 79% | 72% | 69% | 39% | 64% | 86% |
GLM-4.5 | 84% | 78% | 74% | 35% | 44% | 74% |
Qwen3 30B 2507 | 81% | 71% | 71% | 33% | | |
Goodhart's law: gpt-oss-120b at rank 3 is almost entirely thanks to it (allegedly) gaming IFBench, despite not even being top 5 in LiveCodeBench.
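To see how sensitive the aggregate ranking is to a single benchmark, here is a quick back-of-the-envelope sketch in Python, using only the four complete rows quoted in the table above (the original list has more models, so absolute ranks differ): it averages each model's six scores, then repeats the exercise with IFBench excluded.

```python
# Back-of-the-envelope check using only the complete rows quoted above.
# (The original benchmark list contains more models; this is just an illustration.)
scores = {
    "Qwen3 235B 2507": {"MMLU-Pro": 84, "GPQA Diamond": 79, "LiveCodeBench": 79,
                        "SciCode": 42, "IFBench": 51, "AIME 2025": 91},
    "DeepSeek R1 0528": {"MMLU-Pro": 85, "GPQA Diamond": 81, "LiveCodeBench": 77,
                         "SciCode": 40, "IFBench": 40, "AIME 2025": 76},
    "gpt-oss-120B":     {"MMLU-Pro": 79, "GPQA Diamond": 72, "LiveCodeBench": 69,
                         "SciCode": 39, "IFBench": 64, "AIME 2025": 86},
    "GLM-4.5":          {"MMLU-Pro": 84, "GPQA Diamond": 78, "LiveCodeBench": 74,
                         "SciCode": 35, "IFBench": 44, "AIME 2025": 74},
}

def ranking(exclude=()):
    """Rank models by plain average score, optionally excluding some benchmarks."""
    avgs = {
        model: sum(v for bench, v in results.items() if bench not in exclude)
               / (len(results) - len(exclude))
        for model, results in scores.items()
    }
    return sorted(avgs.items(), key=lambda kv: kv[1], reverse=True)

print("All six benchmarks:")
for model, avg in ranking():
    print(f"  {model:18s} {avg:5.1f}")

print("Without IFBench:")
for model, avg in ranking(exclude=("IFBench",)):
    print(f"  {model:18s} {avg:5.1f}")
```

With all six columns averaged, gpt-oss-120B sits ahead of DeepSeek R1 and GLM-4.5 among these four; drop IFBench alone and it falls behind DeepSeek R1 into a tie with GLM-4.5. That is exactly the single-metric sensitivity Goodhart's law warns about.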