"When a measure becomes a target, it ceases to be a good measure"
That's a great, well-established principle for exactly this practice. I didn't think of it when writing this. Thanks for the reminder.
LiveBench is supposed to address the gamification, and there is some work being done on making the benchmarks better (see for example #1034728), but I fear this won't help against all the cheating: having gpt-oss-120b at rank 3 is almost entirely thanks to it (allegedly) gaming IFBench, despite it not even being top 5 in LiveCodeBench.
The best defense is just not to trust anything at face value.
"Don't trust, verify" is expensive at this pace. I have spent close to half a million sats just testing the new open-weights models that came out over the past 10 days¹ and basically discarding everything except Qwen3-coder.
So we have to trust something. Maybe we should do a stacker-bench where we measure stacker vs. LLM agreement using Cohen's kappa (like here).
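To make that concrete, here's a minimal sketch of what I mean, assuming scikit-learn's `cohen_kappa_score`; the post set and the votes are entirely made up:

```python
# Hypothetical "stacker-bench": have stackers and an LLM rate the same posts,
# then measure how well they agree using Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Made-up labels for the same 10 posts: 1 = "good post", 0 = "not good"
stacker_votes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
llm_votes     = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(stacker_votes, llm_votes)
print(f"Stacker vs LLM Cohen's kappa: {kappa:.2f}")  # ~0.58 here, i.e. moderate agreement
```

A kappa near 1 would mean the LLM ranks things the way stackers do, near 0 means no better than chance, so it at least corrects for chance agreement, unlike a raw match rate.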
Footnotes
1. I didn't even test Kimi a while back because it's too huge for me to even run on spot AWS (70-80% discount) within my budget.