"When a measure becomes a target, it ceases to be a good measure"
That's a great, well-established principle for exactly this practice. I didn't think of it when writing this. Thanks for the reminder.
LiveBench is supposed to address the gamification, and there is some work being done on making the benchmarks better (see for example #1034728), but I fear this won't help against all the cheating: having gpt-oss-120b at rank 3 is almost entirely thanks to it (allegedly) gaming IFBench, despite it not even being top 5 in LiveCodeBench.
The best defense is just not to trust anything at face value.
"Don't trust, verify" is expensive at this pace. I have spent close to half a million sats just testing the new open-weights models that came out over the past 10 days¹ and basically discarding everything except Qwen3-coder.
So we have to trust something. Maybe we should do a stacker-bench where we measure stacker vs. LLM agreement using Cohen's kappa (like here).
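To make that concrete, here's a minimal sketch of what I mean, assuming scikit-learn's `cohen_kappa_score`; the post set and the votes are entirely made up:

```python
# Hypothetical "stacker-bench": have stackers and an LLM rate the same posts,
# then measure how well they agree using Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Made-up labels for the same 10 posts: 1 = "good post", 0 = "not good"
stacker_votes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
llm_votes     = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(stacker_votes, llm_votes)
print(f"Stacker vs LLM Cohen's kappa: {kappa:.2f}")  # ~0.58 here, i.e. moderate agreement
```

A kappa near 1 would mean the LLM ranks things the way stackers do, near 0 means no better than chance, so it at least corrects for chance agreement, unlike a raw match rate.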
Footnotes
1. I didn't even test Kimi a while back because it's too huge for me to even run on spot AWS (70-80% discount) within my budget.