We probably all remember that VW shipped cars with special software that detected testing conditions (after decades of incidents where manufacturers did the same with actual hardware devices; see for example this list). It is a popular but sneaky way to make inferior products look much better on paper and to deceive consumers.
Today we see the same thing happening in another novel, hypercompetitive industry with trillions at stake: LLMs. For example, the long-awaited open-source model release from OpenAI shows clear signs of having been tuned to score really well on benchmarks while otherwise not performing well at all compared to other open-source models, likely due to training shortcuts. I experienced this personally when this self-proclaimed "highest performing open source model" failed miserably at medium-complexity coding tasks that a five-month-old, non-fancy Mistral model zooms through.
This is not unique to OpenAI; there have been similar accusations of benchwashing towards Meta (specifically llama4) and DeepSeek. A benchmark list like #1063531 becomes useless if the models are tuned to the benchmarks:

Model | MMLU-Pro | GPQA Diamond | LiveCodeBench | SciCode | IFBench | AIME 2025 |
---|---|---|---|---|---|---|
Qwen3 235B 2507 | 84% | 79% | 79% | 42% | 51% | 91% |
DeepSeek R1 0528 | 85% | 81% | 77% | 40% | 40% | 76% |
gpt-oss-120B | 79% | 72% | 69% | 39% | 64% | 86% |
GLM-4.5 | 84% | 78% | 74% | 35% | 44% | 74% |
Qwen3 30B 2507 | 81% | 71% | 71% | 33% | | |
Goodhart's law: gpt-oss-120b at rank 3 is almost entirely thanks to it (allegedly) gaming IFBench, despite not even being top 5 in LiveCodeBench.
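To see how sensitive the aggregate ranking is to a single benchmark, here is a quick back-of-the-envelope sketch in Python, using only the four complete rows quoted in the table above (the original list has more models, so absolute ranks differ): it averages each model's six scores, then repeats the exercise with IFBench excluded.

```python
# Back-of-the-envelope check using only the complete rows quoted above.
# (The original benchmark list contains more models; this is just an illustration.)
scores = {
    "Qwen3 235B 2507": {"MMLU-Pro": 84, "GPQA Diamond": 79, "LiveCodeBench": 79,
                        "SciCode": 42, "IFBench": 51, "AIME 2025": 91},
    "DeepSeek R1 0528": {"MMLU-Pro": 85, "GPQA Diamond": 81, "LiveCodeBench": 77,
                         "SciCode": 40, "IFBench": 40, "AIME 2025": 76},
    "gpt-oss-120B":     {"MMLU-Pro": 79, "GPQA Diamond": 72, "LiveCodeBench": 69,
                         "SciCode": 39, "IFBench": 64, "AIME 2025": 86},
    "GLM-4.5":          {"MMLU-Pro": 84, "GPQA Diamond": 78, "LiveCodeBench": 74,
                         "SciCode": 35, "IFBench": 44, "AIME 2025": 74},
}

def ranking(exclude=()):
    """Rank models by plain average score, optionally excluding some benchmarks."""
    avgs = {
        model: sum(v for bench, v in results.items() if bench not in exclude)
               / (len(results) - len(exclude))
        for model, results in scores.items()
    }
    return sorted(avgs.items(), key=lambda kv: kv[1], reverse=True)

print("All six benchmarks:")
for model, avg in ranking():
    print(f"  {model:18s} {avg:5.1f}")

print("Without IFBench:")
for model, avg in ranking(exclude=("IFBench",)):
    print(f"  {model:18s} {avg:5.1f}")
```

With all six columns averaged, gpt-oss-120B sits ahead of DeepSeek R1 and GLM-4.5 among these four; drop IFBench alone and it falls behind DeepSeek R1 into a tie with GLM-4.5. That is exactly the single-metric sensitivity Goodhart's law warns about.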