
We probably all remember that VW shipped cars with special software that detected test conditions (after decades of incidents where manufacturers did the same with actual hardware devices; see for example this list). It's a popular but sneaky way to make an inferior product look much better on paper and deceive consumers.
Today, in another novel, hypercompetitive industry with trillions at stake, LLMs, we see the same thing happening. For example, the long-awaited open-source model release from OpenAI shows clear signs of having been tuned to perform really well on benchmarks while otherwise not performing well at all compared to other open-source models, likely due to training shortcuts. I experienced this personally when this self-proclaimed "highest performing open source model" failed miserably at medium-complexity code tasks that a 5-month-old, non-fancy Mistral model zooms through.
This is not unique to OpenAI; similar accusations of benchwashing have been made against Meta (specifically Llama 4) and DeepSeek. A benchmark list like #1063531 becomes useless if the models are tuned to the benchmarks:
| Model | MMLU-Pro | GPQA Diamond | LiveCodeBench | SciCode | IFBench | AIME 2025 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B 2507 | 84% | 79% | 79% | 42% | 51% | 91% |
| DeepSeek R1 0528 | 85% | 81% | 77% | 40% | 40% | 76% |
| gpt-oss-120B | 79% | 72% | 69% | 39% | 64% | 86% |
| GLM-4.5 | 84% | 78% | 74% | 35% | 44% | 74% |
| Qwen3 30B 2507 | 81% | 71% | 71% | 33% | | |

Do you have a strategy to defend against manufacturers lying through their teeth? Does it work?

100 sats \ 0 replies \ @sudonaka 9h
Simple, only use FOSS
reply
The best defense is just not to trust anything at face value. Goodhart's law is probably hitting the traditional AI benchmarks hard right now, thanks to hyper-competition, a hyper-fast-paced media environment, and a credulous public that doesn't know how any of it works under the hood.
reply
Bonus points for Goodhart's law
"When a measure becomes a target, it ceases to be a good measure"
That's a great standardized principle for the practice. I didn't think of it when writing this. Thanks for the reminder.
Livebench is supposed to address the gamification, and there is some work being done on making benchmarks better (see for example #1034728), but I fear this won't help against all the cheating: gpt-oss-120b's rank 3 is almost entirely thanks to it (allegedly) gaming IFBench, despite not even being top 5 in LiveCodeBench.
The best defense is just not to trust anything at face value.
"Don't trust, verify" is expensive at this pace. I have spent close to half a million sats just testing the new open-weights models that came out in the past 10 days 1, and basically discarded everything except Qwen3-Coder.
So we have to trust something. Maybe we should build a stacker-bench where we measure stacker-vs-LLM agreement with Cohen's kappa (like here).
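For what it's worth, Cohen's kappa is simple enough to compute without any ML tooling. A minimal sketch, with entirely made-up labels standing in for hypothetical stacker votes and LLM-judge verdicts (kappa near 1 means strong agreement beyond chance, near 0 means agreement no better than chance):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and a, "need two equal-length, non-empty sequences"
    n = len(a)
    # Observed agreement: fraction of items both raters labeled the same
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labeled independently at random,
    # keeping each rater's label frequencies
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: stacker votes vs. an LLM judge on 8 answers
stacker = ["good", "bad", "good", "good", "bad", "good", "bad", "good"]
llm     = ["good", "bad", "good", "bad",  "bad", "good", "good", "good"]
print(round(cohens_kappa(stacker, llm), 2))  # → 0.47
```

A kappa around 0.47 would count as only moderate agreement, which is exactly the kind of signal such a stacker-bench could surface per model.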

Footnotes

  1. I didn't even test Kimi a while back because it's too huge for me to run on AWS spot instances (70-80% discount) within my budget.
reply
100 sats \ 1 reply \ @Scoresby 18h
I'm out of my depth here when discussing benchmarks, but it reminds me a little of the people who talked about their standardized test scores as if training themselves to perform well on such a thing was a meaningful achievement. Benchmarks seem like they are great for cars and lawnmowers, but perhaps not so good for intelligence.
reply
it reminds me a little of the people who talked about their standardized test scores as if training themselves to perform well on such a thing was a meaningful achievement.
I personally don't do well in standardized learning environments - never have, not even as a kid - so I share this view, also because I've been haunted for years by people focusing on certificates rather than capability. So I think this is a good general comparison.
But if we ignore standardized lists/rankings, the problem just changes shape: how do we discover new stuff? Do we take signals from SN? Right now I would, because there are many smart and honest people here. Without SN 1, it would be much harder for me to filter signal from noise - so how does a bird-app-only normie do it?

Footnotes

  1. I kind of feel that SN is not representative of the rest of the world because it is still niche. And the moment it grows out of the niche and normies get access (or it gets truly invaded by the slop army), I expect we will lose this benefit, just like it was lost on the Reddit of old. So having normies on SN kind of goes against the quality interactions we enjoy today.
reply
100 sats \ 1 reply \ @d680ecaa8e 16h
Manufacturers have a lot of lawyers, and we're too broke to afford the cost of defending against them or taking them to court; lawyer fees are expensive.
reply
So legal action is not a good deterrent. Personally, I've never sued someone who lied to me or even about me; I just undermine them every chance I get, in any way I can.
reply
0 sats \ 1 reply \ @Oxy 19h
✅ A strong model will admit uncertainty, explain its logic clearly, and self-correct. ❌ A benchwashed model may give confident, plausible-sounding answers that are factually wrong.
reply
Is that what your LLM told you? lol
reply
No wash wash only bed bed
reply