I ran an experiment yesterday: I let LLMs battle each other on LMArena over a little Python snippet I wanted but didn't feel like firing up Claude Code for, and I got a shitton of the same ol' same ol'.

So, it's 2026 and:

  1. Newer LLMs still hallucinate non-existent PyPI packages, including gpt-5.2 and claude-4.5-haiku.
  2. Both grok-4.1-thinking and claude-4.5-sonnet hallucinated "top rankings on MTEB" for models that, when I checked, actually ranked outside the top 25.
  3. claude-4.5-haiku doubled down on a non-functional script it wrote, insisting that my (clean!) venv was dirty and that I just needed to upgrade packages.
  4. amazon-nova-experimental-chat-12-10 is definitely experimental, as it told me to change my imports to non-existent module paths.

and so on... where's all that improvement?

It was a fun experiment though.

100 sats \ 1 reply \ @BlokchainB 2h

Smoke and mirrors?

53 sats \ 0 replies \ @optimism 1h

Hmm, no, but I suspect this means the coding frameworks are working around fundamental issues rather than the ultra-expensive models themselves improving much.

For example, in #1415961, the author is very enthusiastic about ralph-wiggum, a plugin that re-runs your prompt until Claude gets it right. A.k.a. if you bet 100 times on a 1:100 odds outcome, you have a real chance of winning at least once.
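Back-of-the-envelope, assuming each retry is an independent 1-in-100 shot (real retries aren't independent, but it illustrates the odds):

```python
# P(at least one success in n independent tries at probability p)
def at_least_one_success(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(at_least_one_success(0.01, 100))  # ~0.634, i.e. ~63% for 100 bets at 1:100 odds
```

So "keep looping until it works" looks like progress about two thirds of the time, without the underlying model getting any better.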

All this bull crap the liars-in-chief have been spilling at Davos is all about simulating until you get it right, rather than actually getting it right.

101 sats \ 1 reply \ @ek 5h

> Newer LLMs still hallucinate non-existent PyPI packages, including gpt-5.2 and claude-4.5-haiku.

Great for slopsquatting

63 sats \ 0 replies \ @optimism 5h

Yeah. It's amazing that this is only addressed in front-end frameworks. Very fragile.
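A minimal sketch of the kind of guard an agent harness could run before installing anything LLM-suggested, using PyPI's JSON API; the second package name below is a made-up placeholder:

```python
# Check candidate dependency names against PyPI before installing them.
# A 404 from the JSON API means the package doesn't exist (yet).
import urllib.error
import urllib.request

def exists_on_pypi(name: str) -> bool:
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise  # rate limits, outages etc. should fail loudly, not pass silently

for pkg in ("requests", "totally-real-embedding-lib"):  # second name is hypothetical
    print(pkg, exists_on_pypi(pkg))
```

Of course this only catches names nobody has registered; once a slopsquatter claims the hallucinated name, the check passes, which is exactly why relying on it alone is fragile.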
