I ran an experiment yesterday: I let LLMs battle each other on LMArena over a little Python snippet I wanted but didn't feel like firing up Claude Code for, and I got a shitton of the same ol' same ol'.

So, it's 2026 and:

  1. Newer LLMs still hallucinate non-existent PyPI packages, including gpt-5.2 and claude-4.5-haiku.
  2. Both grok-4.1-thinking and claude-4.5-sonnet hallucinated "top rankings on MTEB" for models that, when I checked, actually ranked outside the top 25.
  3. claude-4.5-haiku doubled down on a non-functional script it wrote, insisting that my (clean!) venv was dirty and that I just needed to upgrade packages.
  4. amazon-nova-experimental-chat-12-10 is definitely experimental, as it told me to change my imports to non-existent module paths.

and so on... where's all that improvement?

It was a fun experiment though.

100 sats \ 1 reply \ @BlokchainB 2h

Smoke and mirrors?

53 sats \ 0 replies \ @optimism 1h

Hmm, no, but I suspect this means the coding frameworks are working around fundamental issues rather than the ultra-expensive models themselves improving much.

For example, in #1415961, the author is very enthusiastic about ralph-wiggum, a plugin that re-runs your prompt until Claude gets it right. A.k.a. if you bet 100 times on a 1:100 odds outcome, you have a real chance of winning at least once.
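Back-of-the-envelope, assuming each retry is an independent 1-in-100 shot (real retries aren't independent, but it illustrates the odds):

```python
# P(at least one success in n independent tries at probability p)
def at_least_one_success(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(at_least_one_success(0.01, 100))  # ~0.634, i.e. ~63% for 100 bets at 1:100 odds
```

So "keep looping until it works" looks like progress about two thirds of the time, without the underlying model getting any better.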

All this bull crap the liars-in-chief have been spilling at Davos is all about simulating until you get it right, rather than actually getting it right.

101 sats \ 1 reply \ @ek 5h

> Newer LLMs still hallucinate non-existent PyPI packages, including gpt-5.2 and claude-4.5-haiku.

Great for slopsquatting

63 sats \ 0 replies \ @optimism 5h

Yeah. It's amazing that this is only addressed in front-end frameworks. Very fragile.
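A minimal sketch of the kind of guard an agent harness could run before installing anything LLM-suggested, using PyPI's JSON API; the second package name below is a made-up placeholder:

```python
# Check candidate dependency names against PyPI before installing them.
# A 404 from the JSON API means the package doesn't exist (yet).
import urllib.error
import urllib.request

def exists_on_pypi(name: str) -> bool:
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise  # rate limits, outages etc. should fail loudly, not pass silently

for pkg in ("requests", "totally-real-embedding-lib"):  # second name is hypothetical
    print(pkg, exists_on_pypi(pkg))
```

Of course this only catches names nobody has registered; once a slopsquatter claims the hallucinated name, the check passes, which is exactly why relying on it alone is fragile.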
