Smoke and mirrors?
Hmm, no, but I suspect this means the coding frameworks are working around fundamental issues, rather than the actual ultra-expensive models improving much.
For example, in #1415961, the author is very enthusiastic about ralph-wiggum. This is a plugin that executes your prompt until Claude gets it right. A.k.a. if you bet 100x on a 1:100 odds outcome, you have a real chance at getting it right.
All this bull crap the liars-in-chief have been spilling at Davos is all about simulating until you get it right, rather than actually getting it right.
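To put a rough number on the "bet 100x on 1:100 odds" point above, here is a minimal sketch. It assumes every retry is independent and succeeds with the same probability, which real LLM retries probably don't satisfy, and the 1% per-attempt figure is just the illustrative number from the comment, not a measurement. The function name is made up for this example.

```python
def chance_of_at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries succeeds.

    Assumes the tries are independent and each succeeds with
    probability `p_single` (an idealisation, not a measured value).
    """
    # P(at least one success) = 1 - P(every attempt fails)
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # 100 retries at 1-in-100 odds each
    p = chance_of_at_least_one_success(0.01, 100)
    print(f"{p:.1%}")  # prints 63.4%
```

So under those assumptions, looping 100 times turns a 1% shot into roughly a 63% one, without the single-attempt odds improving at all.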
I did an experiment yesterday: I let LLMs battle each other on LMArena over a little Python snippet I wanted but didn't feel like firing up Claude Code for, and I got a shitton of same ol' same ol'.
So, it's 2026 and:
- gpt-5.2 and claude-4.5-haiku.
- grok-4.1-thinking and claude-4.5-sonnet hallucinated "top rankings on MTEB" for models which, when I checked, turned out to be ranked outside the top 25.
- claude-4.5-haiku doubled down on a non-functional script it wrote, insisting that my (clean!) venv was dirty and that I just needed to upgrade packages.
- amazon-nova-experimental-chat-12-10 is definitely experimental, as it told me to change imports to non-existent module paths.

And so on... where's all that improvement?
It was a fun experiment though.