It's another symptom of the overly financialized fiat economy: #1440612

I'm not even convinced they actually have good metrics for what constitutes a model improvement. But they are forced by this prisoner's dilemma to just target whatever metric the capital allocators use.

15 sats \ 2 replies \ @optimism 7h

They don't have good metrics for a model improvement because they just train towards improving on the benchmarks - it's what you yourself often bring up: Goodhart's Law.

That's why GPT-5.2 is sublime in the benchmarks but sucks in human Elo rankings. I was watching for it to drop on arena for days after launch, only to realize it was ranked #25 and I just hadn't scrolled down far enough. The same happened to Llama 4, which was (allegedly) even doctored for the bench - Meta is the VW of software now.

So yes, it's all about the money, not so much about delivering world-class autocorrect.

reply

I'm sure better metrics exist. They should have enough users to A/B test user satisfaction and feedback, for example. But the metrics don't matter if that's not what the investors are looking at. Which brings me to the whole financialization thing: the real audience is the investor, not the user.

reply
82 sats \ 0 replies \ @optimism 6h
> I'm sure better metrics exist.

For us, yes, things like arena.ai (I actively use that for 1-shot) allow you to not only see the leaderboard, but to judge for yourself, even though its Elo is biased towards normie prompts. I wish they had prompter tiers (or maybe they do and I'm in retard tier haha).

> They should have enough users to A/B test user satisfaction

Too paranoid! The competition is on their platforms: #1064935

> financialization, the real audience is the investor

Not contesting that. Aren't we just seeing the same old playbook, except with 4 contenders still standing and at 10x speed? The only crack I've seen is Sequoia jumping on the Anthropic bandwagon (#1415041), which I think must be the worst signal Sam Altman has gotten in his entire life, because they had never jumped before.

reply