It's another symptom of the overly financialized fiat economy: #1440612
I'm not even convinced they actually have good metrics for what constitutes a model improvement. But they are forced by this prisoner's dilemma to just target whatever metric the capital allocators use.
They don't have good metrics for a model improvement because they just train towards improving on the benchmarks - it's what you yourself often bring up: Goodhart's Law.
That's why GPT-5.2 is sublime in the benchmarks but sucks in human Elo rankings. I was watching for it to drop on the arena for days after launch, only to realize it was ranked #25 and I just hadn't scrolled down far enough. The same happened with Llama 4, which was (allegedly) even doctored for the bench - Meta is the VW of software now.
So yes, it's all about the money, not so much about delivering world class autocorrect.
I'm sure better metrics exist. They should have enough users to A/B test user satisfaction and feedback, for example. But the metrics don't matter if that's not what the investors are looking at. Which brings me to the whole financialization thing: the real audience is the investor, not the user.
I'm sure better metrics exist.
For us, yes: things like arena.ai (I actively use it for 1-shot comparisons) let you not only see the leaderboard but also judge for yourself, even though its Elo is biased towards normie prompts. I wish they had prompter tiers (or maybe they do and I'm in retard tier haha)
They should have enough users to A/B test user satisfaction
Too paranoid! The competition is on their platforms: #1064935
financialization, the real audience is the investor
Not contending that. Aren't we just seeing the same old playbook, except with 4 contenders still standing and at 10x speed? The only crack I've seen is Sequoia jumping on the Anthropic bandwagon (#1415041), which I think is the worst signal Sam Altman has gotten in his entire life, because they never jumped before.
I fear that the report about the xAI pressure cooker (#1438462) and, from a few months back, the report about the gpt-codex pressure cooker (#1040555) illustrate very well what every frontier model company is doing right now.
They have weeks to ship because of the competition. Remember, grok-4.1 was king of the world for exactly one day; then Google launched Gemini-3, and the spotlight was over for Grok. It's all going fast. Too fast, I think, and not because I'm worried about sAfEtY, but because I'm worried about what this does to quality.

Delivering broken products may have been the norm in Silicon Valley for a while now, but that's also why there was room for multiple 100x outcomes. If you onboard your entire future customer base on version 0.1, that's not going to happen. Because I believe we should read GPT 5.3 as version 0.5.3, Grok 4.1 as 0.4.1, and Gemini 3 as 0.3 - these aren't mature products in any way; they're not even "alpha" worthy. It's all experimental, and they regress across minor versions like crazy.

AI companies need to chill a little, but consumers, Wall Street, and all the hype-pushers need to chill a little too. Maybe #1440600 is a good perspective to at least keep in mind. To illustrate: