I'm sure better metrics exist. They should have enough users to A/B test user satisfaction and feedback, for example. But the metrics don't matter if that's not what the investors are looking at. Which brings me to the whole financialization thing: the real audience is the investor, not the user.
I'm sure better metrics exist.
For us, yes, things like arena.ai (I actively use that for 1-shot) allow you to not only see the leaderboard, but to judge for yourself, even though its Elo is biased towards normie prompts. I wish they had prompter tiers (or maybe they do and I'm in retard tier haha)
They should have enough users to A/B test user satisfaction
Too paranoid! The competition is on their platforms: #1064935
financialization, the real audience is the investor
Not contending that. Aren't we just seeing the same old playbook, except with 4 contenders still standing and at 10x speed? The only crack I've seen is Sequoia jumping on the Anthropic bandwagon (#1415041), which I think must be the worst signal Sam Altman has gotten in his entire life, because they never jumped before.
They don't have good metrics for a model improvement because they just train towards improving on the benchmarks - it's what you yourself often bring up: Goodhart's Law.
That's why GPT-5.2 is sublime in the benchmarks but sucks in the human Elo ranking. I was looking out for it to drop on arena for days after launch, only to realize it was ranked #25 and I just hadn't scrolled down far enough. The same happened to Llama 4, which was (allegedly) even doctored for the bench - Meta is the VW of software now.
So yes, it's all about the money, not so much about delivering world class autocorrect.