It's another symptom of the overly financialized fiat economy: #1440612
I'm not even convinced they actually have good metrics for what constitutes a model improvement. But they are forced by this prisoner's dilemma to just target whatever metric the capital allocators use.
They don't have good metrics for a model improvement because they just train towards improving on the benchmarks - it's what you yourself often bring up: Goodhart's Law.
That's why GPT-5.2 is sublime in the benchmarks but sucks in human Elo rankings. I was watching for it to drop on the arena for days after launch, only to realize it was ranked #25 and I just hadn't scrolled down far enough. The same happened with Llama 4, which was (allegedly) even doctored for the bench - Meta is the VW of software now.
So yes, it's all about the money, not so much about delivering world class autocorrect.
I'm sure better metrics exist. They should have enough users to A/B test user satisfaction and feedback, for example. But the metrics don't matter if that's not what the investors are looking at. Which brings me to the whole financialization thing: the real audience is the investor, not the user.
I'm sure better metrics exist.
For us, yes: things like arena.ai (I actively use it for 1-shot comparisons) let you not only see the leaderboard but also judge for yourself, even though its Elo is biased towards normie prompts. I wish they had prompter tiers (or maybe they do and I'm in retard tier haha)
They should have enough users to A/B test user satisfaction
Too paranoid! The competition is on their platforms: #1064935
financialization, the real audience is the investor
Not contending that. Aren't we just seeing the same old playbook, except with 4 contenders still standing and at 10x speed? The only crack I've seen is Sequoia jumping on the Anthropic bandwagon (#1415041), which I think is the worst signal Sam Altman has gotten in his entire life, because they never jumped before.
I fear that the report about the xAI pressure cooker (#1438462) and, from a few months back, the report about the gpt-codex pressure cooker (#1040555) illustrate very well what every frontier model company is doing right now.
They have weeks to ship because of the competition. Remember, grok-4.1 was king of the world for exactly one day; then Google launched Gemini-3, and the spotlight was over for Grok. It's all going fast. Too fast, I think, and not because I'm worried about sAfEtY, but because I'm worried about what this does to quality.

Delivering broken products may have been the norm in Silicon Valley for a while now, but that's also why there was room for multiple 100x outcomes. If you onboard your entire future customer base on version 0.1, that's not going to happen. Because I believe we should read GPT 5.3 as version 0.5.3, Grok 4.1 as 0.4.1, and Gemini 3 as 0.3 - these aren't mature products in any way; they're not even "alpha" worthy. It's all experimental, and they regress across minor versions like crazy.

AI companies need to chill a little, but consumers, Wall Street, and all the hype-pushers need to chill a little too. Maybe #1440600 is a good perspective to at least keep in mind. To illustrate: