People notice that while AI can now write programs, design websites, etc, [..]
It can only do this in the opinion of people that aren't capable of writing said programs or designing said websites themselves, and we see the newer models not making much headway into being much better.
gpt-5 regressed against gpt-4o in instruction hierarchy - meaning that important instructions are ignored more often - and this has, despite being promised to be mitigated, in almost 2 months not been mitigated according to their own, current, system card. Especially cognitive rulesets, which is what we use in coding instructions, but also in the detection of malicious user messages on chatbots, regressed according to OpenAI themselves from a 73% success factor to 62% in gpt-5. I've lightly observed similar outcomes in claude-4-sonnet vs claude-3.7-sonnet in a vibe coding setup.
So it's not a "breathing spell", imho. It's a failure to deliver an improved product. Products that had room to grow, like qwen-2.5 to qwen-3, have in the meantime improved, although that mostly means that they have caught up with the bleeding edge a bit.
There are however 2 products that did deliver some improvement: Gemini 2.5 and Grok 4, but they too have slowed down a bit.
Remember that errors compound in agentic setups because those are loops. So if you have a 5% error rate on one-shot, then you have a ~23% error rate when you have the same LLM perform 5 steps on top of each other.
gpt-5
regressed againstgpt-4o
in instruction hierarchy - meaning that important instructions are ignored more often - and this has, despite being promised to be mitigated, in almost 2 months not been mitigated according to their own, current, system card. Especially cognitive rulesets, which is what we use in coding instructions, but also in the detection of malicious user messages on chatbots, regressed according to OpenAI themselves from a 73% success factor to 62% ingpt-5
. I've lightly observed similar outcomes inclaude-4-sonnet
vsclaude-3.7-sonnet
in a vibe coding setup.qwen-2.5
toqwen-3
, have in the meantime improved, although that mostly means that they have caught up with the bleeding edge a bit.