> At this point every week there’s a new “insane benchmark” headline

True. A year ago o3 was the best model on the market. Progress is fast.

> Real test is still, can it actually help without hallucinating halfway through the task?

Have you used a SOTA model in opencode yet? Chatbots still do that, but the progress of agents is on another level.

zuspotirko

At this point every week there’s a new “insane benchmark” headline

Real test is still, can it actually help without hallucinating halfway through the task? But yeah, the pace AI models are moving right now is honestly wild.

Google Gemini 3.5 Flash: insane benchmark results
