Interesting:

- Qwen3 beats Gemma-3
- Gemini-2.5-pro beats Claude-4-Sonnet
- this too shows the instruction regression between o3 and gpt-5
- gpt-oss-120b scores really high in this benchmark [1]

[1] though for my personal use-cases it was worse than qwen3 and gemini-2.5-flash!