Interesting:

- Qwen3 beats Gemma-3
- Gemini-2.5-pro beats Claude-4-Sonnet
- This, too, shows the instruction regression between o3 and gpt-5
- gpt-oss-120b scores really high in this benchmark [1]

Footnotes

[1] Though for my personal use-cases it was worse than Qwen3 and Gemini-2.5-flash!