Interesting:
  • Qwen3 beats Gemma-3
  • Gemini-2.5-pro beats Claude-4-Sonnet
  • this also shows the instruction-following regression from o3 to gpt-5
  • gpt-oss-120b scores really high on this benchmark [1]

Footnotes

  1. though for my personal use-cases it was worse than qwen3 and gemini-2.5-flash!