Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where the confidence intervals of adjacent configurations often overlap. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:
Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
Has anyone gotten a 70% one-shot success rate in practice? If yes, in what language? For what task?
Well, they did! See https://deepswe.datacurve.ai/blog#methodology
Okay. I guess it's just my "skill problem" and these guys are my betters.
I'll put on a collar and hand them the leash.