Well they did! See [https://deepswe.datacurve.ai/blog#methodology](https://deepswe.datacurve.ai/blog#methodology)

Has anyone gotten a 70% one-shot success rate in practice? If yes, in what language? For what task?

![benchmark](https://m.stacker.news/144105)

> Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often overlap on confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

> **Contamination free**: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.

> **High diversity**: Tasks span a broad pool of 91 repositories across 5 languages.

> **Real-world complexity**: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and \~2x more output tokens.

> **Reliable verification**: Verifiers are hand-written to test software behavior rather than implementation details.

> The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
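
For anyone wondering what "overlap on confidence intervals" means in practice, here is a rough sketch using a Wilson score interval on pass rates. The task counts and pass counts below are made up for illustration and are not DeepSWE or SWE-bench numbers, and the blog may well compute its intervals differently.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical results: two models evaluated on the same 100 tasks.
model_a = wilson_interval(70, 100)
model_b = wilson_interval(75, 100)

print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("overlap:", intervals_overlap(model_a, model_b))
```

With only 100 tasks, a 5-point gap between two models sits well inside the noise, which is the saturation problem the quote describes: scores cluster and the intervals cannot tell the models apart.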