TLDR:
OpenAI's official tests showed GPT-5 had mediocre hacking capabilities, on par with previous models. However, when the cybersecurity firm XBOW integrated GPT-5 into its autonomous penetration testing platform, the agent's performance more than doubled.
"The pace of improvement in AI-driven offensive security is accelerating dramatically."
The key takeaway is that an AI model's effectiveness depends heavily on the system it operates in. XBOW's platform provides the necessary tools, coordination, and framework (the "scaffolding") that unlocks GPT-5's hidden power, allowing it to find more vulnerabilities, faster, more consistently, and with fewer errors. GPT-5 specifically excels because of its superior reasoning and ability to write long, correct shell command sequences.
Progression of XBOW agent success rate over time with different models.