The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate: they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of the input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context such as build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability.

Our evaluation of leading LLMs on A.S.E reveals three key findings:

- Claude-3.7-Sonnet achieves the best overall performance.
- The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score.
- Concise, “fast-thinking” decoding strategies consistently outperform complex, “slow-thinking” reasoning for security patching.
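The framework itself isn't reproduced here, but to make the "containerized, expert-rule" idea concrete, below is a minimal sketch of what such a check can look like. Everything specific in it (Semgrep as the analyzer, the image tag, the rule file path, the pass/fail criterion) is an illustrative assumption of mine, not A.S.E's actual pipeline:

```python
import json
import subprocess

# Hypothetical sketch only: A.S.E's real analyzer, image, rule format, and
# scoring are not specified above; Semgrep and all paths here are assumptions.
def scan_repo(repo_dir: str, rules_file: str) -> dict:
    """Run an expert-rule security scan inside a pinned container image so
    every candidate patch is judged by an identical toolchain (reproducibility)."""
    proc = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{repo_dir}:/src:ro",           # candidate repo, mounted read-only
            "-v", f"{rules_file}:/rules.yaml:ro",  # expert-defined rules
            "semgrep/semgrep:1.50.0",              # pinned analyzer version (assumption)
            "semgrep", "scan", "--config", "/rules.yaml", "--json", "/src",
        ],
        capture_output=True,
        text=True,
    )
    findings = json.loads(proc.stdout)["results"]  # one entry per rule match
    return {"secure": not findings, "n_findings": len(findings)}
```

Pinning the image version is what buys the stable, auditable verdicts the paper emphasizes: same rules, same analyzer, same result on re-runs.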
Leaderboard (all scores on a 0–100 scale):
Rank | Model | License | Overall | Security | Quality | Stability |
---|---|---|---|---|---|---|
1 | Claude-3.7-Sonnet-20250219 | Proprietary | 63.01 | 46.72 | 91.58 | 75.00 |
2 | Claude-3.7-Sonnet-Thinking-20250219 | Proprietary | 61.04 | 44.65 | 89.85 | 72.92 |
3 | Qwen3-235B-A22B-Instruct-2507 | Open Source | 60.15 | 48.03 | 82.08 | 67.08 |
4 | Qwen3-Coder | Open Source | 59.31 | 42.69 | 85.16 | 81.54 |
5 | DeepSeek-V3-20250324 | Open Source | 58.59 | 40.89 | 85.87 | 82.94 |
6 | Claude-Sonnet-4-20250514 | Proprietary | 57.14 | 34.78 | 92.37 | 85.65 |
7 | Kimi-K2-20250711-Preview | Open Source | 55.29 | 37.82 | 79.90 | 86.25 |
8 | GPT-4o-20241120 | Proprietary | 55.10 | 45.65 | 72.46 | 59.67 |
9 | Qwen-Coder-Plus-20241106 | Proprietary | 53.55 | 37.98 | 73.78 | 86.27 |
10 | Claude-Opus-4-20250514 | Proprietary | 52.71 | 31.95 | 85.82 | 77.91 |
11 | Grok-3 | Proprietary | 52.18 | 38.64 | 73.54 | 69.41 |
12 | DeepSeek-R1-20250528 | Open Source | 51.76 | 38.01 | 74.39 | 66.38 |
13 | Gemini-2.5-Pro-Exp-20250325 | Proprietary | 51.02 | 29.98 | 84.04 | 78.21 |
14 | Claude-Sonnet-4-Thinking-20250514 | Proprietary | 50.92 | 34.10 | 76.81 | 74.22 |
15 | Claude-Opus-4-Thinking-20250514 | Proprietary | 50.17 | 30.70 | 79.84 | 77.98 |
16 | GLM-4.5 | Open Source | 49.80 | 35.92 | 70.24 | 71.74 |
17 | Grok-4 | Proprietary | 42.40 | 29.53 | 59.78 | 67.42 |
18 | o4-mini-20250416 | Proprietary | 41.35 | 27.87 | 60.74 | 64.07 |
19 | Grok-3-mini | Proprietary | 30.49 | 22.37 | 38.15 | 56.26 |
20 | Codex-mini-latest | Proprietary | 29.71 | 22.96 | 34.68 | 55.29 |
21 | Hunyuan-T1-20250321 | Proprietary | 21.92 | 15.57 | 20.21 | 65.18 |
22 | Qwen3-235B-A22B-Thinking | Open Source | 18.11 | 9.42 | 15.60 | 77.81 |
23 | GPT-4.1-20250414 | Proprietary | 17.26 | 5.26 | 16.46 | 91.66 |
24 | Qwen3-235B-A22B | Open Source | 13.37 | 3.34 | 7.27 | 91.86 |
25 | o3-mini-20250131 | Proprietary | 13.23 | 3.67 | 3.91 | 98.57 |
26 | o3-20250416 | Proprietary | 10.22 | 0.36 | 0.36 | 98.91 |
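An aside on how Overall seems to be derived: the leaderboard numbers are consistent with a fixed weighted average, Overall ≈ 0.6·Security + 0.3·Quality + 0.1·Stability. This is my inference from the table, not a documented formula; a quick check on a few rows:

```python
# Sanity check (inferred weights, not a documented formula): Overall appears
# to equal 0.6*Security + 0.3*Quality + 0.1*Stability, up to rounding.
rows = [
    # (model, overall, security, quality, stability) -- sample rows from the table
    ("Claude-3.7-Sonnet-20250219",    63.01, 46.72, 91.58, 75.00),
    ("Qwen3-235B-A22B-Instruct-2507", 60.15, 48.03, 82.08, 67.08),
    ("Claude-Sonnet-4-20250514",      57.14, 34.78, 92.37, 85.65),
    ("o3-20250416",                   10.22,  0.36,  0.36, 98.91),
]
for model, overall, sec, qual, stab in rows:
    recomputed = 0.6 * sec + 0.3 * qual + 0.1 * stab
    assert abs(recomputed - overall) < 0.01, model  # matches to rounding error
    print(f"{model}: reported {overall:.2f}, recomputed {recomputed:.2f}")
```

If that inference holds, security carries the dominant weight, which makes the low Security scores below all the more notable.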
My notes:
- The highest security score, 48.03 out of 100.00, is no bueno.
- Also note a potential significant security regression between Claude 3.7 Sonnet and Claude 4 Sonnet (security score 46.72 → 34.78).
- GPT-5 wasn't tested.