The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate: they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of the input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context such as build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability.

Our evaluation of leading LLMs on A.S.E reveals three key findings:

- Claude-3.7-Sonnet achieves the best overall performance.
- The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score.
- Concise, “fast-thinking” decoding strategies consistently outperform complex, “slow-thinking” reasoning for security patching.
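The framework itself isn't reproduced here, but to make the "containerized, expert-rule" idea concrete, below is a minimal sketch of what such a check can look like. Everything specific in it (Semgrep as the analyzer, the image tag, the rule file path, the pass/fail criterion) is an illustrative assumption of mine, not A.S.E's actual pipeline:

```python
import json
import subprocess

# Hypothetical sketch only: A.S.E's real analyzer, image, rule format, and
# scoring are not specified above; Semgrep and all paths here are assumptions.
def scan_repo(repo_dir: str, rules_file: str) -> dict:
    """Run an expert-rule security scan inside a pinned container image so
    every candidate patch is judged by an identical toolchain (reproducibility)."""
    proc = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{repo_dir}:/src:ro",           # candidate repo, mounted read-only
            "-v", f"{rules_file}:/rules.yaml:ro",  # expert-defined rules
            "semgrep/semgrep:1.50.0",              # pinned analyzer version (assumption)
            "semgrep", "scan", "--config", "/rules.yaml", "--json", "/src",
        ],
        capture_output=True,
        text=True,
    )
    findings = json.loads(proc.stdout)["results"]  # one entry per rule match
    return {"secure": not findings, "n_findings": len(findings)}
```

Pinning the image version is what buys the stable, auditable verdicts the paper emphasizes: same rules, same analyzer, same result on re-runs.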
Leaderboard (all scores on a 0–100 scale):
Rank | Model | License | Overall | Security | Quality | Stability |
---|---|---|---|---|---|---|
1 | Claude-3.7-Sonnet-20250219 | Proprietary | 63.01 | 46.72 | 91.58 | 75.00 |
2 | Claude-3.7-Sonnet-Thinking-20250219 | Proprietary | 61.04 | 44.65 | 89.85 | 72.92 |
3 | Qwen3-235B-A22B-Instruct-2507 | Open Source | 60.15 | 48.03 | 82.08 | 67.08 |
4 | Qwen3-Coder | Open Source | 59.31 | 42.69 | 85.16 | 81.54 |
5 | DeepSeek-V3-20250324 | Open Source | 58.59 | 40.89 | 85.87 | 82.94 |
6 | Claude-Sonnet-4-20250514 | Proprietary | 57.14 | 34.78 | 92.37 | 85.65 |
7 | Kimi-K2-20250711-Preview | Open Source | 55.29 | 37.82 | 79.90 | 86.25 |
8 | GPT-4o-20241120 | Proprietary | 55.10 | 45.65 | 72.46 | 59.67 |
9 | Qwen-Coder-Plus-20241106 | Proprietary | 53.55 | 37.98 | 73.78 | 86.27 |
10 | Claude-Opus-4-20250514 | Proprietary | 52.71 | 31.95 | 85.82 | 77.91 |
11 | Grok-3 | Proprietary | 52.18 | 38.64 | 73.54 | 69.41 |
12 | DeepSeek-R1-20250528 | Open Source | 51.76 | 38.01 | 74.39 | 66.38 |
13 | Gemini-2.5-Pro-Exp-20250325 | Proprietary | 51.02 | 29.98 | 84.04 | 78.21 |
14 | Claude-Sonnet-4-Thinking-20250514 | Proprietary | 50.92 | 34.10 | 76.81 | 74.22 |
15 | Claude-Opus-4-Thinking-20250514 | Proprietary | 50.17 | 30.70 | 79.84 | 77.98 |
16 | GLM-4.5 | Open Source | 49.80 | 35.92 | 70.24 | 71.74 |
17 | Grok-4 | Proprietary | 42.40 | 29.53 | 59.78 | 67.42 |
18 | o4-mini-20250416 | Proprietary | 41.35 | 27.87 | 60.74 | 64.07 |
19 | Grok-3-mini | Proprietary | 30.49 | 22.37 | 38.15 | 56.26 |
20 | Codex-mini-latest | Proprietary | 29.71 | 22.96 | 34.68 | 55.29 |
21 | Hunyuan-T1-20250321 | Proprietary | 21.92 | 15.57 | 20.21 | 65.18 |
22 | Qwen3-235B-A22B-Thinking | Open Source | 18.11 | 9.42 | 15.60 | 77.81 |
23 | GPT-4.1-20250414 | Proprietary | 17.26 | 5.26 | 16.46 | 91.66 |
24 | Qwen3-235B-A22B | Open Source | 13.37 | 3.34 | 7.27 | 91.86 |
25 | o3-mini-20250131 | Proprietary | 13.23 | 3.67 | 3.91 | 98.57 |
26 | o3-20250416 | Proprietary | 10.22 | 0.36 | 0.36 | 98.91 |
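An aside on how Overall seems to be derived: the leaderboard numbers are consistent with a fixed weighted average, Overall ≈ 0.6·Security + 0.3·Quality + 0.1·Stability. This is my inference from the table, not a documented formula; a quick check on a few rows:

```python
# Sanity check (inferred weights, not a documented formula): Overall appears
# to equal 0.6*Security + 0.3*Quality + 0.1*Stability, up to rounding.
rows = [
    # (model, overall, security, quality, stability) -- sample rows from the table
    ("Claude-3.7-Sonnet-20250219",    63.01, 46.72, 91.58, 75.00),
    ("Qwen3-235B-A22B-Instruct-2507", 60.15, 48.03, 82.08, 67.08),
    ("Claude-Sonnet-4-20250514",      57.14, 34.78, 92.37, 85.65),
    ("o3-20250416",                   10.22,  0.36,  0.36, 98.91),
]
for model, overall, sec, qual, stab in rows:
    recomputed = 0.6 * sec + 0.3 * qual + 0.1 * stab
    assert abs(recomputed - overall) < 0.01, model  # matches to rounding error
    print(f"{model}: reported {overall:.2f}, recomputed {recomputed:.2f}")
```

If that inference holds, security carries the dominant weight, which makes the low Security scores below all the more notable.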
My notes:
- The highest security score, 48.03 out of 100.00, is no bueno.
- Also note a potential significant security regression between Claude 3.7 Sonnet and Claude 4 Sonnet (security score 46.72 → 34.78).
- GPT-5 wasn't tested.