🛡️SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories

1University of Maryland, 2Google DeepMind

Overview

SecRepoBench is a repository-level secure code completion benchmark. It contains 318 code completion tasks drawn from 27 popular GitHub C/C++ repositories and covering 15 CWEs. The benchmark can evaluate both standalone LLMs paired with a context retriever and agent frameworks with access to the entire repository, giving a comprehensive assessment of different code generation paradigms. SecRepoBench targets the code completion task, in which developers use LLMs to complete code within a partially implemented feature inside a codebase. Compared to traditional software engineering tasks such as feature addition or vulnerability patching, this setting presents unique challenges: the model must understand the pre-existing code context rather than build from scratch, and it must ensure both functional correctness and security simultaneously within the security-sensitive region.
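
To make the task structure concrete, one SecRepoBench task can be pictured as a record like the sketch below. The field names are our own illustration, not the benchmark's actual data format.

from dataclasses import dataclass, field

@dataclass
class CompletionTask:
    """Hypothetical view of a single SecRepoBench task (field names are illustrative)."""
    repo: str                        # one of the 27 GitHub C/C++ repositories
    target_file: str                 # file containing the partially implemented function
    masked_span: tuple[int, int]     # region of the target function the model must fill in
    cwe_id: str                      # CWE associated with the underlying vulnerability, e.g. "CWE-787"
    unit_tests: list[str] = field(default_factory=list)  # developer-written tests used for correctness
    poc_input: str = ""              # OSS-Fuzz Proof-of-Concept input used for the security check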

Leaderboard

# Setup secure-pass@1 (%) pass@1 (%) secure (%)
1 OpenHands + OpenAI o3 53.5 76.4 67.9
2 OpenHands + GPT-5 51.3 79.6 66.0
3 Aider + GPT-5 50.6 69.5 66.0
4 Claude Code + Claude Sonnet 4.5 50.3 74.5 66.7
5 Aider + Claude Sonnet 4.5 45.6 56.6 65.4
6 OpenHands + Claude Sonnet 4.5 43.7 73.9 60.4
7 Aider + Claude Sonnet 4 43.4 64.8 60.4
8 Aider + OpenAI o3 43.4 60.7 61.6
9 OpenHands + Claude Sonnet 4 42.8 66.4 66.0
10 Aider + GPT-4.1 39.0 59.1 53.1
11 Aider + OpenAI o4-mini 39.0 58.2 55.0
12 OpenHands + OpenAI o4-mini 37.1 59.1 53.5
13 OpenHands + GPT-4.1 29.3 50.0 49.7
14 GPT-5 39.3 54.1 61.3
15 OpenAI o3 32.4 47.5 51.9
16 Claude Sonnet 4.5 31.1 52.2 46.2
17 Claude Sonnet 4 30.2 48.4 49.7
18 Claude Sonnet 3.7 28.0 40.3 40.9
19 GPT-4.1 27.7 43.4 42.5
20 OpenAI o4-mini 24.5 36.8 38.1
21 DeepSeek-R1 23.9 34.3 42.8
22 OpenAI o1 23.6 38.4 42.1
23 Qwen3-Coder 23.0 41.2 36.8
24 DeepSeek-V3 22.6 39.6 35.2
25 GPT-4o New 22.0 34.9 37.1
26 gpt-oss-120b 21.4 34.0 36.2
27 OpenAI o3-mini 21.4 33.0 35.8
28 Claude Sonnet 3.5 20.1 36.5 34.3
29 GPT-4o 19.5 33.0 39.6
30 Gemini 1.5 Pro 18.9 33.0 28.9
31 Llama 4 Maverick 16.7 26.7 29.6
32 Qwen3 235B 16.4 27.4 31.1
33 Gemini 2.0 Flash 15.4 27.7 28.0
34 Gemini 1.5 Flash 14.2 23.0 26.7
35 GPT-4o mini 13.8 25.5 28.0
36 Llama 3.1 70B 13.5 23.6 23.3
37 Qwen2.5-Coder 13.5 25.2 28.9
38 Claude Haiku 3 11.6 22.0 23.0
39 DeepSeek-Coder-V2-Lite-Instruct 8.8 13.5 19.8
40 Mistral NeMo 6.9 12.3 15.7
41 Llama 3.1 8B 5.0 9.8 9.4
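
As a reading aid for the three score columns, the sketch below shows how they could be computed from per-task outcomes. This is our interpretation of the column names, not the benchmark's official scoring code; the TaskOutcome class and its fields are hypothetical.

from dataclasses import dataclass

@dataclass
class TaskOutcome:
    """Hypothetical per-task result for a single generated completion."""
    correct: bool   # passes all unit tests that the ground-truth secure code passes
    secure: bool    # does not crash on the OSS-Fuzz PoC input

def leaderboard_scores(outcomes: list[TaskOutcome]) -> dict[str, float]:
    n = len(outcomes)
    # secure-pass@1: the completion is both functionally correct and secure
    secure_pass = sum(o.correct and o.secure for o in outcomes) / n
    # pass@1: the completion is functionally correct (security not considered)
    passed = sum(o.correct for o in outcomes) / n
    # secure (%): the completion is secure (correctness not considered) -- assumed reading
    secure = sum(o.secure for o in outcomes) / n
    return {
        "secure-pass@1 (%)": 100 * secure_pass,
        "pass@1 (%)": 100 * passed,
        "secure (%)": 100 * secure,
    }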

Framework

Figure: SecRepoBench framework overview.

Each code completion task provides two inputs: a target function with a masked region and the entire repository as context. These are given either to a standalone LLM with a context retriever or to an agent framework, which then generates code to fill the masked region. The generated code is compiled with the full repository and evaluated along two dimensions: correctness, using developer-written unit tests, and security, using Proof-of-Concept exploits from OSS-Fuzz.
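
The per-task flow described above can be summarized as in the sketch below. The helper callables are placeholders for the generation, build, test, and exploit-replay steps; they are not part of any actual SecRepoBench API.

from typing import Callable

def evaluate_task(
    task,
    generate: Callable,         # standalone LLM + retriever, or an agent with repository access
    apply_and_build: Callable,  # splice the completion into the masked region and compile the repo
    run_unit_tests: Callable,   # correctness check (see "Correctness" below)
    replay_poc: Callable,       # security check (see "Security" below)
) -> dict:
    """Rough sketch of one evaluation round; helper names are illustrative."""
    completion = generate(task)
    build_ok = apply_and_build(task, completion)
    if not build_ok:
        # Assumption: a completion that does not compile counts as neither correct nor secure.
        return {"correct": False, "secure": False}
    return {
        "correct": run_unit_tests(task),
        "secure": replay_poc(task),
    }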

Evaluation

SecRepoBench evaluates generated code along two dimensions: correctness and security.

Correctness. We require each task to have at least one relevant unit test in its developer-written test suite, i.e., a test that calls the target function directly or indirectly and that passes with the ground-truth secure code (the developer-patched code). SecRepoBench considers a code completion functionally correct if it passes all unit tests that the ground-truth secure code passes, including the relevant ones; otherwise, the completion is considered incorrect.
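
A minimal sketch of this criterion, assuming we already know which test names pass under the ground-truth patch and which pass under the generated completion (how the developer test suites are discovered and executed is project-specific):

def is_functionally_correct(passed_by_ground_truth: set[str],
                            passed_by_completion: set[str]) -> bool:
    """A completion is correct only if every unit test that the developer-patched
    (ground-truth) code passes, including the relevant ones that exercise the
    target function, also passes with the generated code."""
    return passed_by_ground_truth <= passed_by_completion

# Example: the completion breaks a test that the ground truth passes -> incorrect.
assert not is_functionally_correct({"test_parse", "test_bounds"}, {"test_parse"})
assert is_functionally_correct({"test_parse"}, {"test_parse", "test_extra"})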

Security. Each task has a Proof-of-Concept (PoC) exploit from OSS-Fuzz that crashes the project if it contains the underlying vulnerability. We compile the project with the generated code completion and execute it on the PoC input. SecRepoBench considers a code completion secure if the project does not crash, and vulnerable otherwise.
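
A minimal sketch of the crash check, assuming the project has been compiled into a harness binary that takes the PoC file as its argument; the exact build and invocation details are assumptions, but OSS-Fuzz-style harnesses built with sanitizers report memory-safety violations through an abnormal exit.

import subprocess

def survives_poc(harness_binary: str, poc_path: str, timeout_s: int = 60) -> bool:
    """Run the compiled project on the Proof-of-Concept input.
    A nonzero return code (e.g., a sanitizer abort, or a fatal signal reported
    by subprocess as a negative code) is treated as a crash, i.e. vulnerable."""
    try:
        result = subprocess.run([harness_binary, poc_path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Hangs are ambiguous; this sketch conservatively treats them as not secure (assumption).
        return False
    return result.returncode == 0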

BibTeX

@article{shen2025secrepobench,
    title={SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories},
    author={Shen, Chihao and Dilgren, Connor and Chiniya, Purva and Griffith, Luke and Ding, Yu and Chen, Yizheng},
    journal={arXiv preprint arXiv:2504.21205},
    year={2025}
}