🛡️SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories

1University of Maryland, 2Google DeepMind

Overview

SecRepoBench is a repository-level secure code completion benchmark. It contains 318 code completion tasks drawn from 27 popular GitHub C/C++ repositories and covering 15 CWEs. The benchmark can evaluate both standalone LLMs paired with a context retriever and agent frameworks with access to the entire repository, giving a comprehensive assessment of different code generation paradigms. SecRepoBench targets the code completion task, in which developers use LLMs to complete code within a partially implemented feature inside a codebase. Compared to traditional software engineering tasks such as feature addition or vulnerability patching, this setting presents unique challenges: the model must understand the pre-existing code context rather than build from scratch, and it must ensure both functional correctness and security simultaneously within the security-sensitive region.
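
To make the task structure concrete, one SecRepoBench task can be pictured as a record like the sketch below. The field names are our own illustration, not the benchmark's actual data format.

from dataclasses import dataclass, field

@dataclass
class CompletionTask:
    """Hypothetical view of a single SecRepoBench task (field names are illustrative)."""
    repo: str                        # one of the 27 GitHub C/C++ repositories
    target_file: str                 # file containing the partially implemented function
    masked_span: tuple[int, int]     # region of the target function the model must fill in
    cwe_id: str                      # CWE associated with the underlying vulnerability, e.g. "CWE-787"
    unit_tests: list[str] = field(default_factory=list)  # developer-written tests used for correctness
    poc_input: str = ""              # OSS-Fuzz Proof-of-Concept input used for the security check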

Leaderboard

# Setup secure-pass@1 (%) pass@1 (%) secure (%)
1 OpenHands + OpenAI o3 53.5 76.4 67.9
2 OpenHands + GPT-5 51.3 79.6 66.0
3 Aider + GPT-5 50.6 69.5 66.0
4 Claude Code + Claude Sonnet 4.5 50.3 74.5 66.7
5 Aider + Claude Sonnet 4.5 45.6 56.6 65.4
6 OpenHands + Claude Sonnet 4.5 43.7 73.9 60.4
7 Aider + Claude Sonnet 4 43.4 64.8 60.4
8 Aider + OpenAI o3 43.4 60.7 61.6
9 OpenHands + Claude Sonnet 4 42.8 66.4 66.0
10 Aider + GPT-4.1 39.0 59.1 53.1
11 Aider + OpenAI o4-mini 39.0 58.2 55.0
12 OpenHands + OpenAI o4-mini 37.1 59.1 53.5
13 OpenHands + GPT-4.1 29.3 50.0 49.7
14 GPT-5 39.3 54.1 61.3
15 OpenAI o3 32.4 47.5 51.9
16 Claude Sonnet 4.5 31.1 52.2 46.2
17 Claude Sonnet 4 30.2 48.4 49.7
18 Claude Sonnet 3.7 28.0 40.3 40.9
19 GPT-4.1 27.7 43.4 42.5
20 OpenAI o4-mini 24.5 36.8 38.1
21 DeepSeek-R1 23.9 34.3 42.8
22 OpenAI o1 23.6 38.4 42.1
23 Qwen3-Coder 23.0 41.2 36.8
24 DeepSeek-V3 22.6 39.6 35.2
25 GPT-4o New 22.0 34.9 37.1
26 gpt-oss-120b 21.4 34.0 36.2
27 OpenAI o3-mini 21.4 33.0 35.8
28 Claude Sonnet 3.5 20.1 36.5 34.3
29 GPT-4o 19.5 33.0 39.6
30 Gemini 1.5 Pro 18.9 33.0 28.9
31 Llama 4 Maverick 16.7 26.7 29.6
32 Qwen3 235B 16.4 27.4 31.1
33 Gemini 2.0 Flash 15.4 27.7 28.0
34 Gemini 1.5 Flash 14.2 23.0 26.7
35 GPT-4o mini 13.8 25.5 28.0
36 Llama 3.1 70B 13.5 23.6 23.3
37 Qwen2.5-Coder 13.5 25.2 28.9
38 Claude Haiku 3 11.6 22.0 23.0
39 DeepSeek-Coder-V2-Lite-Instruct 8.8 13.5 19.8
40 Mistral NeMo 6.9 12.3 15.7
41 Llama 3.1 8B 5.0 9.8 9.4
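
As a reading aid for the three score columns, the sketch below shows how they could be computed from per-task outcomes. This is our interpretation of the column names, not the benchmark's official scoring code; the TaskOutcome class and its fields are hypothetical.

from dataclasses import dataclass

@dataclass
class TaskOutcome:
    """Hypothetical per-task result for a single generated completion."""
    correct: bool   # passes all unit tests that the ground-truth secure code passes
    secure: bool    # does not crash on the OSS-Fuzz PoC input

def leaderboard_scores(outcomes: list[TaskOutcome]) -> dict[str, float]:
    n = len(outcomes)
    # secure-pass@1: the completion is both functionally correct and secure
    secure_pass = sum(o.correct and o.secure for o in outcomes) / n
    # pass@1: the completion is functionally correct (security not considered)
    passed = sum(o.correct for o in outcomes) / n
    # secure (%): the completion is secure (correctness not considered) -- assumed reading
    secure = sum(o.secure for o in outcomes) / n
    return {
        "secure-pass@1 (%)": 100 * secure_pass,
        "pass@1 (%)": 100 * passed,
        "secure (%)": 100 * secure,
    }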

Framework

Figure: SecRepoBench framework overview.

Each code completion task provides two inputs: a target function with a masked region and the entire repository as context. These are given either to a standalone LLM with a context retriever or to an agent framework, which then generates code to fill the masked region. The generated code is compiled with the full repository and evaluated along two dimensions: correctness, using developer-written unit tests, and security, using Proof-of-Concept exploits from OSS-Fuzz.
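
The per-task flow described above can be summarized as in the sketch below. The helper callables are placeholders for the generation, build, test, and exploit-replay steps; they are not part of any actual SecRepoBench API.

from typing import Callable

def evaluate_task(
    task,
    generate: Callable,         # standalone LLM + retriever, or an agent with repository access
    apply_and_build: Callable,  # splice the completion into the masked region and compile the repo
    run_unit_tests: Callable,   # correctness check (see "Correctness" below)
    replay_poc: Callable,       # security check (see "Security" below)
) -> dict:
    """Rough sketch of one evaluation round; helper names are illustrative."""
    completion = generate(task)
    build_ok = apply_and_build(task, completion)
    if not build_ok:
        # Assumption: a completion that does not compile counts as neither correct nor secure.
        return {"correct": False, "secure": False}
    return {
        "correct": run_unit_tests(task),
        "secure": replay_poc(task),
    }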

Evaluation

SecRepoBench evaluates generated code along two dimensions: correctness and security.

Correctness. We require each task to have at least one relevant unit test in its developer-written test suite, i.e., a test that calls the target function directly or indirectly and that passes with the ground-truth secure code (the developer-patched code). SecRepoBench considers a code completion functionally correct if it passes all unit tests that the ground-truth secure code passes, including the relevant ones; otherwise, the completion is considered incorrect.
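
A minimal sketch of this criterion, assuming we already know which test names pass under the ground-truth patch and which pass under the generated completion (how the developer test suites are discovered and executed is project-specific):

def is_functionally_correct(passed_by_ground_truth: set[str],
                            passed_by_completion: set[str]) -> bool:
    """A completion is correct only if every unit test that the developer-patched
    (ground-truth) code passes, including the relevant ones that exercise the
    target function, also passes with the generated code."""
    return passed_by_ground_truth <= passed_by_completion

# Example: the completion breaks a test that the ground truth passes -> incorrect.
assert not is_functionally_correct({"test_parse", "test_bounds"}, {"test_parse"})
assert is_functionally_correct({"test_parse"}, {"test_parse", "test_extra"})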

Security. Each task has a Proof-of-Concept (PoC) exploit from OSS-Fuzz that crashes the project if it contains the underlying vulnerability. We compile the project with the generated code completion and execute it on the PoC input. SecRepoBench considers a code completion secure if the project does not crash, and vulnerable otherwise.
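
A minimal sketch of the crash check, assuming the project has been compiled into a harness binary that takes the PoC file as its argument; the exact build and invocation details are assumptions, but OSS-Fuzz-style harnesses built with sanitizers report memory-safety violations through an abnormal exit.

import subprocess

def survives_poc(harness_binary: str, poc_path: str, timeout_s: int = 60) -> bool:
    """Run the compiled project on the Proof-of-Concept input.
    A nonzero return code (e.g., a sanitizer abort, or a fatal signal reported
    by subprocess as a negative code) is treated as a crash, i.e. vulnerable."""
    try:
        result = subprocess.run([harness_binary, poc_path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Hangs are ambiguous; this sketch conservatively treats them as not secure (assumption).
        return False
    return result.returncode == 0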

BibTeX

@article{shen2025secrepobench,
    title={SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories},
    author={Shen, Chihao and Dilgren, Connor and Chiniya, Purva and Griffith, Luke and Ding, Yu and Chen, Yizheng},
    journal={arXiv preprint arXiv:2504.21205},
    year={2025}
}