Real-world software engineering demands the ability to analyze large-scale GitHub repositories, where understanding the complex interactions between variables, functions, and classes spread across multiple files is essential. With the growing reliance on software engineering (SWE) agents for automated development and maintenance, it is crucial that such models demonstrate strong capabilities in understanding and resolving issues within large and complex codebases. In this work, we benchmark existing SWE agents (Mini-SWE-Agent, CodeActAgent, and Qwen3-Coder) with enterprise (GPT-5) and open-source LLMs on issue-resolution tasks from the SWEBench dataset. We stress-test the models on novel metrics, namely Null Resolution, Fault-Line IoU, ETS@x, and Error-Resolution@x, which process SWEBench in creative ways to quantify the versatility of SWE agents. We provide the code for UniBench, our unified benchmarking harness for SWE agents and code LLMs, at Vaibhavi1707/unibench_code.
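
The sketch below illustrates one of the metrics named above, assuming Fault-Line IoU is the intersection-over-union between the (file, line) locations edited by a candidate patch and those edited by the SWEBench gold patch. The helper names (`parse_touched_lines`, `fault_line_iou`) are illustrative, not the UniBench API.

```python
# Minimal sketch, assuming Fault-Line IoU compares the sets of (file, line)
# locations edited by the candidate and gold patches. Not the UniBench API.
import re
from typing import Optional, Set, Tuple


def parse_touched_lines(patch: str) -> Set[Tuple[str, int]]:
    """Collect (file, new-file line number) pairs added or changed by a unified diff."""
    touched: Set[Tuple[str, int]] = set()
    current_file: Optional[str] = None
    line_no = 0
    for line in patch.splitlines():
        if line.startswith("+++"):
            # "+++ b/path" names the post-patch file; "+++ /dev/null" means a deleted file.
            current_file = line[len("+++ b/"):] if line.startswith("+++ b/") else None
        elif line.startswith("@@"):
            # Hunk header "@@ -a,b +c,d @@": new-file numbering starts at c.
            match = re.search(r"\+(\d+)", line)
            line_no = int(match.group(1)) if match else 0
        elif current_file is None or line.startswith(("-", "\\")):
            continue  # deletions and "\ No newline" markers do not advance the new file
        elif line.startswith("+"):
            touched.add((current_file, line_no))  # added or modified line
            line_no += 1
        else:
            line_no += 1  # context line
    return touched


def fault_line_iou(candidate_patch: str, gold_patch: str) -> float:
    """IoU between the line locations touched by the candidate and gold patches."""
    pred = parse_touched_lines(candidate_patch)
    gold = parse_touched_lines(gold_patch)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0
```

Under this reading, `fault_line_iou(model_patch, gold_patch)` returns a value in [0, 1], where 1.0 means the candidate edits exactly the gold fault lines.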