OpenHands · adityasoni9998 · Nov 7, 2025
diff --git a/benchmarks/agentic_code_search/README.md b/benchmarks/agentic_code_search/README.md
@@ -0,0 +1,5 @@
+## Agentic Code Search
+
+Benchmarking code that evaluates LLMs on their ability to localize the code in a Python repository that must be edited to resolve an issue described in natural language.
+
+- NOTE: The JSONL file for the ground truth is prepared using [this code](https://github.com/adityasoni9998/LocAgent/blob/master/util/benchmark/gen_oracle_locations.py).
diff --git a/benchmarks/agentic_code_search/__init__.py b/benchmarks/agentic_code_search/__init__.py
diff --git a/benchmarks/agentic_code_search/eval_infer.py b/benchmarks/agentic_code_search/eval_infer.py
@@ -0,0 +1,49 @@
+import json
+from argparse import ArgumentParser
+
+
+def main(args):
+    """Print average F1 scores and run statistics for a results JSONL file."""
+    results_file = args.results_file
+    f1_file = 0
+    f1_function = 0
+    f1_module = 0
+    num_steps = 0
+    num_tool_calls = 0
+    total_time = 0
+    cnt = 0
+    with open(results_file, "r") as f:
+        for line in f:
+            if not line.strip():
+                continue  # tolerate blank lines, e.g. a trailing newline
+            result = json.loads(line)
+            test_result = result["test_result"]
+            # Run statistics are optional; a missing key adds 0 to the sum.
+            num_steps += test_result.get("num_steps", 0)
+            num_tool_calls += test_result.get("num_tool_calls", 0)
+            total_time += test_result.get("wall_time_seconds", 0)
+
+            # A missing or null reward still counts as a sample and scores 0.
+            reward_dict = test_result.get("reward")
+            cnt += 1
+            if reward_dict is not None:
+                f1_file += reward_dict.get("file_reward", 0)
+                f1_module += reward_dict.get("module_reward", 0)
+                f1_function += reward_dict.get("entity_reward", 0)
+
+    if cnt == 0:
+        raise SystemExit(f"No results found in {results_file}")
+
+    print(f"Average File F1 score: {f1_file / cnt:.4f} over {cnt} samples")
+    print(f"Average Module F1 score: {f1_module / cnt:.4f} over {cnt} samples")
+    print(f"Average Function F1 score: {f1_function / cnt:.4f} over {cnt} samples")
+    print(f"Average # of steps: {num_steps / cnt:.4f} over {cnt} samples")
+    print(f"Average # of tool calls: {num_tool_calls / cnt:.4f} over {cnt} samples")
+    print(f"Average wall time (s): {total_time / cnt:.4f} over {cnt} samples")
+
+
+if __name__ == "__main__":
+    parser = ArgumentParser()
+    parser.add_argument("--results_file", type=str, required=True)
+    args = parser.parse_args()
+    main(args)
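Reviewer note, not part of the patch: the script reads one JSON object per line and uses only the keys shown in the diff (`test_result`, `num_steps`, `num_tool_calls`, `wall_time_seconds`, and `reward` with `file_reward`, `module_reward`, `entity_reward`). The sketch below writes two made-up records in that shape and averages the rewards the same way. The helper name `average_rewards` and all values are hypothetical.

```python
import json
import tempfile

# Hypothetical records; only the key names come from eval_infer.py.
records = [
    {
        "test_result": {
            "num_steps": 6,
            "num_tool_calls": 9,
            "wall_time_seconds": 41.5,
            "reward": {"file_reward": 1.0, "module_reward": 0.5, "entity_reward": 0.5},
        }
    },
    # A run with a null reward still counts as a sample and contributes 0.
    {"test_result": {"reward": None}},
]


def average_rewards(path):
    """Average each reward key over all records, treating a null reward as 0."""
    totals = {"file_reward": 0.0, "module_reward": 0.0, "entity_reward": 0.0}
    count = 0
    with open(path) as f:
        for line in f:
            reward = json.loads(line)["test_result"].get("reward") or {}
            count += 1
            for key in totals:
                totals[key] += reward.get(key, 0)
    if count == 0:
        return {}
    return {key: value / count for key, value in totals.items()}


with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")

averages = average_rewards(tmp.name)
print(averages)
```

Because the denominator is the total number of records, a run that produced no reward lowers every average instead of being skipped.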
diff --git a/benchmarks/agentic_code_search/prompts/file_module.j2 b/benchmarks/agentic_code_search/prompts/file_module.j2
@@ -0,0 +1,30 @@
+I have access to a Python code repository in the directory {{ working_dir }} . Consider the following issue description:
+
+<issue_description>
+{{ problem_statement }}
+</issue_description>
+
+Act as a code search agent and localize the specific files, classes, or functions that need modification to resolve the issue in <issue_description>.
+
+NOTE: You do not need to solve the issue; you only need to localize the relevant code in the repository. Your output will be used to guide another agent that solves the issue.
+
+Your final output should list the locations requiring modification, wrapped in triple backticks (```).
+Each location should include the file path, class name (if applicable), and function name. Here is an example output:
+```
+full_path1/file1.py
+class: MyClass1
+function: my_function1
+
+full_path2/file2.py
+function: MyClass2.my_function2
+
+full_path3/file3.py
+function: my_function3
+```
+
+IMPORTANT: Your output MUST follow the rules below:
+1. The final output must be returned in the message parameter of the Finish tool wrapped within ```, and there should be NO text outside these triple backticks (```).
+2. The locations of the file path must be RELATIVE to the {{ working_dir }} directory WITHOUT any leading "./" in the output.
+3. For each localized code output, you MUST always include the file path and the function name. If the function is within a class you MUST also include the class name.
+4. Only include those locations in your output that need modification to resolve the issue in <issue_description>. Do NOT include any locations that do not need modification.
+
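Reviewer note, not part of the patch: the answer format requested above (a file path line followed by optional `class:` and `function:` lines, all inside one triple-backtick block) can be parsed with a few lines of Python. This is a minimal sketch under that assumption; `Location` and `parse_locations` are hypothetical names, and the benchmark's actual parser is not shown in this diff.

```python
import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # triple backticks, built this way to keep the sketch readable


@dataclass
class Location:
    path: str
    classes: list = field(default_factory=list)
    functions: list = field(default_factory=list)


def parse_locations(message: str) -> list:
    """Parse the first fenced block of `message` into one Location per file path."""
    match = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, message, re.DOTALL)
    if match is None:
        return []  # no fenced answer found
    locations = []
    for raw in match.group(1).splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate entries
        if line.startswith("class:"):
            if locations:
                locations[-1].classes.append(line[len("class:"):].strip())
        elif line.startswith("function:"):
            if locations:
                locations[-1].functions.append(line[len("function:"):].strip())
        else:
            # Any other line is treated as a file path; drop a leading "./".
            path = line[2:] if line.startswith("./") else line
            locations.append(Location(path=path))
    return locations
```

A `class:` or `function:` line that appears before any file path is ignored here; a stricter parser might reject the whole answer instead.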
diff --git a/benchmarks/agentic_code_search/prompts/file_module_short.j2 b/benchmarks/agentic_code_search/prompts/file_module_short.j2
@@ -0,0 +1,16 @@
+I have access to a Python code repository in the directory {{ working_dir }} . Consider the following issue description:
+
+<issue_description>
+{{ problem_statement }}
+</issue_description>
+
+Act as a code search agent and localize the specific files, classes, or functions that need modification to resolve the issue in <issue_description>.
+
+NOTE: You do not need to solve the issue; you only need to localize the relevant code in the repository. Your output will be used to guide another agent that solves the issue.
+
+IMPORTANT: Your output MUST follow the rules below:
+1. The final output must be returned in the message parameter of the Finish tool wrapped within ```, and there should be NO text outside these triple backticks (```).
+2. The locations of the file path must be RELATIVE to the {{ working_dir }} directory WITHOUT any leading "./" in the output.
+3. For each localized code output, you MUST always include the file path and the function name. If the function is within a class you MUST also include the class name.
+4. Only include those locations in your output that need modification to resolve the issue in <issue_description>. Do NOT include any locations that do not need modification.
+
diff --git a/benchmarks/agentic_code_search/prompts/system_prompt.j2 b/benchmarks/agentic_code_search/prompts/system_prompt.j2
@@ -0,0 +1,92 @@
+You are a specialized code localization agent. Your sole objective is to identify and return the locations in the codebase (files, classes, and functions) that are relevant to the user's query.
+You are given access to the codebase in a Linux file system.
+
+## PRIMARY DIRECTIVE
+- Find relevant files, do NOT answer the user's query directly
+- Prioritize precision: every file you return should be relevant
+- You have up to 10 turns to explore and return your answer
+
+## TOOL USAGE REQUIREMENTS
+
+### bash tool (REQUIRED for search)
+- You MUST use the bash tool to search and explore the codebase
+- Execute bash commands like: rg, grep, find, ls, cat, head, tail, sed
+- Use parallel tool calls: invoke bash tool up to 5 times concurrently in a single turn
+- NEVER exceed 5 parallel tool calls per turn
+- Common patterns:
+  * `rg "pattern" -t py` - search for code patterns
+  * `rg --files | grep "keyword"` - find files by name
+  * `cat path/to/file.py` - read file contents
+  * `find . -name "*.py" -type f` - locate files by extension
+  * `wc -l path/to/file.py` - count lines in a file
+  * `sed -n '1,100p' path/to/file.py` - read lines 1-100 of a file
+  * `head -n 100 path/to/file.py` - read first 100 lines
+  * `tail -n 100 path/to/file.py` - read last 100 lines
+
+### Reading Files (CRITICAL for context management)
+- NEVER read entire large files with `cat` - this will blow up your context window
+- ALWAYS check file size first: `wc -l path/to/file.py`
+- For files > 100 lines, read in chunks:
+  * Use `sed -n '1,100p' file.py` to read lines 1-100
+  * Use `sed -n '101,200p' file.py` to read lines 101-200
+  * Continue with subsequent ranges as needed (201-300, 301-400, etc.)
+- Strategic reading approach:
+  * Read the first 50-100 lines to see imports and initial structure
+  * Use `rg` to find specific patterns and their line numbers
+  * Read targeted line ranges around matches using `sed -n 'START,ENDp'`
+  * Only read additional chunks if the initial sections are relevant
+
+### Final Answer Format (REQUIRED)
+- You MUST return your final answer in triple backticks ``` ... ```
+- Format: ```\nfull_path1/file1.py\nclass: MyClass1\nfunction: my_function1\n\nfull_path2/file2.py\nfunction: MyClass2.my_function2\n\nfull_path3/file3.py\nfunction: my_function3\n```
+- Put each file path on its own line, followed by its `class:` and `function:` lines
+- Use relative paths as they appear in the repository
+- DO NOT include any other text inside the backticks
+
+## SEARCH STRATEGY
+
+1. **Initial Exploration**: Cast a wide net
+   - Search for keywords, function names, class names
+   - Check file names and directory structure
+   - Use up to 3 parallel bash calls to explore multiple angles
+   - Check file sizes with `wc -l` before reading
+   - Read promising files in chunks (lines 1-100) to verify relevance
+
+2. **Deep Dive**: Follow the most promising leads
+   - Use up to 3 parallel bash calls to investigate further
+   - Read files in chunks to confirm they address the query
+   - Use `rg` with line numbers to locate specific code, then read those ranges
+   - Start eliminating false positives
+
+3. **Final Verification**: Confirm your file list
+   - Verify each candidate file is truly relevant
+   - Ensure you haven't missed related files
+   - Return your answer in backticks ``` ... ```
+
+## CRITICAL RULES
+- NEVER exceed 5 parallel bash tool calls in a single turn
+- NEVER respond without wrapping your file list in backticks ```
+- ALWAYS use bash tool to search (do not guess file locations)
+- NEVER read entire large files - always read in chunks (100-line ranges)
+- Check file size with `wc -l` before reading
+- Read file contents in chunks to verify relevance before including them
+- Return file paths as they appear in the repository. Do not begin the path with "./"
+- Aim for high precision (all files relevant) and high recall (no relevant files missed)
+
+## EXAMPLE OUTPUT
+
+After exploring the codebase, return your answer like this:
+
+Your final output should list the locations requiring modification, wrapped in triple backticks (```).
+Each location should include the file path, class name (if applicable), and function name. Here is an example output:
+```
+full_path1/file1.py
+class: MyClass1
+function: my_function1
+
+full_path2/file2.py
+function: MyClass2.my_function2
+
+full_path3/file3.py
+function: my_function3
+```
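Reviewer note, not part of the patch: the system prompt asks for both high precision and high recall, and `eval_infer.py` reports the averaged rewards as F1 scores. The reward computation itself is not in this diff, so the sketch below only assumes a standard set-based F1 between predicted and ground-truth locations. The function name `f1_score` and the choice to return 0.0 when either set is empty are assumptions.

```python
def f1_score(predicted: set, gold: set) -> float:
    """Set-based F1: harmonic mean of precision and recall over exact matches."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers an empty prediction or empty ground truth
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One correct file plus one spurious file: precision 0.5, recall 1.0.
score = f1_score({"src/a.py", "src/b.py"}, {"src/a.py"})
print(round(score, 4))
```

Under this definition, returning extra files lowers precision and missing files lowers recall, which matches the prompt's instruction to list only the locations that need modification.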