24 commits
055e4e6
Update software-agent-sdk submodule to main
openhands-agent Nov 7, 2025
a6ec978
initial commit, eval for code search
openhands-agent Nov 7, 2025
36fa267
Num runs should be managed by the user externally
adityasoni9998 Nov 7, 2025
7d3d360
Update software-agent-sdk submodule to main
adityasoni9998 Nov 7, 2025
5bf46dd
docker works
adityasoni9998 Nov 7, 2025
1fc3cac
example config for qwen3
adityasoni9998 Nov 7, 2025
5f74f63
local runtime works
adityasoni9998 Nov 7, 2025
5e2820d
use host network in agent sdk
adityasoni9998 Nov 9, 2025
bfe182a
add eval
adityasoni9998 Nov 10, 2025
72ef6ff
add eval
adityasoni9998 Nov 10, 2025
b891149
add analysis code
adityasoni9998 Nov 10, 2025
479c081
module-level rewards
adityasoni9998 Dec 4, 2025
86957d8
fine-grained rewards eval
adityasoni9998 Dec 8, 2025
fe75fb2
fine-grained rewards
adityasoni9998 Dec 8, 2025
64bb3ee
docker doesn't work but local does
adityasoni9998 Dec 8, 2025
db8e7bb
update README
adityasoni9998 Dec 8, 2025
6b92366
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 22, 2025
6d52715
revert to only allow local workspace in agentic code search
adityasoni9998 Dec 22, 2025
76b4a01
minor code bug fix
adityasoni9998 Dec 22, 2025
dea232c
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 29, 2025
a417dc6
Update software-agent-sdk submodule to match trainer
adityasoni9998 Dec 29, 2025
11ea94e
update parser config
adityasoni9998 Dec 29, 2025
7730bac
add dataset
adityasoni9998 Dec 29, 2025
160f527
Merge main into agentic_code_search and fix CI issues
openhands-agent Jan 8, 2026
5 changes: 5 additions & 0 deletions benchmarks/agentic_code_search/README.md
@@ -0,0 +1,5 @@
## Agentic Code Search

Benchmarking code that evaluates how well LLMs can localize the code in a Python repository that must be edited to resolve an issue described in natural language.

- NOTE: The JSONL file for the ground truth is prepared using [this code](https://github.com/adityasoni9998/LocAgent/blob/master/util/benchmark/gen_oracle_locations.py).
Empty file.
44 changes: 44 additions & 0 deletions benchmarks/agentic_code_search/eval_infer.py
@@ -0,0 +1,44 @@
import json
from argparse import ArgumentParser


def main(args):
    results_file = args.results_file
    # Running sums; averages are taken over all samples at the end.
    f1_file = 0
    f1_function = 0
    f1_module = 0
    num_steps = 0
    num_tool_calls = 0
    total_time = 0
    cnt = 0
    with open(results_file, "r") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            result = json.loads(line)
            test_result = result["test_result"]
            if "num_steps" in test_result:
                num_steps += test_result["num_steps"]
            if "num_tool_calls" in test_result:
                num_tool_calls += test_result["num_tool_calls"]
            if "wall_time_seconds" in test_result:
                total_time += test_result["wall_time_seconds"]

            # A missing or null reward still counts as a sample (score 0).
            reward_dict = test_result.get("reward")
            cnt += 1
            if reward_dict is not None:
                f1_file += reward_dict.get("file_reward", 0)
                f1_module += reward_dict.get("module_reward", 0)
                f1_function += reward_dict.get("entity_reward", 0)

    if cnt == 0:
        # Avoid dividing by zero on an empty results file.
        print(f"No results found in {results_file}")
        return

    print(f"Average File F1 score: {f1_file / cnt:.4f} over {cnt} samples")
    print(f"Average Module F1 score: {f1_module / cnt:.4f} over {cnt} samples")
    print(f"Average Function F1 score: {f1_function / cnt:.4f} over {cnt} samples")
    print(f"Average # of steps: {num_steps / cnt:.4f} over {cnt} samples")
    print(f"Average # of tool calls: {num_tool_calls / cnt:.4f} over {cnt} samples")
    print(f"Average wall time (s): {total_time / cnt:.4f} over {cnt} samples")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--results_file", type=str, required=True)
    args = parser.parse_args()
    main(args)
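
For reference, `eval_infer.py` expects one JSON object per line of the results file. The sketch below restates the same aggregation on two in-memory records. The sample values and the `summarize` helper are illustrative assumptions, not part of this PR; only the key names (`test_result`, `reward`, `file_reward`, `module_reward`, `entity_reward`, `num_steps`) are taken from the script.

```python
import json

# Hypothetical sample records; only the key names come from eval_infer.py.
SAMPLE_LINES = [
    json.dumps(
        {
            "test_result": {
                "reward": {"file_reward": 1.0, "module_reward": 0.5, "entity_reward": 0.0},
                "num_steps": 4,
            }
        }
    ),
    json.dumps({"test_result": {"reward": None, "num_steps": 6}}),
]


def summarize(lines):
    """Average each reward over all samples, counting a missing reward as 0."""
    totals = {"file_reward": 0.0, "module_reward": 0.0, "entity_reward": 0.0}
    steps = 0
    count = 0
    for line in lines:
        test_result = json.loads(line)["test_result"]
        reward = test_result.get("reward") or {}
        for key in totals:
            totals[key] += reward.get(key, 0)
        steps += test_result.get("num_steps", 0)
        count += 1
    if count == 0:
        return {}
    averages = {key: value / count for key, value in totals.items()}
    averages["num_steps"] = steps / count
    return averages


print(summarize(SAMPLE_LINES))
```

A sample whose `reward` is null still counts in the denominator, as in the script, so failed runs pull the averages down rather than being silently dropped.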
30 changes: 30 additions & 0 deletions benchmarks/agentic_code_search/prompts/file_module.j2
@@ -0,0 +1,30 @@
I have access to a Python code repository in the directory {{ working_dir }}. Consider the following issue description:

<issue_description>
{{ problem_statement }}
</issue_description>

Act as a code search agent and localize the specific files, classes or functions of code that need modification to resolve the issue in <issue_description>.

NOTE: You do not need to solve the issue; you only need to localize the relevant code in the repository. Your output will be used to guide another agent to solve the issue.

Your final output should list the locations requiring modification, wrapped in triple backticks (```).
Each location should include the file path, class name (if applicable), and function name. Here is an example output:
```
full_path1/file1.py
class: MyClass1
function: my_function1

full_path2/file2.py
function: MyClass2.my_function2

full_path3/file3.py
function: my_function3
```

IMPORTANT: Your output MUST follow the rules below:
1. The final output must be returned in the message parameter of the Finish tool wrapped within ```, and there should be NO text outside these triple backticks (```).
2. File paths must be RELATIVE to the {{ working_dir }} directory, WITHOUT any leading "./".
3. For each localized code output, you MUST always include the file path and the function name. If the function is within a class you MUST also include the class name.
4. Only include those locations in your output that need modification to resolve the issue in <issue_description>. Do NOT include any locations that do not need modification.
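
The output format requested by this prompt is simple enough to parse line by line. The sketch below shows one way a harness might turn the agent's final message into structured locations. The `parse_locations` helper is hypothetical and is not the parser used in this PR, which is not shown here. It assumes every non-empty line that is not a `class:` or `function:` line is a file path.

```python
FENCE = "`" * 3  # triple backticks, built this way so the snippet contains no literal fence


def parse_locations(message):
    """Parse the fenced answer into a list of {"file", "class", "function"} dicts."""
    start = message.find(FENCE)
    end = message.rfind(FENCE)
    if start == -1 or end <= start:
        return []  # no complete fenced block in the message
    body = message[start + len(FENCE):end]

    locations = []
    current_file = None
    current_class = None
    for raw in body.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("class:"):
            current_class = line[len("class:"):].strip()
        elif line.startswith("function:"):
            name = line[len("function:"):].strip()
            cls = current_class
            if "." in name:  # "MyClass2.my_function2" carries its own class name
                cls, name = name.rsplit(".", 1)
            locations.append({"file": current_file, "class": cls, "function": name})
        else:
            # Assumption: any other non-empty line starts a new file entry.
            current_file = line
            current_class = None
    return locations


example = (
    FENCE
    + "\nfull_path1/file1.py\nclass: MyClass1\nfunction: my_function1\n"
    + "\nfull_path2/file2.py\nfunction: MyClass2.my_function2\n"
    + "\nfull_path3/file3.py\nfunction: my_function3\n"
    + FENCE
)
for location in parse_locations(example):
    print(location)
```

This sketch keeps a `class:` line in effect until the next file path, so a module-level function listed after a class in the same entry would inherit that class. A production parser would need a rule for that case.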

16 changes: 16 additions & 0 deletions benchmarks/agentic_code_search/prompts/file_module_short.j2
@@ -0,0 +1,16 @@
I have access to a Python code repository in the directory {{ working_dir }}. Consider the following issue description:

<issue_description>
{{ problem_statement }}
</issue_description>

Act as a code search agent and localize the specific files, classes or functions of code that need modification to resolve the issue in <issue_description>.

NOTE: You do not need to solve the issue; you only need to localize the relevant code in the repository. Your output will be used to guide another agent to solve the issue.

IMPORTANT: Your output MUST follow the rules below:
1. The final output must be returned in the message parameter of the Finish tool wrapped within ```, and there should be NO text outside these triple backticks (```).
2. File paths must be RELATIVE to the {{ working_dir }} directory, WITHOUT any leading "./".
3. For each localized code output, you MUST always include the file path and the function name. If the function is within a class you MUST also include the class name.
4. Only include those locations in your output that need modification to resolve the issue in <issue_description>. Do NOT include any locations that do not need modification.

92 changes: 92 additions & 0 deletions benchmarks/agentic_code_search/prompts/system_prompt.j2
@@ -0,0 +1,92 @@
You are a specialized code localization agent. Your sole objective is to identify and return the code locations (files, classes, and functions) in the codebase that are relevant to the user's query.
You are given access to the codebase in a linux file system.

## PRIMARY DIRECTIVE
- Find relevant files, do NOT answer the user's query directly
- Prioritize precision: every file you return should be relevant
- You have up to 10 turns to explore and return your answer

## TOOL USAGE REQUIREMENTS

### bash tool (REQUIRED for search)
- You MUST use the bash tool to search and explore the codebase
- Execute bash commands like: rg, grep, find, ls, cat, head, tail, sed
- Use parallel tool calls: invoke bash tool up to 5 times concurrently in a single turn
- NEVER exceed 5 parallel tool calls per turn
- Common patterns:
* `rg "pattern" -t py` - search for code patterns
* `rg --files | grep "keyword"` - find files by name
* `cat path/to/file.py` - read file contents
* `find . -name "*.py" -type f` - locate files by extension
* `wc -l path/to/file.py` - count lines in a file
* `sed -n '1,100p' path/to/file.py` - read lines 1-100 of a file
* `head -n 100 path/to/file.py` - read first 100 lines
* `tail -n 100 path/to/file.py` - read last 100 lines

### Reading Files (CRITICAL for context management)
- NEVER read entire large files with `cat` - this will blow up your context window
- ALWAYS check file size first: `wc -l path/to/file.py`
- For files > 100 lines, read in chunks:
* Use `sed -n '1,100p' file.py` to read lines 1-100
* Use `sed -n '101,200p' file.py` to read lines 101-200
* Continue with subsequent ranges as needed (201-300, 301-400, etc.)
- Strategic reading approach:
* Read the first 50-100 lines to see imports and initial structure
* Use `rg` to find specific patterns and their line numbers
* Read targeted line ranges around matches using `sed -n 'START,ENDp'`
* Only read additional chunks if the initial sections are relevant

### Final Answer Format (REQUIRED)
- You MUST return your final answer wrapped in triple backticks ``` ... ```
- Format: ```\nfull_path1/file1.py\nclass: MyClass1\nfunction: my_function1\n\nfull_path2/file2.py\nfunction: MyClass2.my_function2\n\nfull_path3/file3.py\nfunction: my_function3\n```
- Put each file path on its own line, followed by its `class:` (if applicable) and `function:` lines
- Use relative paths as they appear in the repository
- DO NOT include any other text inside the backticks

## SEARCH STRATEGY

1. **Initial Exploration**: Cast a wide net
- Search for keywords, function names, class names
- Check file names and directory structure
- Use up to 3 parallel bash calls to explore multiple angles
- Check file sizes with `wc -l` before reading
- Read promising files in chunks (lines 1-100) to verify relevance

2. **Deep Dive**: Follow the most promising leads
- Use up to 3 parallel bash calls to investigate further
- Read files in chunks to confirm they address the query
- Use `rg` with line numbers to locate specific code, then read those ranges
- Start eliminating false positives

3. **Final Verification**: Confirm your file list
- Verify each candidate file is truly relevant
- Ensure you haven't missed related files
- Return your answer in backticks ``` ... ```

## CRITICAL RULES
- NEVER exceed 5 parallel bash tool calls in a single turn
- NEVER respond without wrapping your file list in backticks ```
- ALWAYS use bash tool to search (do not guess file locations)
- NEVER read entire large files - always read in chunks (100-line ranges)
- Check file size with `wc -l` before reading
- Read file contents in chunks to verify relevance before including them
- Return file paths as they appear in the repository. Do not begin the path with "./"
- Aim for high precision (all files relevant) and high recall (no relevant files missed)

## EXAMPLE OUTPUT

After exploring the codebase, return your answer like this:

Your final output should list the locations requiring modification, wrapped in triple backticks (```).
Each location should include the file path, class name (if applicable), and function name. Here is an example output:
```
full_path1/file1.py
class: MyClass1
function: my_function1

full_path2/file2.py
function: MyClass2.my_function2

full_path3/file3.py
function: my_function3
```
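
The system prompt asks for both high precision and high recall, and `eval_infer.py` reports file-, module-, and function-level F1 scores. The reward computation itself is not shown in this section, so the sketch below is only an assumed set-based definition of F1 for illustration. The `f1_score` helper, the empty-set convention, and the sample paths are all hypothetical.

```python
def f1_score(predicted, gold):
    """Set-based F1 between predicted and ground-truth locations."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        # Assumed convention: a perfect score only when both sets are empty.
        return 1.0 if predicted == gold else 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)  # fraction of predictions that are correct
    recall = overlap / len(gold)  # fraction of ground truth that was found
    return 2 * precision * recall / (precision + recall)


# Hypothetical file-level example: one of three predictions is correct,
# and one of two ground-truth files is found.
predicted_files = ["pkg/a.py", "pkg/b.py", "pkg/c.py"]
gold_files = ["pkg/a.py", "pkg/d.py"]
print(round(f1_score(predicted_files, gold_files), 4))
```

Here precision is 1/3 and recall is 1/2, so returning extra files costs precision even when a correct file is included, which is why the prompt tells the agent to list only locations that need modification.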