Summary
Recent SWTBench runs showed a large drop in resolved instances. We fixed one cause (top-level files like reproduction.py leaking into patches), but there are a few more low-touch tweaks that should recover accuracy.
Proposal
Aggressive patch filtering in conversion
- In the conversion step, drop diffs for files outside the test paths, extending the current setup-file/top-level stripping. If a prediction ends up with zero test diffs, omit it from the submission.
- Extend the setup/config filter to cover other infra files that can destabilize test selection.
Directive hygiene in harness
- In the SWTBench harness utils, ignore non-test files, including root-level files. Generate directives only for changed test files, so that helper scripts are not run as tests.
- If multiple files are changed, cap directives to the changed test files instead of running the whole suite by default.
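One way the directive rules above might look in code is sketched below. The function names are hypothetical and do not refer to the harness's actual API; the skipped file names and the `test_*.py` / `*_test.py` naming convention are assumptions, as is treating a directive as a plain file path.

```python
import re
from pathlib import PurePosixPath
from typing import List

# Assumed names for illustration; adjust to the files the harness should skip.
SKIPPED_NAMES = {"conftest.py", "__init__.py"}

# Captures the new-side path from each per-file header of a git diff.
DIFF_HEADER = re.compile(r"^diff --git a/\S+ b/(\S+)$", re.MULTILINE)


def is_test_file(path: str) -> bool:
    """Return True for non-root Python files that follow a test naming scheme."""
    p = PurePosixPath(path)
    if len(p.parts) == 1 or p.name in SKIPPED_NAMES:
        return False  # root-level scripts and support files are never directives
    named_like_test = p.name.startswith("test_") or p.name.endswith("_test.py")
    return p.suffix == ".py" and named_like_test


def directives_from_patch(patch: str) -> List[str]:
    """Return one directive per changed test file, in patch order, without duplicates.

    An empty list means there is nothing to run; the caller should not fall
    back to running the whole suite.
    """
    changed = DIFF_HEADER.findall(patch)
    return list(dict.fromkeys(path for path in changed if is_test_file(path)))
```

Returning an empty list rather than a "run everything" default keeps the cap explicit: the caller decides what to do with a prediction that changes no test files.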
Prompt tightening
- Update the prompt to forbid changes outside tests, insist on real failing assertions on the existing bug path (no stubs/mocks), and remind the agent to place tests under the project test tree and use the project runner. Optional: if the test passes on base, strengthen it until it fails.
Submission precheck (optional)
- Before conversion, reject predictions with no test-file changes; optionally quick-run the changed tests on base to confirm they fail.
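The optional quick-run could be approximated along these lines. This is a sketch under several assumptions: the function name `fails_on_base` is hypothetical, the repository is already checked out at the base commit with the predicted test patch applied, and the project runner is pytest (the `runner` parameter lets a different command be substituted).

```python
import subprocess
import sys
from typing import Sequence


def fails_on_base(
    repo_dir: str,
    test_files: Sequence[str],
    timeout: int = 600,
    runner: Sequence[str] = (sys.executable, "-m", "pytest", "-x", "-q"),
) -> bool:
    """Return True only if the changed tests run and at least one fails.

    Assumes repo_dir is at the base commit with the test patch applied.
    """
    if not test_files:
        return False  # no test changes: reject the prediction outright
    try:
        result = subprocess.run(
            [*runner, *test_files],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # a hanging test is not a usable failing test
    # pytest exits with 1 when tests were collected and some failed; other
    # non-zero codes (collection errors, usage errors, nothing collected)
    # are not counted as a genuine failure on base.
    return result.returncode == 1
```

Accepting only exit code 1 is a deliberate choice here: a test file that errors at import time would otherwise look like a "failing" test while asserting nothing about the bug.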
Rationale
- Many unresolved runs were caused by helper scripts and non-test edits being treated as test directives, which led to failures even after the golden patch was applied. OpenHands strips these; aligning conversion, harness, and prompt should restore the ~60% success rate seen previously.
Tasks
- Add stricter file filters in SWTBench conversion (test-only, infra exclusion, zero-test guard)
- Harden directive generation to ignore non-tests/root scripts
- Tighten prompt language accordingly
- (Optional) add pre-submission sanity checks for test presence/failure