Improve SWTBench pass rate with patch and harness hygiene #330

## Summary
Recent SWTBench runs showed a large drop in resolved instances. One cause is already fixed (top-level files such as reproduction.py leaking into patches); a few more low-touch changes should recover accuracy.

## Proposal
- **Aggressive patch filtering in conversion**
  - In , drop diffs outside test paths (, , , ). Extend current setup-file/top-level stripping. If a prediction ends up with zero test diffs, omit it from submissions.
  - Extend setup/config filter to include , , and other infra files that can destabilize test selection.
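A minimal sketch of the test-only filter and zero-test guard described above. The path heuristic (`is_test_path`), the directory names, and the function names are hypothetical stand-ins, since the actual path list and conversion module are not spelled out here; it also assumes predictions are git-style unified diffs.

```python
import re
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# Hypothetical test-directory names; replace with the project's real list.
TEST_DIRS = {"tests", "test", "testing"}

# Matches the per-file header of a git-style unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def is_test_path(path: str) -> bool:
    """Assumed heuristic: not a top-level file, and either inside a test
    directory or named like a test module."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:  # root files such as reproduction.py are never tests
        return False
    name = parts[-1]
    in_test_dir = any(part in TEST_DIRS for part in parts[:-1])
    test_named = name.startswith("test_") or name.endswith("_test.py")
    return in_test_dir or test_named


def split_file_diffs(patch: str) -> List[Tuple[str, str]]:
    """Split a patch into (target_path, diff_text) sections, one per file."""
    sections: List[Tuple[str, str]] = []
    path: Optional[str] = None
    lines: List[str] = []
    for line in patch.splitlines(keepends=True):
        match = DIFF_HEADER.match(line.rstrip("\n"))
        if match:
            if path is not None:
                sections.append((path, "".join(lines)))
            path, lines = match.group(2), [line]
        elif path is not None:
            lines.append(line)
    if path is not None:
        sections.append((path, "".join(lines)))
    return sections


def keep_test_diffs(patch: str) -> Optional[str]:
    """Return the patch restricted to test files, or None when no test
    diffs remain (the zero-test guard: omit such predictions)."""
    kept = [text for path, text in split_file_diffs(patch) if is_test_path(path)]
    return "".join(kept) if kept else None
```

Returning `None` rather than an empty string makes the "omit from submissions" case explicit at the call site.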

- **Directive hygiene in harness**
  - In  (SWTBench harness utils), ignore non-test files; skip , root files, and . Only generate directives for changed test files to avoid running helper scripts as tests.
  - If multiple files are changed, cap directives to the changed test files instead of running the whole suite by default.
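The directive rule above could look roughly like the following. This is a sketch under assumptions, not the harness's actual code: `build_directives`, `looks_like_test_file`, and the helper-file names in `NON_TEST_NAMES` are hypothetical, and the real skip list should come from the harness utils.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Hypothetical names; the real harness may use a different skip list.
TEST_DIR_NAMES = {"tests", "test"}
NON_TEST_NAMES = {"conftest.py", "__init__.py"}


def looks_like_test_file(path: str) -> bool:
    """Assumed heuristic for 'this changed file is a runnable test'."""
    p = PurePosixPath(path)
    if len(p.parts) < 2:  # skip root-level scripts
        return False
    if p.suffix != ".py" or p.name in NON_TEST_NAMES:
        return False
    in_test_dir = any(part in TEST_DIR_NAMES for part in p.parts[:-1])
    test_named = p.name.startswith("test_") or p.name.endswith("_test.py")
    return in_test_dir or test_named


def build_directives(changed_files: Iterable[str]) -> List[str]:
    """One directive per changed test file, in first-seen order.

    An empty result means 'nothing to run'; this sketch never falls back
    to the whole suite.
    """
    seen = set()
    directives: List[str] = []
    for path in changed_files:
        if path not in seen and looks_like_test_file(path):
            seen.add(path)
            directives.append(path)
    return directives
```

For example, `build_directives(["reproduction.py", "tests/test_a.py", "src/pkg/mod.py"])` yields only `["tests/test_a.py"]`, so helper scripts are never run as tests.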

- **Prompt tightening**
  - Update  to forbid changes outside tests and adding , to insist on real failing assertions on the existing bug path (no stubs/mocks), and to remind that tests belong under the project test tree and should use the project runner. Optional: if the test passes on base, strengthen it until it fails.

- **Submission precheck (optional)**
  - Before conversion, reject predictions with no test file changes; optionally quick-run changed tests on base to ensure they fail.

## Rationale
- Many unresolved runs were caused by helper scripts and non-test edits being treated as test directives; these fail even after the golden patch is applied. OpenHands strips these; aligning conversion, harness, and prompt should restore the ~60% success rate seen previously.

## Tasks
- [ ] Add stricter file filters in SWTBench conversion (test-only, infra exclusion, zero-test guard)
- [ ] Harden directive generation to ignore non-tests/root scripts
- [ ] Tighten  prompt language accordingly
- [ ] (Optional) add pre-submission sanity checks for test presence/failure

