Improve SWTBench pass rate with patch and harness hygiene #330

@simonrosenberg

Description

Summary

Recent SWTBench runs showed a large drop in resolved instances. We fixed one cause (top-level files such as reproduction.py leaking into patches), but a few more low-touch tweaks should recover accuracy.

Proposal

  • Aggressive patch filtering in conversion

    • In , drop diffs outside test paths (, , , ). Extend current setup-file/top-level stripping. If a prediction ends up with zero test diffs, omit it from submissions.
    • Extend setup/config filter to include , , and other infra files that can destabilize test selection.
  • Directive hygiene in harness

    • In (SWTBench harness utils), ignore non-test files; skip , root files, and . Only generate directives for changed test files to avoid running helper scripts as tests.
    • If multiple files are changed, cap directives to the changed test files instead of running the whole suite by default.
  • Prompt tightening

    • Update to forbid changes outside tests and adding , insist on real failing assertions on the existing bug path (no stubs/mocks), and remind to place tests under the project test tree and use the project runner. Optional: if the test passes on base, strengthen it until it fails.
  • Submission precheck (optional)

    • Before conversion, reject predictions with no test file changes; optionally quick-run changed tests on base to ensure they fail.
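
A minimal sketch of the conversion-side filter proposed above, assuming predictions are unified diffs in `git diff` format. The names `is_test_path`, `split_patch`, and `filter_patch` are hypothetical, and the directory and infra-file lists are illustrative assumptions, not the actual lists used by the conversion script.

```python
import re
from pathlib import PurePosixPath

# Assumed heuristics for illustration only; the real path and file lists
# belong to the conversion script and may differ.
TEST_DIRS = {"tests", "test", "testing"}
INFRA_FILES = {"setup.py", "setup.cfg", "pyproject.toml", "tox.ini", "conftest.py"}

_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def is_test_path(path):
    """True if the path looks like a test file below the repository root."""
    p = PurePosixPath(path)
    # Reject top-level files (e.g. reproduction.py) and known infra files.
    if len(p.parts) < 2 or p.name in INFRA_FILES:
        return False
    in_test_dir = any(part in TEST_DIRS for part in p.parts[:-1])
    looks_like_test = p.name.startswith("test_") or p.name.endswith("_test.py")
    return in_test_dir or looks_like_test


def split_patch(patch):
    """Split a git-style patch into (path, chunk_text) pairs, one per file."""
    chunks, path, lines = [], None, []
    for line in patch.splitlines(keepends=True):
        match = _DIFF_HEADER.match(line.rstrip("\r\n"))
        if match:
            if path is not None:
                chunks.append((path, "".join(lines)))
            path, lines = match.group(2), [line]
        elif path is not None:
            lines.append(line)
    if path is not None:
        chunks.append((path, "".join(lines)))
    return chunks


def filter_patch(patch):
    """Keep only test-file diffs; return None when no test diff remains."""
    kept = [chunk for path, chunk in split_patch(patch) if is_test_path(path)]
    return "".join(kept) if kept else None
```

A `None` result corresponds to the zero-test guard: the caller would omit that prediction from the submission rather than submit an empty patch. Paths containing the literal text ` b/` are not handled by this header regex.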

Rationale

  • Many unresolved runs were caused by helper scripts and non-test edits being treated as test directives, which led to failures even after the golden patch was applied. OpenHands strips these; aligning conversion, harness, and prompt should restore the ~60% success rate seen previously.
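
To illustrate the failure mode, here is a rough sketch of directive generation that only emits changed test files. The function names and the filename heuristics are assumptions for illustration; the actual harness utility may identify test files differently.

```python
import re
from pathlib import PurePosixPath

# Matches the "+++ b/<path>" line of each changed file in a git-style diff.
# Deleted files ("+++ /dev/null") do not match and are skipped.
_NEW_FILE = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)


def changed_files(patch):
    """Return the post-change path of every file touched by the patch."""
    return _NEW_FILE.findall(patch)


def is_test_file(path):
    """Assumed heuristic: a Python file below the root with a test-like name."""
    p = PurePosixPath(path)
    if p.suffix != ".py" or len(p.parts) == 1:
        return False  # non-Python files and root-level scripts are never run
    return p.name.startswith("test_") or p.name.endswith("_test.py")


def directives_for(patch):
    """Changed test files only, de-duplicated, in order of first appearance."""
    return list(dict.fromkeys(f for f in changed_files(patch) if is_test_file(f)))
```

With this shape, a helper script such as a root-level reproduction.py never becomes a directive, and a patch touching several files runs only its changed test files rather than the whole suite. An added source line that itself begins with `++ b/` would be misread as a file header, so a production version would want a proper diff parser.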

Tasks

  • Add stricter file filters in SWTBench conversion (test-only, infra exclusion, zero-test guard)
  • Harden directive generation to ignore non-tests/root scripts
  • Tighten prompt language accordingly
  • (Optional) add pre-submission sanity checks for test presence/failure
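
For the optional pre-submission check, one possible shape is sketched below. It assumes the project runner is pytest, whose documented exit code 1 means tests were collected and at least one failed; the function name `fails_on_base` and its parameters are hypothetical, and projects with a different runner would need a different command and exit-code check.

```python
import subprocess
import sys


def fails_on_base(
    repo_dir,
    test_files,
    runner=(sys.executable, "-m", "pytest", "-x", "-q"),
    timeout=600,
):
    """Return True only if the changed tests run and fail on the base commit.

    Assumes repo_dir is checked out at the base commit with the predicted
    test patch applied, and that the runner follows pytest's exit codes.
    """
    if not test_files:
        return False  # no test file changes: reject the prediction
    try:
        proc = subprocess.run(
            [*runner, *test_files],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # a hanging test is not a usable failing test
    # Exit code 1: tests ran and some failed. Other non-zero codes
    # (collection errors, usage errors, no tests collected) do not count.
    return proc.returncode == 1
```

Checking for exit code 1 specifically, rather than any non-zero code, avoids accepting predictions whose tests fail only because they cannot be collected or imported.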
