Conversation

@simonrosenberg
Collaborator

Prebake SWT-bench eval env/base images and preload them in eval jobs.

  • add swtbench image utilities to list required env/base tags and build/push prebaked eval env images
  • add workflow inputs to build prebaked eval env images and publish them to GHCR
  • add an optional micromamba patch flag so env hashes stay stable by default

Acceptance: swtbench run-eval with eval_limit=1 completed using prebaked env/base images from ghcr.io/openhands/swtbench-eval.
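The acceptance flow above (eval jobs pulling prebaked env images from ghcr.io/openhands/swtbench-eval instead of building them) could be sketched roughly as below. The tag scheme and helper names are hypothetical illustrations, not the PR's actual image utilities:

```python
# Hypothetical sketch: derive a prebaked eval image tag for an instance and
# emit the `docker pull` command a job would run. The registry comes from the
# acceptance criterion; the env-hash tag scheme is an assumption.
import hashlib

REGISTRY = "ghcr.io/openhands/swtbench-eval"

def env_tag(instance_id: str, env_spec: str) -> str:
    """Hash the environment spec so the tag stays stable unless the spec
    changes (mirroring the 'env hashes stay stable by default' goal)."""
    digest = hashlib.sha256(env_spec.encode()).hexdigest()[:12]
    return f"{REGISTRY}:env-{instance_id}-{digest}"

def pull_command(instance_id: str, env_spec: str) -> str:
    """Command an eval job would run to preload the prebaked image."""
    return f"docker pull {env_tag(instance_id, env_spec)}"
```

Because the tag is derived only from the instance id and the environment spec, rebuilding with an unchanged spec reuses the same tag, which is what makes preloading safe.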

Contributor

@neubig left a comment


Do you have the instance-size one eval ready yet, or is this still wip?

@simonrosenberg
Collaborator Author

simonrosenberg commented Jan 15, 2026

Do you have the instance-size one eval ready yet, or is this still wip?

Do you mean using high-core nodes and parallelizing multiple evals on them? That's already merged:
https://github.com/OpenHands/evaluation/pull/131/changes

@neubig
Contributor

neubig commented Jan 15, 2026

Sorry for the vague message. The "acceptance criterion" for this PR was that you were able to run an eval with a single instance. I wanted to ask if you had successfully done that yet.

@openhands-ai

openhands-ai bot commented Jan 16, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Build SWT-Bench Images

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #318 at branch `swtbench-prebaked-eval`

Feel free to include any additional details that might help me get this PR into a better state.


@simonrosenberg
Collaborator Author

simonrosenberg commented Jan 16, 2026

Sorry for the vague message. The "acceptance criterion" for this PR was that you were able to run an eval with a single instance. I wanted to ask if you had successfully done that yet.

Ah, OK!
Yes, it works for one image. What's also nice is that building the eval images is a one-shot task, since they don't depend on our agent version.
However, some images are extremely slow to build (out of 40 images, a single one took over 3 hours while all the others took under 3 minutes), so building all 433 images might take a while. That said, the mamba fallback also seems to work.
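Given that build-time skew, a parallel build with a per-image fallback might look roughly like this sketch; `build_fn` and `fallback_fn` are stand-ins for the repo's actual build and mamba-fallback paths, not its real functions:

```python
# Hedged sketch: build prebaked images in parallel; if one image's build
# fails, record the fallback result instead so one pathological image
# doesn't sink the whole batch.
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_all(instance_ids, build_fn, fallback_fn, max_workers=8):
    """Run build_fn per instance; on any build error, use fallback_fn."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(build_fn, iid): iid for iid in instance_ids}
        for fut in as_completed(futures):
            iid = futures[fut]
            try:
                results[iid] = fut.result()
            except Exception:
                results[iid] = fallback_fn(iid)
    return results
```

A timeout variant (passing a deadline to `fut.result()` and falling back on `TimeoutError`) would handle the 3-hour outlier case specifically, at the cost of abandoning work in progress.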

@simonrosenberg
Collaborator Author

Sorry for the vague message. The "acceptance criterion" for this PR was that you were able to run an eval with a single instance. I wanted to ask if you had successfully done that yet.

testing now on 100 images

simonrosenberg and others added 4 commits January 16, 2026 16:35
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg requested a review from neubig January 16, 2026 18:00

3 participants