Conversation

@simonrosenberg
Collaborator

Prebake SWT-bench eval env/base images and preload them in eval jobs.

  • add swtbench image utilities to list required env/base tags and build/push prebaked eval env images
  • add workflow inputs to build prebaked eval env images and publish them to GHCR
  • add an optional micromamba patch flag so env hashes stay stable by default

Acceptance: swtbench run-eval with eval_limit=1 completed using prebaked env/base images from ghcr.io/openhands/swtbench-eval.
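The acceptance flow above (eval jobs pulling prebaked env images from ghcr.io/openhands/swtbench-eval instead of building them) could be sketched roughly as below. The tag scheme and helper names are hypothetical illustrations, not the PR's actual image utilities:

```python
# Hypothetical sketch: derive a prebaked eval image tag for an instance and
# emit the `docker pull` command a job would run. The registry comes from the
# acceptance criterion; the env-hash tag scheme is an assumption.
import hashlib

REGISTRY = "ghcr.io/openhands/swtbench-eval"

def env_tag(instance_id: str, env_spec: str) -> str:
    """Hash the environment spec so the tag stays stable unless the spec
    changes (mirroring the 'env hashes stay stable by default' goal)."""
    digest = hashlib.sha256(env_spec.encode()).hexdigest()[:12]
    return f"{REGISTRY}:env-{instance_id}-{digest}"

def pull_command(instance_id: str, env_spec: str) -> str:
    """Command an eval job would run to preload the prebaked image."""
    return f"docker pull {env_tag(instance_id, env_spec)}"
```

Because the tag is derived only from the instance id and the environment spec, rebuilding with an unchanged spec reuses the same tag, which is what makes preloading safe.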

Contributor

@neubig left a comment


Do you have the instance-size one eval ready yet, or is this still wip?

@simonrosenberg
Collaborator Author

simonrosenberg commented Jan 15, 2026

Do you have the instance-size one eval ready yet, or is this still wip?

Do you mean using high-core nodes and parallelizing multiple evals on them? That's already merged:
https://github.com/OpenHands/evaluation/pull/131/changes

@neubig
Contributor

neubig commented Jan 15, 2026

Sorry for the vague message. The "acceptance criterion" for this PR was that you were able to run an eval with a single instance. I wanted to ask if you had successfully done that yet.

@openhands-ai

openhands-ai bot commented Jan 16, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Build SWT-Bench Images

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #318 at branch `swtbench-prebaked-eval`

Feel free to include any additional details that might help me get this PR into a better state.


@simonrosenberg
Collaborator Author

simonrosenberg commented Jan 16, 2026

Sorry for the vague message. The "acceptance criterion" for this PR was that you were able to run an eval with a single instance. I wanted to ask if you had successfully done that yet.

Ah, OK!
Yes, it works for one image. What's also nice is that building the eval images is a one-shot task, since they don't depend on our agent version.
However, some images are extremely slow to build (out of 40 images, a single one took over 3 hours while all the others took under 3 minutes), so building all 433 images might take a while. That said, the mamba fallback also seems to work.
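Given that build-time skew, a parallel build with a per-image fallback might look roughly like this sketch; `build_fn` and `fallback_fn` are stand-ins for the repo's actual build and mamba-fallback paths, not its real functions:

```python
# Hedged sketch: build prebaked images in parallel; if one image's build
# fails, record the fallback result instead so one pathological image
# doesn't sink the whole batch.
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_all(instance_ids, build_fn, fallback_fn, max_workers=8):
    """Run build_fn per instance; on any build error, use fallback_fn."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(build_fn, iid): iid for iid in instance_ids}
        for fut in as_completed(futures):
            iid = futures[fut]
            try:
                results[iid] = fut.result()
            except Exception:
                results[iid] = fallback_fn(iid)
    return results
```

A timeout variant (passing a deadline to `fut.result()` and falling back on `TimeoutError`) would handle the 3-hour outlier case specifically, at the cost of abandoning work in progress.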

@simonrosenberg
Collaborator Author

Sorry for the vague message. The "acceptance criterion" for this PR was that you were able to run an eval with a single instance. I wanted to ask if you had successfully done that yet.

testing now on 100 images

simonrosenberg and others added 4 commits January 16, 2026 16:35
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg requested a review from neubig January 16, 2026 18:00

3 participants