@MarcusSorealheis
Collaborator

MarcusSorealheis commented Jan 26, 2026

Description

This change allows the worker to complete initialization (warming CAS connections,
verifying disk space, etc.) before accepting tasks. This is particularly
useful for cold pod startup where initialization can take several minutes.

Fixes # (issue)

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)

Workers can now delay registration with the scheduler until they are
fully ready to accept tasks. This prevents the scheduler from assigning
tasks to workers that are still initializing (e.g., cold pods that need
time to warm CAS connections, download container images, etc.).

New LocalWorkerConfig options:
- warmup_delay_secs: Delay in seconds before registering (default: 0)
- verify_cas_before_register: Verify CAS health before registering
- readiness_check_script: Custom script to run as readiness check
- readiness_timeout_secs: Timeout in seconds for readiness checks (default: 600)
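
As a rough illustration of the options listed above, here is a minimal
Rust sketch of a struct that carries them. WarmupOptions and its helper
methods are hypothetical names, not the actual LocalWorkerConfig
definition; the field types and the defaults for
verify_cas_before_register and readiness_check_script are assumptions,
since the commit message only states defaults for the two numeric options.

```rust
use std::time::Duration;

/// Hypothetical mirror of the four new options. The real
/// `LocalWorkerConfig` has more fields and its exact types may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct WarmupOptions {
    /// Delay in seconds before registering (documented default: 0).
    pub warmup_delay_secs: u64,
    /// Verify CAS health before registering (default assumed: false).
    pub verify_cas_before_register: bool,
    /// Custom readiness script (default assumed: none configured).
    pub readiness_check_script: Option<String>,
    /// Timeout in seconds for readiness checks (documented default: 600).
    pub readiness_timeout_secs: u64,
}

impl Default for WarmupOptions {
    fn default() -> Self {
        Self {
            warmup_delay_secs: 0,
            verify_cas_before_register: false,
            readiness_check_script: None,
            readiness_timeout_secs: 600,
        }
    }
}

impl WarmupOptions {
    /// The warmup delay as a `Duration`.
    pub fn warmup_delay(&self) -> Duration {
        Duration::from_secs(self.warmup_delay_secs)
    }

    /// The readiness timeout as a `Duration`.
    pub fn readiness_timeout(&self) -> Duration {
        Duration::from_secs(self.readiness_timeout_secs)
    }

    /// True if any option would postpone registration.
    pub fn delays_registration(&self) -> bool {
        self.warmup_delay_secs > 0
            || self.verify_cas_before_register
            || self.readiness_check_script.is_some()
    }
}
```

With these defaults nothing postpones registration, which matches the
stated warmup_delay_secs default of 0.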

The warmup phase runs after the worker connects to the scheduler
endpoint but before it calls register_worker(). If a readiness script
is configured, it is retried until it succeeds or
readiness_timeout_secs elapses.
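
The retry behavior described above can be sketched roughly as follows.
This is not the code from this PR: wait_until_ready, Readiness,
script_succeeds, and the retry_interval parameter are hypothetical, and
treating a non-zero exit status or a failed spawn as "not ready" is an
assumption. The sleep function is passed in so the loop can be exercised
without real waiting.

```rust
use std::process::Command;
use std::time::{Duration, Instant};

/// Outcome of the readiness phase (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    Ready { attempts: u32 },
    TimedOut { attempts: u32 },
}

/// Runs `check` until it returns true or `timeout` has elapsed,
/// calling `sleep(retry_interval)` between attempts.
pub fn wait_until_ready<C, S>(
    mut check: C,
    mut sleep: S,
    timeout: Duration,
    retry_interval: Duration,
) -> Readiness
where
    C: FnMut() -> bool,
    S: FnMut(Duration),
{
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        if check() {
            return Readiness::Ready { attempts };
        }
        // Give up once the deadline has passed; the check always runs
        // at least once, even with a zero timeout.
        if Instant::now() >= deadline {
            return Readiness::TimedOut { attempts };
        }
        sleep(retry_interval);
    }
}

/// Assumed readiness test: the script is ready when it exits with
/// status 0. A script that cannot be started counts as not ready.
pub fn script_succeeds(path: &str) -> bool {
    Command::new(path)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

A caller would pass something like `|| script_succeeds(path)` as the
check and only proceed to registration on `Readiness::Ready`.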

This addresses the root cause of stalled tasks when cold workers
register before being ready to execute actions.
…d-worker-registration

* 'main' of github.com:TraceMachina/nativelink:
  Support ignorable platform properties (#2120)

Add example files demonstrating the new worker warmup feature:

- worker_readiness_check.sh: Example readiness check script that
  verifies Docker daemon, disk space, CAS connectivity, and GPU
  availability before worker registration.

- worker_with_warmup.json5: Example worker configuration showing
  how to use warmup_delay_secs, readiness_check_script, and
  readiness_timeout_secs options.

These examples help users configure workers to delay registration
until they are fully ready to accept tasks, preventing stalled jobs
during cold pod startups.

Adds three new tests to verify warmup delay functionality:
- warmup_delay_is_applied_before_registration: Verifies that the configured
  warmup delay is applied before worker registers with scheduler
- readiness_check_script_success: Verifies readiness check script runs and
  allows registration when it returns success
- readiness_check_script_with_warmup_delay: Verifies both features work
  together (readiness check runs first, then warmup delay)

Also adds setup_local_worker_with_real_sleep() test utility that uses actual
tokio::time::sleep instead of the no-op mock, enabling tests to verify
timing behavior.
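
To illustrate why a real sleep matters for these tests, here is a small
std-only sketch of an injectable sleep function. It is not the
setup_local_worker_with_real_sleep() utility itself; the names
register_after_warmup, noop_sleep, and real_sleep are hypothetical, and
std::thread::sleep stands in for tokio::time::sleep to keep the example
dependency-free.

```rust
use std::time::Duration;

/// Sleeps for `warmup_delay` (if non-zero) using the injected `sleep`
/// function, then calls `register`. Hypothetical stand-in for the
/// worker's warmup-then-register sequence.
pub fn register_after_warmup<S, R>(warmup_delay: Duration, mut sleep: S, mut register: R)
where
    S: FnMut(Duration),
    R: FnMut(),
{
    if !warmup_delay.is_zero() {
        sleep(warmup_delay);
    }
    register();
}

/// No-op mock: returns immediately, so tests run fast but cannot
/// observe any elapsed time.
pub fn noop_sleep(_duration: Duration) {}

/// Real sleep: blocks the current thread for at least `duration`, so a
/// test can assert that registration happened after the delay.
pub fn real_sleep(duration: Duration) {
    std::thread::sleep(duration);
}
```

With noop_sleep a test can only check that a sleep was requested; with
real_sleep it can compare timestamps taken before the call and at
registration.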

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Apply formatjson5 and shfmt formatting to:
- worker_with_warmup.json5 - Use JSON5 style (unquoted keys, trailing commas)
- worker_readiness_check.sh - Add spaces around redirects (2> /dev/null)

@amankrx
Collaborator

amankrx commented Jan 27, 2026

/build-image

@github-actions

Image built and pushed!

ghcr.io/TraceMachina/nativelink:43337d1
