
Conversation


@roclark roclark commented Nov 13, 2025

Added LeptonRayCluster and LeptonRayJob classes, backed by the RayCluster and RayJob helpers respectively, to launch Ray clusters and jobs using the RayCluster feature on DGX Cloud Lepton. Developers can now launch a RayCluster and submit RayJobs remotely on their DGX Cloud Lepton node groups via the SDK.
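
A minimal sketch of the workflow this enables, assuming the `run.ray.cluster.RayCluster`/`run.ray.job.RayJob` helpers referenced in the review below and a `run.LeptonExecutor`; constructor arguments and method signatures here are illustrative assumptions, not verbatim from this PR:

```python
import nemo_run as run

# Executor that targets the DGX Cloud Lepton backend; the field names and
# values below are placeholders, not verbatim from this PR.
executor = run.LeptonExecutor(
    resource_shape="gpu.1xh100",  # hypothetical shape name
    node_group="my-node-group",   # hypothetical node group name
)

# Spin up a RayCluster on DGX Cloud Lepton.
cluster = run.ray.cluster.RayCluster(name="demo-cluster", executor=executor)
cluster.start()

# Submit a RayJob to the deployed cluster and stream its logs.
job = run.ray.job.RayJob(name="demo-job", executor=executor)
job.start(command="python train.py", workdir="./src")
job.logs(follow=True)

# Tear the cluster back down when finished.
cluster.stop()
```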

roclark and others added 9 commits November 13, 2025 10:22
Add support for launching a RayCluster on DGX Cloud Lepton and submitting
RayJobs on those clusters using the Lepton SDK. This uses the new RayCluster
feature on DGX Cloud Lepton to spin clusters up and down dynamically via the
Python SDK; jobs can be submitted directly to deployed clusters.

Signed-off-by: Robert Clark <roclark@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
… RayCluster and linting

Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Removed the placeholder Slurm packager handling comments from the Lepton
RayCluster code. For now, the "workdir" parameter should be used for
transferring local data to the remote Ray cluster.

Signed-off-by: Robert Clark <roclark@nvidia.com>
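
As a hedged illustration of the `workdir`-based transfer described above (the parameter name comes from the commit message; the executor arguments and exact `RayJob` signature are assumptions):

```python
import nemo_run as run

executor = run.LeptonExecutor(node_group="my-node-group")  # placeholder args

job = run.ray.job.RayJob(name="demo-job", executor=executor)
job.start(
    command="python train.py",  # assumed entrypoint argument
    workdir="./src",            # local directory shipped to the remote Ray cluster
)
```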
Fix an issue to ensure the proper head node resource shape is used if it
isn't explicitly given by the user.

Signed-off-by: Robert Clark <roclark@nvidia.com>
Updated the comments in the LeptonRayCluster and LeptonRayJob classes to
accurately reflect the code.

Signed-off-by: Robert Clark <roclark@nvidia.com>
The RayJob log stream would sometimes time out and reset, causing very long
log output in the terminal as it continually reset.

Signed-off-by: Robert Clark <roclark@nvidia.com>
The head node resource shape for a LeptonRayCluster should be optional. If
it isn't specified by the user, it should default to the same shape used
for the worker nodes.

Signed-off-by: Robert Clark <roclark@nvidia.com>
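
A sketch of the defaulting behavior this commit describes; the parameter names below are illustrative assumptions rather than the PR's actual field names:

```python
import nemo_run as run

executor = run.LeptonExecutor(
    resource_shape="gpu.1xh100",  # shape used for the worker nodes (placeholder)
    node_group="my-node-group",   # placeholder
    # No head node shape specified here: per this commit, the head node
    # simply falls back to the same shape as the workers.
)
cluster = run.ray.cluster.RayCluster(name="demo-cluster", executor=executor)
```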
Added an example to the Ray quick-start guide on how to use RayClusters
and RayJobs with NeMo-Run on DGX Cloud Lepton.

Signed-off-by: Robert Clark <roclark@nvidia.com>
@roclark roclark added the enhancement (New feature or request) label on Nov 13, 2025
Signed-off-by: Robert Clark <roclark@nvidia.com>

@zoeyz101 zoeyz101 left a comment


A couple of small changes requested and some minor questions.

| `run.ray.job.RayJob` | Lifecycle of a Ray **job** (submit ⇒ monitor ⇢ logs ⇢ cancel). | same |

The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s) or a **Slurm** job under the hood.
The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s), **DGX Cloud Lepton's RayCluster**, or a **Slurm** job under the hood.
Contributor


Three instead of two helpers, and I would say three helper executors.

Contributor Author


I interpreted the helpers being the RayCluster and RayJob objects with the different executors being the backends.

Contributor


I interpreted this as the helpers being the executors. @hemildesai can you confirm? RayJob and RayCluster do not have the same API.
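
For what it's worth, a hedged sketch of the uniform surface the quoted docs line describes, assuming both helpers accept an `executor` and that `run.LeptonExecutor`/`run.SlurmExecutor` exist as named; constructor arguments are deployment-specific and elided:

```python
import nemo_run as run

def run_ray_workload(executor: run.Executor) -> None:
    # Identical helper calls regardless of backend; the chosen executor
    # decides whether this targets KubeRay, DGX Cloud Lepton, or Slurm.
    cluster = run.ray.cluster.RayCluster(name="demo", executor=executor)
    cluster.start()
    job = run.ray.job.RayJob(name="demo-job", executor=executor)
    job.start(command="python train.py", workdir="./src")
    job.logs(follow=True)
    cluster.stop()

# run_ray_workload(run.LeptonExecutor(node_group="my-node-group"))  # hypothetical args
# run_ray_workload(run.SlurmExecutor(account="my-account"))         # hypothetical args
```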

Signed-off-by: Robert Clark <roclark@nvidia.com>
Allows users to specify how long to wait for a RayCluster to be created
on DGX Cloud Lepton.

Signed-off-by: Robert Clark <roclark@nvidia.com>
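
A hedged sketch of the configurable wait this commit adds; the `timeout` parameter name and its units are assumptions, not confirmed by the diff shown here:

```python
import nemo_run as run

executor = run.LeptonExecutor(node_group="my-node-group")  # placeholder args
cluster = run.ray.cluster.RayCluster(name="demo-cluster", executor=executor)
# Wait up to 30 minutes for DGX Cloud Lepton to bring the cluster up.
cluster.start(timeout=1800)
```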
Signed-off-by: Robert Clark <roclark@nvidia.com>
Signed-off-by: Robert Clark <roclark@nvidia.com>
Signed-off-by: Robert Clark <roclark@nvidia.com>
Signed-off-by: Robert Clark <roclark@nvidia.com>
@roclark roclark force-pushed the roclark/leptonraycluster branch from 65a6cbe to c45e6b0 on November 19, 2025 17:27
@roclark roclark force-pushed the roclark/leptonraycluster branch from c45e6b0 to b0a22a8 on November 19, 2025 17:43
@roclark roclark force-pushed the roclark/leptonraycluster branch from b0a22a8 to 9991173 on November 19, 2025 17:48
@roclark roclark force-pushed the roclark/leptonraycluster branch from 9991173 to 7252cb1 on November 19, 2025 17:53
@roclark roclark force-pushed the roclark/leptonraycluster branch from 7252cb1 to 34b93cc on November 19, 2025 17:56
@roclark roclark force-pushed the roclark/leptonraycluster branch from 34b93cc to b04d893 on November 19, 2025 18:03
@roclark roclark force-pushed the roclark/leptonraycluster branch from b04d893 to f637645 on November 19, 2025 18:31
Signed-off-by: Robert Clark <roclark@nvidia.com>

@zoeyz101 zoeyz101 left a comment


LGTM!

Contributor Author


@hemildesai - I updated the pyproject.toml file, as this update requires a newer minimum version of leptonai, and it looked like the ray packages were missing from extras. Do you prefer that I update the uv.lock file with this change, or is there a mechanism to update it automatically when pyproject.toml is updated?

@roclark roclark merged commit d3be9ac into NVIDIA-NeMo:main Dec 4, 2025
20 of 23 checks passed
