Add RayCluster support for DGX Cloud Lepton #389

roclark · 2025-11-13T22:25:39Z

Added LeptonRayCluster and LeptonRayJob classes backed by the RayCluster and RayJob helpers, respectively, to launch Ray jobs and clusters using the RayCluster feature on DGX Cloud Lepton. Developers can now launch a RayCluster and RayJobs via the SDK remotely on their DGX Cloud Lepton node groups.

Add support for launching a RayCluster on DGX Cloud Lepton and submitting RayJobs to those clusters using the lepton SDK. This uses the new RayCluster feature on DGX Cloud Lepton to spin clusters up and down dynamically via the Python SDK, and jobs can be submitted directly to deployed clusters. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Signed-off-by: Zoey Zhang <zozhang@nvidia.com>

… RayCluster and linting Signed-off-by: Zoey Zhang <zozhang@nvidia.com>

Removed the placeholder Slurm packager handling comments from the Lepton RayCluster code. For now, the "workdir" parameter should be used for transferring local data to the remote Ray cluster. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Fix issue to ensure the proper head node resource shape is used if it isn't explicitly given by the user. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Updated the comments in the LeptonRayCluster and LeptonRayJob classes to accurately reflect the code. Signed-Off-By: Robert Clark <roclark@nvidia.com>

The RayJob log stream would sometimes time out and reset, producing very long log output in the terminal as the stream repeatedly reset. Signed-Off-By: Robert Clark <roclark@nvidia.com>
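The commit message does not say how the stream was fixed. As an illustration only, here is a minimal sketch of one way to follow a log stream that can drop, assuming the stream replays from the beginning after each reconnect. `follow_logs` and `open_stream` are hypothetical names, not part of NeMo-Run or the lepton SDK.

```python
from typing import Callable, Iterable, Iterator


def follow_logs(
    open_stream: Callable[[], Iterable[str]],
    max_reconnects: int = 3,
) -> Iterator[str]:
    """Yield each log line once, even if the stream times out and restarts.

    Assumption: every call to open_stream() replays the log from the start,
    so lines that were already emitted are skipped by position.
    """
    seen = 0      # number of lines already yielded to the caller
    attempts = 0  # number of reconnects performed so far
    while True:
        position = 0
        try:
            for line in open_stream():
                position += 1
                if position > seen:
                    seen = position
                    yield line
            return  # the stream ended normally
        except TimeoutError:
            attempts += 1
            if attempts > max_reconnects:
                raise
```

Tracking a line offset keeps the terminal output linear across reconnects. A real backend may offer a timestamp or cursor to resume from, which would avoid re-reading the replayed lines entirely.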

The head node resource shape for a LeptonRayCluster should be optional. If it isn't specified by the user, it should default to the same shape used for the worker nodes. Signed-Off-By: Robert Clark <roclark@nvidia.com>
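The fallback described above can be sketched as follows. This is a simplified illustration, not the code in `nemo_run/run/ray/lepton.py`: the `ClusterShapes` class, its field names, and the shape strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterShapes:
    """Hypothetical holder for resource shapes (not the NeMo-Run API)."""

    worker_resource_shape: str
    head_resource_shape: Optional[str] = None  # optional, may be omitted

    def resolved_head_shape(self) -> str:
        # Fall back to the worker shape when no head shape was given.
        if self.head_resource_shape is None:
            return self.worker_resource_shape
        return self.head_resource_shape


# Shape names below are placeholders chosen for the example.
default_head = ClusterShapes(worker_resource_shape="gpu.example")
custom_head = ClusterShapes(
    worker_resource_shape="gpu.example",
    head_resource_shape="cpu.example",
)
```

Here `default_head.resolved_head_shape()` returns the worker shape, while `custom_head.resolved_head_shape()` returns the explicitly supplied head shape.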

Added an example to the Ray quick-start guide on how to use RayClusters and RayJobs with NeMo-Run on DGX Cloud Lepton. Signed-Off-By: Robert Clark <roclark@nvidia.com>

nemo_run/run/ray/lepton.py

Signed-Off-By: Robert Clark <roclark@nvidia.com>

zoeyz101

A couple of small changes requested and some minor questions.

zoeyz101 · 2025-11-14T15:56:53Z

docs/guides/ray.md

 | `run.ray.job.RayJob`         | Lifecycle of a Ray **job**   (submit ⇒ monitor ⇢ logs ⇢ cancel). | same |

-The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s) or a **Slurm** job under the hood.
+The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s), **DGX Cloud Lepton's RayCluster**, or a **Slurm** job under the hood.


This should say three helpers instead of two, and I would call them "three helper executors".

I interpreted the helpers being the RayCluster and RayJob objects with the different executors being the backends.

I interpreted this as the helpers being the executors. @hemildesai can you confirm? RayJob and RayCluster do not have the same API

docs/guides/ray.md

nemo_run/run/ray/lepton.py

nemo_run/run/ray/job.py

nemo_run/run/ray/lepton.py

Signed-Off-By: Robert Clark <roclark@nvidia.com>

Allows users to specify how long to wait for a RayCluster to be created on DGX Cloud Lepton. Signed-Off-By: Robert Clark <roclark@nvidia.com>
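A user-configurable readiness timeout is typically implemented as a polling loop with a deadline. The sketch below shows that general pattern under stated assumptions: `wait_until_ready`, `is_ready`, and the default values are hypothetical and do not reflect the actual parameter names or defaults in this PR.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = 600.0,
    poll_interval: float = 5.0,
) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True if the resource became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

Using `time.monotonic()` keeps the deadline unaffected by wall-clock adjustments. A caller would pass a callable that queries the cluster state and then decide whether to raise or clean up when `False` is returned.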

Signed-Off-By: Robert Clark <roclark@nvidia.com>

nemo_run/run/ray/lepton.py

Signed-Off-By: Robert Clark <roclark@nvidia.com>

zoeyz101

LGTM!

roclark · 2025-11-25T21:45:32Z

uv.lock

@hemildesai - I updated the pyproject.toml file as this update requires a newer minimum version of leptonai and it looked like the ray packages were missing from extras. Do you prefer that I update the uv.lock file with this update, or is there a mechanism to update it automatically when pyproject.toml is updated?

roclark and others added 9 commits November 13, 2025 10:22

making name unique and add resource shape for lepton raycluster

38269ef

Signed-off-by: Zoey Zhang <zozhang@nvidia.com>

adding head node reference in RayCluster, support defining secrets in…

5f23d75

… RayCluster and linting Signed-off-by: Zoey Zhang <zozhang@nvidia.com>

Fix RayCluster head resource shape

cabaea0

Fix issue to ensure the proper head node resource shape is used if it isn't explicitly given by the user. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Update LeptonRay comments

96d3ac5

Updated the comments in the LeptonRayCluster and LeptonRayJob classes to accurately reflect the code. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Fix RayJob logs streaming connection dropping

89106cf

The RayJob log stream would sometimes time out and reset, producing very long log output in the terminal as the stream repeatedly reset. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Make RayCluster head resource shape optional

5932818

The head node resource shape for a LeptonRayCluster should be optional. If it isn't specified by the user, it should default to the same shape used for the worker nodes. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Add doc for DGXC Lepton RayClusters

8fbfacc

Added an example to the Ray quick-start guide on how to use RayClusters and RayJobs with NeMo-Run on DGX Cloud Lepton. Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark requested review from hemildesai and zoeyz101 November 13, 2025 22:25

roclark added the enhancement label Nov 13, 2025

roclark had a problem deploying to public November 13, 2025 22:27 — with GitHub Actions Failure

github-advanced-security bot found potential problems Nov 13, 2025

nemo_run/run/ray/lepton.py (fixed)

nemo_run/run/ray/lepton.py (fixed)

roclark had a problem deploying to public November 13, 2025 22:27 — with GitHub Actions Failure

Update license date

6814188

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark had a problem deploying to public November 13, 2025 22:29 — with GitHub Actions Failure

zoeyz101 requested changes Nov 14, 2025

roclark added 4 commits November 14, 2025 11:00

Fix Ray guide typo

8465573

Signed-Off-By: Robert Clark <roclark@nvidia.com>

Make cluster readiness timeout a variable

b8d96a8

Allows users to specify how long to wait for a RayCluster to be created on DGX Cloud Lepton. Signed-Off-By: Robert Clark <roclark@nvidia.com>

Remove implicit returns

9cef6d1

Signed-Off-By: Robert Clark <roclark@nvidia.com>

Remove unused local variable

4d138d6

Signed-Off-By: Robert Clark <roclark@nvidia.com>

github-advanced-security bot found potential problems Nov 14, 2025

nemo_run/run/ray/lepton.py (fixed)

roclark had a problem deploying to public November 14, 2025 17:17 — with GitHub Actions Failure

Fix linting errors

92cb237

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark had a problem deploying to public November 14, 2025 17:20 — with GitHub Actions Failure

Fix formatting errors

6436277

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark had a problem deploying to public November 19, 2025 17:00 — with GitHub Actions Failure

roclark had a problem deploying to public November 19, 2025 17:01 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from 65a6cbe to c45e6b0 November 19, 2025 17:27

roclark had a problem deploying to public November 19, 2025 17:29 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from c45e6b0 to b0a22a8 November 19, 2025 17:43

roclark had a problem deploying to public November 19, 2025 17:45 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from b0a22a8 to 9991173 November 19, 2025 17:48

roclark had a problem deploying to public November 19, 2025 17:52 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from 9991173 to 7252cb1 November 19, 2025 17:53

roclark had a problem deploying to public November 19, 2025 17:54 — with GitHub Actions Failure

roclark had a problem deploying to public November 19, 2025 17:55 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from 7252cb1 to 34b93cc November 19, 2025 17:56

roclark had a problem deploying to public November 19, 2025 17:58 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from 34b93cc to b04d893 November 19, 2025 18:03

roclark had a problem deploying to public November 19, 2025 18:05 — with GitHub Actions Failure

roclark force-pushed the roclark/leptonraycluster branch from b04d893 to f637645 November 19, 2025 18:31

roclark had a problem deploying to public November 19, 2025 18:33 — with GitHub Actions Failure

Add Lepton RayCluster tests

5cec02e

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark force-pushed the roclark/leptonraycluster branch from f637645 to 5cec02e November 19, 2025 18:35

roclark had a problem deploying to public November 19, 2025 18:36 — with GitHub Actions Failure

roclark had a problem deploying to public November 19, 2025 18:37 — with GitHub Actions Failure

zoeyz101 approved these changes Nov 25, 2025

roclark commented Nov 25, 2025

roclark merged commit d3be9ac into NVIDIA-NeMo:main Dec 4, 2025
20 of 23 checks passed


2 participants
