Add RayCluster support for DGX Cloud Lepton #389
Add support for launching a RayCluster on DGX Cloud Lepton and submitting RayJobs to those clusters using the lepton SDK. This uses the new RayCluster feature on DGX Cloud Lepton to dynamically spin clusters up and down via the Python SDK; jobs can then be submitted directly to deployed clusters. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
… RayCluster and linting Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Removed the placeholder Slurm packager handling comments from the Lepton RayCluster code. For now, the "workdir" parameter should be used for transferring local data to the remote Ray cluster. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Fix issue to ensure the proper head node resource shape is used if it isn't explicitly given by the user. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Updated the comments in the LeptonRayCluster and LeptonRayJob classes to accurately reflect the code. Signed-Off-By: Robert Clark <roclark@nvidia.com>
The RayJob log stream would sometimes time out and reset, flooding the terminal with a very long run of logs as it continually reset. Signed-Off-By: Robert Clark <roclark@nvidia.com>
The head node resource shape for a LeptonRayCluster should be optional. If it isn't specified by the user, it should default to the same shape used for the worker nodes. Signed-Off-By: Robert Clark <roclark@nvidia.com>
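The defaulting behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ClusterSpec`, its field names, and the shape labels are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterSpec:
    """Hypothetical stand-in for a cluster config; not the real LeptonRayCluster."""

    worker_resource_shape: str
    # Optional: when omitted, the head node reuses the worker shape.
    head_resource_shape: Optional[str] = None

    def effective_head_shape(self) -> str:
        # Fall back to the worker shape when no head shape was given.
        return self.head_resource_shape or self.worker_resource_shape
```

With this shape of API, a user who only sets the worker shape gets a head node of the same shape, while an explicit head shape is always respected.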
Added an example to the Ray quick-start guide on how to use RayClusters and RayJobs with NeMo-Run on DGX Cloud Lepton. Signed-Off-By: Robert Clark <roclark@nvidia.com>
Signed-Off-By: Robert Clark <roclark@nvidia.com>
zoeyz101
left a comment
A couple small changes requested and some minor questions
| `run.ray.job.RayJob` | Lifecycle of a Ray **job** (submit ⇒ monitor ⇢ logs ⇢ cancel). | same |

- The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s) or a **Slurm** job under the hood.
+ The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s), **DGX Cloud Lepton's RayCluster**, or a **Slurm** job under the hood.
Three instead of two helpers, and I would say "three helper executors".
I interpreted the helpers being the RayCluster and RayJob objects with the different executors being the backends.
I interpreted this as the helpers being the executors. @hemildesai can you confirm? RayJob and RayCluster do not have the same API
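For readers following this thread, one way to picture the "executor decides the backend" reading is the sketch below. It is only an illustration under stated assumptions: the three classes are empty stand-ins that borrow the executor names discussed here, and `select_backend` and the backend labels are hypothetical, not part of NeMo-Run.

```python
class KubeRayExecutor:
    """Empty stand-in; not the real NeMo-Run executor."""


class LeptonExecutor:
    """Empty stand-in; not the real NeMo-Run executor."""


class SlurmExecutor:
    """Empty stand-in; not the real NeMo-Run executor."""


# Hypothetical mapping from executor type to a backend label.
_BACKENDS = {
    KubeRayExecutor: "kuberay",
    LeptonExecutor: "lepton",
    SlurmExecutor: "slurm",
}


def select_backend(executor) -> str:
    """Return the backend label for the given executor instance."""
    for executor_type, backend in _BACKENDS.items():
        if isinstance(executor, executor_type):
            return backend
    raise ValueError(f"Unsupported executor: {type(executor).__name__}")
```

Under this reading, the cluster and job helpers are the two objects sharing an interface, and the executor passed to them picks one of three backends.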
Signed-Off-By: Robert Clark <roclark@nvidia.com>
Allows users to specify how long to wait for a RayCluster to be created on DGX Cloud Lepton. Signed-Off-By: Robert Clark <roclark@nvidia.com>
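A user-configurable creation timeout of this kind is typically a poll-until-deadline loop. The sketch below shows that general pattern only; it is not the implementation in this PR, and `wait_until_ready`, the `get_status` callable, and the `"Ready"` status string are all hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout: float,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until it reports "Ready" or the timeout elapses.

    Returns True if the cluster became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == "Ready":
            return True
        if time.monotonic() >= deadline:
            return False
        # Wait before the next status check to avoid hammering the API.
        sleep(poll_interval)
```

Using `time.monotonic()` keeps the deadline unaffected by wall-clock adjustments, and injecting `sleep` makes the loop easy to exercise without real delays.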
Signed-Off-By: Robert Clark <roclark@nvidia.com>
Signed-Off-By: Robert Clark <roclark@nvidia.com>
Signed-Off-By: Robert Clark <roclark@nvidia.com>
Signed-Off-By: Robert Clark <roclark@nvidia.com>
65a6cbe to c45e6b0
c45e6b0 to b0a22a8
b0a22a8 to 9991173
9991173 to 7252cb1
7252cb1 to 34b93cc
34b93cc to b04d893
b04d893 to f637645
Signed-Off-By: Robert Clark <roclark@nvidia.com>
f637645 to 5cec02e
zoeyz101
left a comment
LGTM!
@hemildesai - I updated the pyproject.toml file as this update requires a newer minimum version of leptonai and it looked like the ray packages were missing from extras. Do you prefer that I update the uv.lock file with this update, or is there a mechanism to update it automatically when pyproject.toml is updated?
Added `LeptonRayCluster` and `LeptonRayJob` classes backed by the `RayCluster` and `RayJob` helpers, respectively, to launch Ray jobs and clusters using the RayCluster feature on DGX Cloud Lepton. Developers can now launch a RayCluster and RayJobs via the SDK remotely on their DGX Cloud Lepton node groups.