-
Notifications
You must be signed in to change notification settings - Fork 863
fix: fix resourcePerNode override not applied with Volcano scheduler #2982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| perPod := ps.SinglePodRequests | ||
| if ps.Ancestor != nil && *ps.Ancestor == constants.AncestorTrainer && trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil { | ||
| res := trainJob.Spec.Trainer.ResourcesPerNode | ||
| if res.Requests != nil && len(res.Requests) > 0 { | ||
| perPod = res.Requests | ||
| } else if res.Limits != nil && len(res.Limits) > 0 { | ||
| perPod = res.Limits | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this work @sksingh2005!
I think, we should consider approach to update PodSet resources in the MLPolicy plugins as I shared here: #2980 (comment)
Ideally, SinglePodRequests should have correct value once we create PodGroup object for Volcano/Co-scheduling. We can also consider to rename SinglePodRequests to SinglePodResources to cover requests/limits.
We might also want to use this value further in the extension framework, when we create JobSet:
| // Eventually, we should find better way to propagate resources from TrainJob to JobSet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the changes sir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you fix the unit tests? Also, we should consider to switch to SinglePodResources to cover requests and limits assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should consider to switch to SinglePodResources to cover requests and limits assignment.
this is with respect to framework_test file sir ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich sir fixed the unit test with ok in all mpi , torch and framework as they were not passing earlier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sir i did not understood "Also, we should consider to switch to SinglePodResources to cover requests and limits assignment."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we change SinglePodRequests to SinglePodResources for the Info's PodSet object:
trainer/pkg/runtime/runtime.go
Lines 74 to 76 in a2487b4
| // The total PodSet requests can be calculated with | |
| // SinglePodRequests x Count. | |
| SinglePodRequests corev1.ResourceList |
And update plugins accordingly.
@tenzen-y @astefanutti Do you have any objections with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should i proceed changing this sir ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich sir made the changes of SinglePodRequests to SinglePodResources in latest commit
Pull Request Test Coverage Report for Build 20424329036Details
💛 - Coveralls |
|
@andreyvelich sir any update i have to make in the changes? |
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
…ainer.resourcesPerNode when specified on a TrainJob Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
| Ancestor: ancestor, | ||
| Count: ptr.To(max(count, 1)), | ||
| Volumes: podSpecApply.Volumes, | ||
| SinglePodResources: resourcehelpers.PodRequests(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply @sksingh2005!
As I mentioned in this comment: #2982 (comment), SinglePodResources should consistent both limits and requests, so the type of it should be:
SinglePodResources corev1.ResourceRequirementsThen, you can populate request and limits as follows:
SinglePodRequests: corev1.ResourceRequirements{
Requests: resourcehelpers.PodRequests(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
Limits: resourcehelpers.PodLimits(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
},We should also update JobSet builder to correctly use these values from Info object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I might be wrong since we should only apply resources to the JobSet's node container, but PodLimits() and PodRequests() calculates resources from all Pod containers and initContainers.
So we might need to introduce container level resources to the Info object.
@sksingh2005 Let's post-pone the JobSet builder updates to the next PRs.
Please can you rollback changes to use SinglePodRequests corev1.ResourceList in the Info object for now.
After that, just update the Torch, MPI, and PlainML plugin to fix this issue: #2980
Does it sound good to you @kubeflow/kubeflow-trainer-team ?
What this PR does / why we need it:
Why We Need It
Which issue(s) this PR fixes
Fixes #2980
Fixes #2525
Checklist: