@sksingh2005 sksingh2005 commented Nov 17, 2025

What this PR does / why we need it:

  • Computes PodGroup.spec.minResources for Volcano using TrainJob.spec.trainer.resourcesPerNode for the trainer PodSet, instead of the base TrainingRuntime template.
  • Continues to derive minMember from spec.trainer.numNodes, preserving existing behavior.
  • Adds a unit test to validate that minResources reflects per-job overrides (multiplied by the trainer node count).

Why We Need It

  • Ensures gang scheduling reserves the correct resources for each job instance, avoiding under-reservation that can delay or mis-schedule distributed training.
  • Aligns with the expected Volcano behavior where PodGroup.minResources equals the sum of pod resource requests for all members, matching users’ intent when they set resourcesPerNode at submission time.
  • Improves consistency between runtime defaults and job-level overrides, bringing the computed PodGroup in line with the actual pods that JobSet will run.
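
The arithmetic described above can be sketched as follows. This is only an illustration: `ResourceList` is a simplified stand-in for `corev1.ResourceList` (plain integers instead of `resource.Quantity`), and `MinResources` plus the example values are hypothetical names and numbers, not code from this PR.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList:
// resource name -> quantity in base units (millicores, device count, ...).
type ResourceList map[string]int64

// MinResources multiplies the per-node requests by the node count. The result
// is the total that a gang scheduler would need to reserve for all members.
func MinResources(perNode ResourceList, numNodes int64) ResourceList {
	total := make(ResourceList, len(perNode))
	for name, qty := range perNode {
		total[name] = qty * numNodes
	}
	return total
}

func main() {
	// Hypothetical values: the runtime template requests 1 CPU per node,
	// while the TrainJob overrides this to 4 CPUs and 1 GPU per node.
	runtimeDefault := ResourceList{"cpu": 1000}
	jobOverride := ResourceList{"cpu": 4000, "nvidia.com/gpu": 1}

	fmt.Println("from runtime template:", MinResources(runtimeDefault, 4))
	fmt.Println("from TrainJob override:", MinResources(jobOverride, 4))
}
```

With these assumed numbers, a PodGroup computed from the runtime template would reserve 4 CPUs and no GPUs, while the pods actually created from the override need 16 CPUs and 4 GPUs in total.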

Which issue(s) this PR fixes
Fixes #2980
Fixes #2525

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 167 to 175
perPod := ps.SinglePodRequests
if ps.Ancestor != nil && *ps.Ancestor == constants.AncestorTrainer && trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
	res := trainJob.Spec.Trainer.ResourcesPerNode
	if res.Requests != nil && len(res.Requests) > 0 {
		perPod = res.Requests
	} else if res.Limits != nil && len(res.Limits) > 0 {
		perPod = res.Limits
	}
}
Member

Thanks for this work @sksingh2005!
I think we should consider updating the PodSet resources in the MLPolicy plugins, as I shared here: #2980 (comment)

Ideally, SinglePodRequests should already hold the correct value by the time we create the PodGroup object for Volcano/Co-scheduling. We could also consider renaming SinglePodRequests to SinglePodResources to cover both requests and limits.

We might also want to use this value further in the extension framework when we create the JobSet:

// Eventually, we should find better way to propagate resources from TrainJob to JobSet.

cc @astefanutti @tenzen-y
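
A minimal sketch of the plugin-side approach suggested in this comment, under stated assumptions: `PodSet`, `Resources`, `ancestorTrainer`, and `applyOverride` are simplified, hypothetical stand-ins for the runtime Info types and for whatever the MLPolicy plugins end up doing. The fallback from requests to limits mirrors the snippet under review and the Kubernetes default of treating limits as requests when no requests are set.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList.
type ResourceList map[string]int64

// Resources mirrors the shape of corev1.ResourceRequirements.
type Resources struct {
	Requests ResourceList
	Limits   ResourceList
}

// PodSet is a simplified stand-in for the PodSet stored in the runtime Info.
type PodSet struct {
	Name              string
	Ancestor          string
	Count             int64
	SinglePodRequests ResourceList
}

// ancestorTrainer is an assumed label value for the trainer PodSet.
const ancestorTrainer = "trainer"

// applyOverride replaces the per-pod requests of every trainer PodSet with the
// job-level override, so later consumers (for example a PodGroup builder) read
// the corrected value. Requests win; limits are used only as a fallback.
func applyOverride(podSets []PodSet, override *Resources) {
	if override == nil {
		return
	}
	perPod := override.Requests
	if len(perPod) == 0 {
		perPod = override.Limits
	}
	if len(perPod) == 0 {
		return
	}
	for i := range podSets {
		if podSets[i].Ancestor == ancestorTrainer {
			podSets[i].SinglePodRequests = perPod
		}
	}
}

func main() {
	podSets := []PodSet{
		{Name: "dataset-initializer", Ancestor: "dataset-initializer", Count: 1, SinglePodRequests: ResourceList{"cpu": 500}},
		{Name: "node", Ancestor: ancestorTrainer, Count: 4, SinglePodRequests: ResourceList{"cpu": 1000}},
	}
	// Hypothetical override that only sets limits.
	applyOverride(podSets, &Resources{Limits: ResourceList{"cpu": 4000, "nvidia.com/gpu": 1}})
	for _, ps := range podSets {
		fmt.Printf("%s: %v x %d\n", ps.Name, ps.SinglePodRequests, ps.Count)
	}
}
```

Note that this sketch replaces the whole per-pod list rather than merging individual resources, which is the same simplification the reviewed snippet makes.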

Author

I have updated the changes, sir.

Member

Could you please fix the unit tests? Also, we should consider switching to SinglePodResources to cover requests and limits assignment.

Author

"we should consider switching to SinglePodResources to cover requests and limits assignment."
Is this with respect to the framework_test file, sir?

Author

@andreyvelich sir, I fixed the unit tests. The MPI, Torch, and framework tests now pass; they were failing earlier.

Author

Sir, I did not understand this part: "Also, we should consider switching to SinglePodResources to cover requests and limits assignment."

Member

I suggest that we change SinglePodRequests to SinglePodResources for the Info's PodSet object:

// The total PodSet requests can be calculated with
// SinglePodRequests x Count.
SinglePodRequests corev1.ResourceList

And update plugins accordingly.

@tenzen-y @astefanutti Do you have any objections to that?

Author

Should I proceed with this change, sir?

Author

@andreyvelich sir, I renamed SinglePodRequests to SinglePodResources in the latest commit.

@sksingh2005 sksingh2005 changed the title from "volcano integeration for resourcePerNode override in trainjob" to "fix(trainjob): fix resourcePerNode override not applied with Volcano scheduler" Nov 17, 2025
@sksingh2005 sksingh2005 changed the title from "fix(trainjob): fix resourcePerNode override not applied with Volcano scheduler" to "fix: fix resourcePerNode override not applied with Volcano scheduler" Nov 17, 2025
@google-oss-prow google-oss-prow bot added size/S and removed size/L labels Nov 18, 2025
@coveralls

coveralls commented Nov 19, 2025

Pull Request Test Coverage Report for Build 20424329036

Details

  • 22 of 27 (81.48%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 51.589%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/runtime/framework/plugins/plainml/plainml.go | 1 | 6 | 16.67% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 20418168718: | 0.2% |
| Covered Lines: | 1250 |
| Relevant Lines: | 2423 |

💛 - Coveralls

@sksingh2005
Author

sksingh2005 commented Dec 11, 2025

@andreyvelich sir, are there any updates I need to make to the changes?

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>
…ainer.resourcesPerNode when specified on a TrainJob

Ancestor: ancestor,
Count: ptr.To(max(count, 1)),
Volumes: podSpecApply.Volumes,
SinglePodResources: resourcehelpers.PodRequests(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
Member

Sorry for the late reply @sksingh2005!
As I mentioned in this comment: #2982 (comment), SinglePodResources should contain both limits and requests, so its type should be:

SinglePodResources corev1.ResourceRequirements

Then, you can populate requests and limits as follows:

			SinglePodResources: corev1.ResourceRequirements{
				Requests: resourcehelpers.PodRequests(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
				Limits:   resourcehelpers.PodLimits(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
			},

We should also update the JobSet builder to correctly use these values from the Info object.

cc @astefanutti @tenzen-y

Member

@andreyvelich andreyvelich Jan 6, 2026

However, I might be wrong, since we should only apply resources to the JobSet's node container, whereas PodLimits() and PodRequests() calculate resources from all Pod containers and initContainers.
So we might need to introduce container-level resources to the Info object.

@sksingh2005 Let's postpone the JobSet builder updates to follow-up PRs.

Could you please roll back the changes so that the Info object uses SinglePodRequests corev1.ResourceList for now?
After that, just update the Torch, MPI, and PlainML plugins to fix this issue: #2980

Does that sound good to you, @kubeflow/kubeflow-trainer-team?
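
To illustrate the concern about pod-level versus container-level resources, here is a simplified sketch of how pod requests are commonly aggregated: the sum over regular containers, raised per resource to the largest init container request. It deliberately ignores restartable init containers and pod overhead, which the real helpers also handle. `Container`, `podRequests`, and all container names and values other than `node` are hypothetical.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList.
type ResourceList map[string]int64

// Container is a simplified stand-in for corev1.Container.
type Container struct {
	Name     string
	Requests ResourceList
}

// podRequests sums the requests of all regular containers, then raises each
// resource to the largest single init container request if that is higher.
// Simplification: sidecar-style init containers and pod overhead are ignored.
func podRequests(initContainers, containers []Container) ResourceList {
	total := ResourceList{}
	for _, c := range containers {
		for name, qty := range c.Requests {
			total[name] += qty
		}
	}
	for _, c := range initContainers {
		for name, qty := range c.Requests {
			if qty > total[name] {
				total[name] = qty
			}
		}
	}
	return total
}

func main() {
	// Hypothetical pod: one init container, the "node" trainer container,
	// and an extra metrics container. CPU in millicores, memory in MiB.
	initContainers := []Container{
		{Name: "fetch-data", Requests: ResourceList{"cpu": 500, "memory": 8192}},
	}
	containers := []Container{
		{Name: "node", Requests: ResourceList{"cpu": 4000, "memory": 4096}},
		{Name: "metrics", Requests: ResourceList{"cpu": 500, "memory": 256}},
	}

	fmt.Println("node container only:", containers[0].Requests)
	fmt.Println("whole pod:", podRequests(initContainers, containers))
}
```

In this example the pod-level value differs from the `node` container's own requests for both CPU and memory, so writing the aggregated value back onto a single container would over-provision it.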

Development

Successfully merging this pull request may close these issues.

Volcano integration doesn't account for resourcesPerNode override in the trainjob
Support TrainJob ResourcePerNode in CoScheduling plugin

3 participants