fix: fix resourcePerNode override not applied with Volcano scheduler #2982

sksingh2005 · 2025-11-17T20:14:32Z

What this PR does / why we need it:

Computes PodGroup.spec.minResources for Volcano using TrainJob.spec.trainer.resourcesPerNode for the trainer PodSet, instead of the base TrainingRuntime template.
Continues to derive minMember from spec.trainer.numNodes, preserving existing behavior.
Adds a unit test to validate that minResources reflects per-job overrides (multiplied by the trainer node count).
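
To illustrate the arithmetic described above, here is a minimal sketch, not the PR's actual code. It assumes a simplified `ResourceList` (a plain map of integer quantities standing in for `corev1.ResourceList`) and a hypothetical helper named `minResources`; the real plugin works with Kubernetes quantity types.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for corev1.ResourceList:
// resource name -> quantity in base units (millicores, bytes, device count).
type ResourceList map[string]int64

// minResources multiplies the per-node requests by the node count, giving the
// total that a gang-scheduled group of numNodes identical pods must reserve.
func minResources(perNode ResourceList, numNodes int64) ResourceList {
	total := make(ResourceList, len(perNode))
	for name, qty := range perNode {
		total[name] = qty * numNodes
	}
	return total
}

func main() {
	// Assumed job-level override: 4 CPUs (4000m) and 2 GPUs per node, 3 nodes.
	perNode := ResourceList{"cpu": 4000, "nvidia.com/gpu": 2}
	total := minResources(perNode, 3)
	fmt.Println(total["cpu"], total["nvidia.com/gpu"]) // prints "12000 6"
}
```

With the override applied, the group reserves 12 CPUs and 6 GPUs rather than whatever the base runtime template requested.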

Why We Need It

Ensures gang scheduling reserves the correct resources for each job instance, avoiding under-reservation that can delay or mis-schedule distributed training.
Aligns with the expected Volcano behavior where PodGroup.minResources equals the sum of pod resource requests for all members, matching users’ intent when they set resourcesPerNode at submission time.
Improves consistency between runtime defaults and job-level overrides, bringing the computed PodGroup in line with the actual pods that JobSet will run.

Which issue(s) this PR fixes
Fixes #2980
Fixes #2525

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2025-11-17T20:14:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2025-11-17T20:43:46Z

pkg/runtime/framework/plugins/volcano/volcano.go

+		perPod := ps.SinglePodRequests
+		if ps.Ancestor != nil && *ps.Ancestor == constants.AncestorTrainer && trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
+			res := trainJob.Spec.Trainer.ResourcesPerNode
+			if res.Requests != nil && len(res.Requests) > 0 {
+				perPod = res.Requests
+			} else if res.Limits != nil && len(res.Limits) > 0 {
+				perPod = res.Limits
+			}
+		}


Thanks for this work @sksingh2005!
I think we should consider updating the PodSet resources in the MLPolicy plugins, as I shared here: #2980 (comment)

Ideally, SinglePodRequests should already hold the correct value by the time we create the PodGroup object for Volcano/Co-scheduling. We could also consider renaming SinglePodRequests to SinglePodResources to cover both requests and limits.
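
A rough sketch of what that suggestion could look like, with the override applied to the PodSet before any scheduler plugin reads it. All types and names here (`PodSet`, `Requirements`, `applyTrainerOverride`, the `"trainer"` ancestor value) are simplified assumptions for illustration, not the project's real API. The fallback from requests to limits mirrors the diff above.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for corev1.ResourceList.
type ResourceList map[string]int64

// Requirements is a hypothetical stand-in for a requests/limits pair.
type Requirements struct {
	Requests ResourceList
	Limits   ResourceList
}

// PodSet is a reduced, hypothetical version of the runtime Info's PodSet.
type PodSet struct {
	Name              string
	Ancestor          string
	Count             int64
	SinglePodRequests ResourceList
}

// ancestorTrainer is an assumed label value marking the trainer PodSet.
const ancestorTrainer = "trainer"

// applyTrainerOverride replaces the per-pod requests of every trainer PodSet
// with the job-level override. Requests win; limits are used only when no
// requests are given. A nil or empty override leaves the PodSets untouched.
func applyTrainerOverride(podSets []PodSet, override *Requirements) {
	if override == nil {
		return
	}
	perPod := override.Requests
	if len(perPod) == 0 {
		perPod = override.Limits
	}
	if len(perPod) == 0 {
		return
	}
	for i := range podSets {
		if podSets[i].Ancestor == ancestorTrainer {
			podSets[i].SinglePodRequests = perPod
		}
	}
}

func main() {
	sets := []PodSet{
		{Name: "initializer", Ancestor: "dataset-initializer", Count: 1, SinglePodRequests: ResourceList{"cpu": 1000}},
		{Name: "node", Ancestor: ancestorTrainer, Count: 2, SinglePodRequests: ResourceList{"cpu": 2000}},
	}
	applyTrainerOverride(sets, &Requirements{Requests: ResourceList{"cpu": 8000}})
	fmt.Println(sets[0].SinglePodRequests["cpu"], sets[1].SinglePodRequests["cpu"]) // prints "1000 8000"
}
```

Doing the override once, upstream, means Volcano, Co-scheduling, and the JobSet builder would all see the same per-pod value instead of each plugin re-deriving it.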

We might also want to use this value further in the extension framework when we create the JobSet:

trainer/pkg/runtime/framework/plugins/jobset/builder.go

Line 144 in a364e31

// Eventually, we should find better way to propagate resources from TrainJob to JobSet.

cc @astefanutti @tenzen-y

I have updated the changes, sir.

Could you please fix the unit tests? Also, we should consider switching to SinglePodResources to cover both requests and limits.

we should consider to switch to SinglePodResources to cover requests and limits assignment.
Is this with respect to the framework_test file, sir?

@andreyvelich Sir, I fixed the unit tests; the MPI, Torch, and framework tests now pass (they were failing earlier).

Sir, I did not understand this part: "Also, we should consider to switch to SinglePodResources to cover requests and limits assignment."

I suggest that we change SinglePodRequests to SinglePodResources for the Info's PodSet object:

trainer/pkg/runtime/runtime.go

Lines 74 to 76 in a2487b4

// The total PodSet requests can be calculated with

// SinglePodRequests x Count.

SinglePodRequests corev1.ResourceList

And update plugins accordingly.

@tenzen-y @astefanutti Do you have any objections to that?

Should I proceed with this change, sir?

@andreyvelich Sir, I renamed SinglePodRequests to SinglePodResources in the latest commit.

coveralls · 2025-11-19T02:08:28Z

Pull Request Test Coverage Report for Build 20424329036

Details

22 of 27 (81.48%) changed or added relevant lines in 6 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.2%) to 51.589%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/runtime/framework/plugins/plainml/plainml.go	1	6	16.67%

Totals
Change from base Build 20418168718:	0.2%
Covered Lines:	1250
Relevant Lines:	2423

💛 - Coveralls

sksingh2005 · 2025-12-11T21:00:41Z

@andreyvelich Sir, are there any updates I need to make to the changes?


andreyvelich · 2026-01-06T00:19:45Z

pkg/runtime/runtime.go

+			Ancestor:           ancestor,
+			Count:              ptr.To(max(count, 1)),
+			Volumes:            podSpecApply.Volumes,
+			SinglePodResources: resourcehelpers.PodRequests(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),


Sorry for the late reply @sksingh2005!
As I mentioned in this comment: #2982 (comment), SinglePodResources should contain both limits and requests, so its type should be:

SinglePodResources corev1.ResourceRequirements

Then, you can populate requests and limits as follows:

SinglePodResources: corev1.ResourceRequirements{
	Requests: resourcehelpers.PodRequests(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
	Limits:   resourcehelpers.PodLimits(&corev1.Pod{Spec: typedPodSpec}, resourcehelpers.PodResourcesOptions{}),
},

We should also update JobSet builder to correctly use these values from Info object.

cc @astefanutti @tenzen-y

However, I might be wrong, since we should only apply resources to the JobSet's node container, whereas PodLimits() and PodRequests() calculate resources from all Pod containers and initContainers.
So we might need to introduce container-level resources to the Info object.
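
To make the concern concrete, here is a small sketch of how a pod-level request can differ from the request of a single container. It uses simplified, hypothetical types and a hand-written `podRequests` helper rather than the real `resourcehelpers` package. It follows the general Kubernetes rule for classic init containers (effective request per resource is the larger of the summed app containers and the largest init container) and ignores pod overhead and restartable init containers.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for corev1.ResourceList.
type ResourceList map[string]int64

// Container is a reduced, hypothetical container with only a name and requests.
type Container struct {
	Name     string
	Requests ResourceList
}

// podRequests computes an effective pod-level request per resource:
// max(sum over app containers, max over init containers).
func podRequests(containers, initContainers []Container) ResourceList {
	total := ResourceList{}
	for _, c := range containers {
		for name, qty := range c.Requests {
			total[name] += qty
		}
	}
	for _, ic := range initContainers {
		for name, qty := range ic.Requests {
			if qty > total[name] {
				total[name] = qty
			}
		}
	}
	return total
}

func main() {
	containers := []Container{
		{Name: "node", Requests: ResourceList{"cpu": 4000}},
		{Name: "sidecar", Requests: ResourceList{"cpu": 500}},
	}
	initContainers := []Container{
		{Name: "fetch-data", Requests: ResourceList{"cpu": 6000}},
	}
	pod := podRequests(containers, initContainers)
	// Pod-level value versus the "node" container alone.
	fmt.Println(pod["cpu"], containers[0].Requests["cpu"]) // prints "6000 4000"
}
```

In this example the pod-level figure (6000m) is the right input for a scheduler reservation, but writing it back onto the node container would over-provision that container, which is why container-level resources may be needed in the Info object.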

@sksingh2005 Let's postpone the JobSet builder updates to follow-up PRs.

Could you please roll back the changes and use SinglePodRequests corev1.ResourceList in the Info object for now?
After that, just update the Torch, MPI, and PlainML plugins to fix this issue: #2980

Does that sound good to you, @kubeflow/kubeflow-trainer-team?

google-oss-prow bot added the size/L label Nov 17, 2025

google-oss-prow bot requested review from jinchihe and kuizhiqing November 17, 2025 20:14

andreyvelich reviewed Nov 17, 2025


sksingh2005 changed the title ~~volcano integeration for resourcePerNode override in trainjob~~ fix(trainjob): fix resourcePerNode override not applied with Volcano scheduler Nov 17, 2025

sksingh2005 changed the title ~~fix(trainjob): fix resourcePerNode override not applied with Volcano scheduler~~ fix: fix resourcePerNode override not applied with Volcano scheduler Nov 17, 2025

sksingh2005 force-pushed the volcano branch from fbb79aa to c4c32b5 Compare November 18, 2025 06:05

andreyvelich mentioned this pull request Nov 18, 2025

Volcano integration doesn't account for resourcesPerNode override in the trainjob #2980

Open

AndEsterson mentioned this pull request Nov 18, 2025

fix: Allow TrainJobs to override runtime resources (#2980) #2983

Closed

google-oss-prow bot added size/S and removed size/L labels Nov 18, 2025

google-oss-prow bot added size/M size/L and removed size/S size/M labels Nov 19, 2025

sksingh2005 added 9 commits December 22, 2025 12:24

volcano integeration for resourcePerNode override in trainjob

ea56657

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

Fixed the issue where PodGroup resources weren't scaling with spec.tr…

8eda018

…ainer.resourcesPerNode when specified on a TrainJob Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

limits removed

bc40818

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

maintains consistency across all MLPolicy plugins

9fd1c16

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

Fix pre-commit formatting

a2621e8

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

Unit test updates for mpi and torch pluggins

578c1fb

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

framework unit test prblm solved

a4b5470

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

SinglePodRequests to SinglePodResources in Info podset object

d0008df

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

Go fmt error resolve

f5d5c90

Signed-off-by: sksingh2005 <shashanksgh3@gmail.com>

sksingh2005 force-pushed the volcano branch from 8fefcc5 to f5d5c90 Compare December 22, 2025 06:55

andreyvelich mentioned this pull request Dec 24, 2025

Support TrainJob ResourcePerNode in CoScheduling plugin #2525

Open

sksingh2005 requested a review from andreyvelich January 5, 2026 16:15

andreyvelich reviewed Jan 6, 2026

