Conversation

@macsko
Member

@macsko macsko commented Dec 10, 2025

  • One-line PR description: Update KEP-4671 for beta in v1.36. Introduce Workload Scheduling Cycle phase.
  • Other comments: This PR shows an initial idea of the new workload scheduling phase. Further changes related to graduation to beta will be added gradually.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 10, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 10, 2025
@macsko
Member Author

macsko commented Dec 10, 2025


Alternative 1 (Modify sorting logic):

Modify the sorting logic within the existing `PriorityQueue` to put all pods

We would need to modify it once we have workload priority, right? Otherwise we can keep the current sorting algorithm and pull all gang members from the queue once we encounter the first member of the gang (and switch to workload scheduling), right?

Member Author

Right, we could keep the current algorithm, but since the gang members can be in any internal queue, removing them from these queues would require some effort (running PreEnqueue, registering metrics, filling in internal structures used for queuing hints, etc.). In fact, this effort could instead go into properly implementing queueing with a more future-proof alternative.

If we decide that, when there is no workload priority, priority = min(pod priorities) (as Wojtek suggested), then it becomes a big problem with this approach.

Member Author

Yes, if we choose that priority calculation, then this alternative is just a bad idea
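For concreteness, here is a minimal Go sketch of the interaction discussed in this thread: deriving an effective gang priority as the minimum over member pod priorities and feeding it into Alternative 1's queue ordering. All types and helpers (`queuedPod`, `workloadPriority`, `less`) are illustrative stand-ins, not actual kube-scheduler code.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the scheduler's queued pod info.
type queuedPod struct {
	name        string
	priority    int32
	workloadRef string // empty if the pod is not part of a Workload
}

// workloadPriority illustrates the rule discussed above:
// the effective priority of a gang is the minimum of its members' priorities.
func workloadPriority(members []queuedPod) int32 {
	min := members[0].priority
	for _, p := range members[1:] {
		if p.priority < min {
			min = p.priority
		}
	}
	return min
}

// less sketches Alternative 1: order pods by effective priority so all members
// of a gang sort together (using the gang's min priority). This is exactly the
// coupling that becomes hard to maintain once priority depends on the whole group.
func less(a, b queuedPod, groups map[string][]queuedPod) bool {
	pa, pb := a.priority, b.priority
	if a.workloadRef != "" {
		pa = workloadPriority(groups[a.workloadRef])
	}
	if b.workloadRef != "" {
		pb = workloadPriority(groups[b.workloadRef])
	}
	return pa > pb
}

func main() {
	groups := map[string][]queuedPod{
		"job-a": {
			{name: "a-0", priority: 100, workloadRef: "job-a"},
			{name: "a-1", priority: 10, workloadRef: "job-a"},
		},
	}
	solo := queuedPod{name: "solo", priority: 50}
	member := groups["job-a"][0]
	fmt.Println(less(solo, member, groups)) // true: 50 > min(100, 10) = 10
}
```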

Contributor

@erictune erictune left a comment

This looks great overall.

I have one proposal to consider renaming part of the API, and some clarifying questions.

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Contributor

The term PodGroup has become ambiguous. It can mean:

  1. a list item in Workload.podGroups
  2. a specific set of pods having the same PodGroupReplicaKey

These are often the same thing, but not always.

If we want to be clear what we mean, we can either:

  • Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
  • Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup

Contributor

This may seem pedantic, but when reviewing this KEP, I felt that the current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.

Member

My personal preference: when talking about (2), I would just call that either PodGroupReplica or alternatively (which is what I was using) PodGroup instance. With that, I wouldn't rename it - but this is just my personal preference.

Member

If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?

// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand.
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
Contributor

This is the number of items in the PodGroups list

Member

FWIW - these updates are based on what was merged in the code.
I'm happy for these to be updated, but for these specifically maybe we should open a dedicated PR in k/k.

Member Author

Agree with @wojtek-t. We can defer API description discussions to the k/k PR with API graduation (once it is created).

ControllerRef *TypedLocalObjectReference
// PodGroups is the list of pod groups that make up the Workload.
// The maximum number of pod groups is 8. This field is immutable.
Contributor

But the maximum number of PodGroup Replicas is not limited.

Member Author

That's right. It would even be hard to enforce such a limitation in the current form of replication - it's based solely on pods' workloadRefs.

While this won't fully address the "optimal enough" part of requirement (2),
it ensures that all gang pods are processed together.
The single scheduling cycle, together with blocking resources using nomination,
will address requirement (3).
Contributor

In alpha, we avoided the case where two workloads each have some of their pods bound and some not bound, for an unbounded amount of time (deadlock). However, it is possible that both workloads could keep timing out before getting scheduled (what might be called a "livelock").

In beta, IIUC, we avoid the "livelock" case by trying to schedule all the pods of one workload before starting work on another workload.

Member

+1

Member Author

Thanks for the suggestion, applied.


When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
initiates the Workload Scheduling Cycle.
Contributor

If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:

  • Hold it back, on the assumption that it is a replacement for an already scheduled pod. When minCount additional pods show up, then handle all those at once.
  • Treat it as if it is a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through Workload cycle, or just normal pod-at-a-time path?

I'm thinking about races that could happen when a workload is failing and getting pods replaced. And I'm thinking about the case where a workload wants to tolerate a small number of failures by having actual count > minCount.

Member

Taking a step back from the implementation, we should think about what we really need to do conceptually in that case.

If the pod is part of a PodGroup, then what we conceptually want is to schedule it in the context of its PodGroup instance. I think there are effectively three cases here:

  1. this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
  2. this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
  3. this PodGroup instance already satisfies its minCount even without this pod

It's worth noting at this point that, once topology-aware scheduling is introduced, a PodGroup instance could have TAS requirements too, which matters when thinking about what we should do with it.

The last point in my opinion means that we effectively should always go through the Workload cycle, because we always want to consider the pod in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think the exact criteria will evolve in the future (I can definitely imagine them depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on the final solution.

I would suggest starting with:

  • always go through the Workload cycle (to keep the whole PodGroup instance in mind as context)
  • for now, always schedule the individual pod for the sake of making progress here (we will most probably adjust it in the future, but it will be a pretty localized change and I think it's fine)

Member Author

It seems that the easiest way is to do what @wojtek-t described, i.e., go best effort (as with the basic policy): take as many pods as we have available and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good enough implementation.
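As a rough illustration of the approach converged on above (always route gang pods through the Workload cycle, scheduling the individual pod best-effort), here is a hedged sketch; `runWorkloadCycle`, `runPodCycle`, and the surrounding types are hypothetical names, not actual scheduler functions.

```go
// Illustrative sketch only; these types and helpers do not exist in
// kube-scheduler, they just mirror the behavior described above.
package main

import "fmt"

type pod struct {
	name    string
	gangRef string // name of the gang PodGroup instance, "" if none
}

type scheduler struct{}

// runWorkloadCycle is a placeholder for the Workload Scheduling Cycle:
// it considers the pod in the context of its whole PodGroup instance,
// but schedules the newly arrived pod best-effort (no extra waiting).
func (s *scheduler) runWorkloadCycle(p pod) bool {
	fmt.Printf("workload cycle for %s (gang %s)\n", p.name, p.gangRef)
	return true
}

// runPodCycle is a placeholder for the normal pod-by-pod scheduling cycle.
func (s *scheduler) runPodCycle(p pod) {
	fmt.Printf("pod-by-pod cycle for %s\n", p.name)
}

// scheduleOne sketches the routing discussed above: gang pods always enter
// the Workload cycle first, even if their group is already scheduled, and
// only pods that pass it continue to the standard cycle.
func (s *scheduler) scheduleOne(p pod) {
	if p.gangRef != "" {
		if !s.runWorkloadCycle(p) {
			return // group not placeable now; pod waits for a cluster change
		}
	}
	s.runPodCycle(p)
}

func main() {
	s := &scheduler{}
	s.scheduleOne(pod{name: "worker-7", gangRef: "training-job"})
	s.scheduleOne(pod{name: "standalone"})
}
```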

They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
Contributor

IIUC, there are some cases where a nominated pod fails to pass through the standard scheduling cycle:

  • differences in the algorithm between the workload cycle and the standard cycle
  • a refreshed snapshot changes node eligibility
  • a higher-priority pod jumps ahead in the active queue

Member Author

Yes, these are the cases, but we are okay when they occur, as long as the new placement is still valid. If it isn't anymore, scheduling of the gang will fail (at WaitOnPermit) and the gang will be retried.
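For illustration, a small sketch of the "restoring their original timestamps" step from the quoted text, under the assumption of hypothetical types (`queuedPodInfo`, `activeQueue`); the only point is that re-enqueued gang pods keep their original enqueue time so they neither lose nor gain fairness relative to other pending pods.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical queued-pod record; real kube-scheduler types differ.
type queuedPodInfo struct {
	name           string
	nominatedNode  string
	initialAddTime time.Time // original enqueue timestamp, preserved across cycles
}

type activeQueue struct{ items []queuedPodInfo }

func (q *activeQueue) add(p queuedPodInfo) { q.items = append(q.items, p) }

// pushNominatedGang re-adds gang members to the active queue after the
// Workload cycle has picked nodes for them. Keeping initialAddTime intact
// (rather than resetting it to time.Now()) is what "restoring their original
// timestamps to ensure fairness" refers to in the quoted text.
func pushNominatedGang(q *activeQueue, members []queuedPodInfo, nominations map[string]string) {
	for _, m := range members {
		m.nominatedNode = nominations[m.name]
		q.add(m) // initialAddTime deliberately left untouched
	}
}

func main() {
	q := &activeQueue{}
	t0 := time.Now().Add(-30 * time.Second)
	members := []queuedPodInfo{
		{name: "rank-0", initialAddTime: t0},
		{name: "rank-1", initialAddTime: t0},
	}
	pushNominatedGang(q, members, map[string]string{"rank-0": "node-a", "rank-1": "node-b"})
	for _, p := range q.items {
		fmt.Println(p.name, p.nominatedNode, p.initialAddTime.Equal(t0))
	}
}
```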

tackle the hardest placement problems early.

3. The scheduler iterates through the sorted sub-groups. It finds a feasible node
for each pod from a sub-group using standard filtering and scoring phases.
Contributor

Do you plan to use the exact same plugins for the Workload Cycle as for the Scheduling Cycle? Do you plan to have the SchedulerConfig options apply the same way to both cycles?

Member

Yes - we should use the exact same plugins in both cases.

Member Author

Added a sentence about that
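A hedged sketch of what "the exact same plugins" could mean mechanically: both cycles consult one plugin set built from SchedulerConfig, so configuration changes affect them identically. The `filterPlugin` and `profile` types below are illustrative, not the real scheduling framework interfaces.

```go
package main

import "fmt"

// Illustrative plugin interface; kube-scheduler's real framework differs.
type filterPlugin interface {
	Name() string
	Filter(pod, node string) bool
}

type nodeResourcesFit struct{}

func (nodeResourcesFit) Name() string                 { return "NodeResourcesFit" }
func (nodeResourcesFit) Filter(pod, node string) bool { return true }

// profile holds the single plugin set built from SchedulerConfig.
// Both the Workload cycle and the pod-by-pod cycle consult the same slice,
// which is the property being agreed on in the thread above.
type profile struct{ filters []filterPlugin }

func (p profile) feasible(pod, node string) bool {
	for _, f := range p.filters {
		if !f.Filter(pod, node) {
			return false
		}
	}
	return true
}

func (p profile) workloadCycle(pods, nodes []string) {
	for _, pod := range pods {
		for _, n := range nodes {
			if p.feasible(pod, n) {
				fmt.Printf("workload cycle: %s -> %s\n", pod, n)
				break
			}
		}
	}
}

func (p profile) podCycle(pod string, nodes []string) {
	for _, n := range nodes {
		if p.feasible(pod, n) {
			fmt.Printf("pod cycle: %s -> %s\n", pod, n)
			return
		}
	}
}

func main() {
	prof := profile{filters: []filterPlugin{nodeResourcesFit{}}}
	prof.workloadCycle([]string{"rank-0", "rank-1"}, []string{"node-a"})
	prof.podCycle("rank-0", []string{"node-a"})
}
```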

Member

@wojtek-t wojtek-t left a comment

Added a few comments, but they are pretty localized - overall this is a great proposal, pretty aligned with how I was thinking about it too.

Can report dedicated logs and metrics with less confusion to the user.
* *Cons:* Significant and non-trivial architectural change to the scheduling queue
and `scheduleOne` loop.
<<[/UNRESOLVED]>>
Member

I would value more the opinions of people who have been more hands-on with the scheduler code recently than I have, but on paper the third alternative seems preferable to me:

  1. Alternative 1 - it will be extremely hard to reason about if pods sit in different queues (backoff, active, ...) independently, and the evolution point is extremely important IMHO - we know we will be evolving this a lot, and preparing for that is super important.

  2. Alternative 2 - I share the concerns about corner cases (pod deletion is exactly what I have in mind). Once we get to rescheduling cases (new pods appear but they should trigger removing and recreating some of the existing pods), it will get even more complicated, with even harder corner cases. Given that we know that rescheduling is super important, I'm also reluctant about it.

  3. Alternative 3 - It's clearly the cleanest option. The primary drawback, the need for non-trivial changes, is IMHO justified given we know that we will be evolving it significantly in the future.

So I have a quite strong preference for (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.

Member Author

That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I'm not seeing some significant drawbacks.

For the third alternative, do we want to introduce only a queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by the workload queue a structure that will hold all the pods related to workload scheduling?

Let's say that we add a Pod that is part of a group but does not make us meet the min count. Should it land in the unschedulable queue as it does right now? Or in some additional structure?

If the pod group fails scheduling, where do we put its pods? We need to have them somewhere, so we can check them against cluster events in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.

Member Author

@macsko macsko Dec 12, 2025

I think one way might be to introduce both:

  • a queue for pod groups (let's say podGroupsQ). This will contain the pod groups that can be popped for workload scheduling
  • a data structure for unscheduled pods (let's say workloadPods) that have to wait for workload scheduling to finish to proceed with their pod-by-pod cycles.

I imagine the pod transitions (for pods that have a workloadRef) would be:

pod is created -> when PreEnqueue passes for the pod, add it to workloadPods, otherwise add it to unschedulablePods -> [for a gang pod group] if >= minCount pods for the group are in workloadPods, the group is added to podGroupsQ -> the pod group is processed and the workload scheduling cycle is executed

When the workload cycle finishes successfully:

pod gets the NNN and is moved to the activeQ -> pod is popped and goes through its pod-by-pod scheduling cycle -> ...

When the workload cycle fails:

pod is moved to unschedulablePods, where it waits for a cluster change to happen -> when the change happens for this pod or another one from its pod group, the pod(s) are moved to workloadPods -> the workload manager detects that the group should be retried and adds the group to podGroupsQ -> processing continues as before
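A minimal sketch of the two structures proposed above; the names `podGroupsQ` and `workloadPods` follow the comment, everything else (types, methods) is hypothetical. It shows how a gang group becomes poppable for a Workload Scheduling Cycle once >= minCount of its pods have passed PreEnqueue.

```go
package main

import "fmt"

// Hypothetical structures mirroring the proposal above; not actual
// kube-scheduler code.
type podGroupKey struct{ workload, group string }

type workloadQueues struct {
	// workloadPods holds unscheduled gang pods that passed PreEnqueue and are
	// waiting for their group's Workload Scheduling Cycle.
	workloadPods map[podGroupKey][]string
	// podGroupsQ holds groups that are ready to be popped for workload scheduling.
	podGroupsQ []podGroupKey
	minCount   map[podGroupKey]int
}

// addPod is called after PreEnqueue passes for a gang pod. Once the group
// reaches minCount waiting pods, the group is enqueued for a workload cycle.
func (w *workloadQueues) addPod(key podGroupKey, podName string) {
	w.workloadPods[key] = append(w.workloadPods[key], podName)
	if len(w.workloadPods[key]) == w.minCount[key] {
		w.podGroupsQ = append(w.podGroupsQ, key)
	}
}

// pop takes the next group for a Workload Scheduling Cycle together with all
// of its currently waiting pods.
func (w *workloadQueues) pop() (podGroupKey, []string, bool) {
	if len(w.podGroupsQ) == 0 {
		return podGroupKey{}, nil, false
	}
	key := w.podGroupsQ[0]
	w.podGroupsQ = w.podGroupsQ[1:]
	pods := w.workloadPods[key]
	delete(w.workloadPods, key)
	return key, pods, true
}

func main() {
	q := &workloadQueues{
		workloadPods: map[podGroupKey][]string{},
		minCount:     map[podGroupKey]int{{"job-a", "workers"}: 2},
	}
	q.addPod(podGroupKey{"job-a", "workers"}, "w-0")
	q.addPod(podGroupKey{"job-a", "workers"}, "w-1")
	key, pods, ok := q.pop()
	fmt.Println(ok, key, pods)
}
```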

* If preemption is successful, the pod is nominated on the selected node.
* If preemption fails, the pod is considered unscheduled for this cycle.

The phase can effectively stop once `minCount` pods have a placement,
Member

From the POV of optimizing locally, we should continue scheduling all pods (and not stop after minCount) - what changes is that the unschedulability of further pods shouldn't make the whole group unschedulable.

Given that we have minCount at the whole PodGroup level, the local optimization is actually what we should probably do anyway (it would be a different story if we had a separate minCount per signature, but that's not the case).
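To make that concrete, a small sketch (with hypothetical helpers) of the behavior described above: keep trying to place every pod in the group, and treat the gang as schedulable when at least minCount pods got a placement, rather than stopping early or failing the whole group because an extra pod doesn't fit.

```go
package main

import "fmt"

// placeGroup keeps trying to place every pod in the group and judges the gang
// by whether at least minCount pods got a placement, instead of stopping as
// soon as minCount is reached. findNode stands in for the filtering/scoring path.
func placeGroup(pods []string, minCount int, findNode func(string) (string, bool)) (map[string]string, bool) {
	placements := map[string]string{}
	for _, p := range pods {
		if node, ok := findNode(p); ok {
			placements[p] = node
		}
		// Note: no early return once len(placements) == minCount; the extra pods
		// are scheduled opportunistically, and their failures don't fail the gang.
	}
	return placements, len(placements) >= minCount
}

func main() {
	// Toy feasibility function: w-1 cannot be placed anywhere.
	findNode := func(pod string) (string, bool) {
		if pod == "w-1" {
			return "", false
		}
		return "node-a", true
	}
	placements, ok := placeGroup([]string{"w-0", "w-1", "w-2"}, 2, findNode)
	fmt.Println(ok, placements) // true map[w-0:node-a w-2:node-a]
}
```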

gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`).
```md
<<[UNRESOLVED Pod-by-pod cycle preemption]>>
Should gang pods be allowed to preempt anything in their pod-by-pod cycles?
Member

It makes it much harder to reason about. So while we eventually may need it, I would start with saying "no" :)


For pod groups using the `Basic` policy, the Workload Scheduling Cycle is
optional. In the `Beta` timeframe, we may opportunistically apply this cycle to
`Basic` pod groups to leverage the batching performance benefits, but the
Member

+1

@wojtek-t wojtek-t self-assigned this Dec 11, 2025
Comment on lines 423 to 424
// +oneOf=PolicySelection
Basic *BasicSchedulingPolicy
Member

IIRC, we discussed removing the Basic policy for now. Did we decide to keep it?

Member

I'm pretty convinced we need that - we want to be able to represent an arbitrary workload with the Workload structure, and we need some kind of "default" policy. My thoughts are here:
kubernetes/kubernetes#135179 (comment)

But I agree we need to explain that more clearly in the KEP.
@macsko

Member Author

I'm going to add a new section about extending the basic policy

`Basic` pod groups to leverage the batching performance benefits, but the
"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to
schedule as many pods from such PodGroup as possible.

Member

I thought I had added this comment, but apparently I forgot.

Maciek - can you please extend this subsection with how we're going to approach the usefulness of BasicPolicy - in particular the thoughts from here:
kubernetes/kubernetes#135179 (comment)

[feel free to propose an alternative approach, but we basically want to address the concerns mentioned there]

@macsko macsko force-pushed the gang_scheduling_beta branch from a63f0ef to 71ef75a Compare December 12, 2025 12:23
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: macsko
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2025
@macsko macsko force-pushed the gang_scheduling_beta branch from 71ef75a to 9ce3fc7 Compare December 12, 2025 12:23
@macsko macsko force-pushed the gang_scheduling_beta branch from 9ce3fc7 to eae3ddb Compare December 12, 2025 15:30
@k8s-ci-robot
Contributor

@macsko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-enhancements-verify
Commit: eae3ddb
Details: link
Required: true
Rerun command: /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
