Conversation

@macsko
Member

@macsko macsko commented Dec 10, 2025

  • One-line PR description: Update KEP-4671 for beta in v1.36. Introduce Workload Scheduling Cycle phase.
  • Other comments: This PR shows an initial idea of the new workload scheduling phase. Further changes related to graduation to beta will be added gradually.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 10, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 10, 2025
@macsko
Member Author

macsko commented Dec 10, 2025


Alternative 1 (Modify sorting logic):

Modify the sorting logic within the existing `PriorityQueue` to put all pods

We would need to modify it once we have workload priority, right? Otherwise we can keep the current sorting algorithm and pull all gang members from the queue once we encounter the first member of the gang (and switch to workload scheduling), right?

Member Author

Right, we could keep the current algorithm, but since the gang members can be in any internal queue, removing them from these queues would require some effort (running PreEnqueue, registering metrics, filling in internal structures used for queuing hints, etc.). In fact, this effort could instead go into properly implementing queueing with a more future-proof alternative.

If we decide that, when there is no workload priority, priority = min(pod priorities) (as Wojtek suggested), then it becomes a big problem with this approach.

Member Author

Yes, if we choose that priority calculation, then this alternative is just a bad idea
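For concreteness, here is a minimal Go sketch of the interaction discussed in this thread: deriving an effective gang priority as the minimum over member pod priorities and feeding it into Alternative 1's queue ordering. All types and helpers (`queuedPod`, `workloadPriority`, `less`) are illustrative stand-ins, not actual kube-scheduler code.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the scheduler's queued pod info.
type queuedPod struct {
	name        string
	priority    int32
	workloadRef string // empty if the pod is not part of a Workload
}

// workloadPriority illustrates the rule discussed above:
// the effective priority of a gang is the minimum of its members' priorities.
func workloadPriority(members []queuedPod) int32 {
	min := members[0].priority
	for _, p := range members[1:] {
		if p.priority < min {
			min = p.priority
		}
	}
	return min
}

// less sketches Alternative 1: order pods by effective priority so all members
// of a gang sort together (using the gang's min priority). This is exactly the
// coupling that becomes hard to maintain once priority depends on the whole group.
func less(a, b queuedPod, groups map[string][]queuedPod) bool {
	pa, pb := a.priority, b.priority
	if a.workloadRef != "" {
		pa = workloadPriority(groups[a.workloadRef])
	}
	if b.workloadRef != "" {
		pb = workloadPriority(groups[b.workloadRef])
	}
	return pa > pb
}

func main() {
	groups := map[string][]queuedPod{
		"job-a": {
			{name: "a-0", priority: 100, workloadRef: "job-a"},
			{name: "a-1", priority: 10, workloadRef: "job-a"},
		},
	}
	solo := queuedPod{name: "solo", priority: 50}
	member := groups["job-a"][0]
	fmt.Println(less(solo, member, groups)) // true: 50 > min(100, 10) = 10
}
```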

Contributor

@erictune erictune left a comment

This looks great overall.

I have one proposal to consider renaming part of the API, and some clarifying questions.

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Contributor

The term PodGroup has become ambiguous. It can mean:

  1. a list item in Workload.podGroups
  2. a specific set of pods having the same PodGroupReplicaKey

These are often the same thing, but not always.

If we want to be clear what we mean, we can either:

  • Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
  • Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup

Contributor

This may seem pedantic, but when reviewing this KEP, I felt that the current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.

Member

My personal preference: when talking about (2), I would just call that either PodGroupReplica or alternatively (which is what I was using) PodGroup instance. With that, I wouldn't rename it - but this is just my personal preference.

Member

If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?

// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand.
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
Contributor

This is the number of items in the PodGroups list

Member

FWIW - these updates are based on what was merged in the code.
I'm happy for these to be updated, but for these specifically maybe we should open a dedicated PR in k/k.

Member Author

Agree with @wojtek-t. We can defer API description discussions to the k/k PR with API graduation (once it is created).

ControllerRef *TypedLocalObjectReference
// PodGroups is the list of pod groups that make up the Workload.
// The maximum number of pod groups is 8. This field is immutable.
Contributor

But the maximum number of PodGroup Replicas is not limited.

Member Author

That's right. It would even be hard to enforce such a limitation in the current form of replication - it's based solely on pods' workloadRefs.

While this won't fully address the "optimal enough" part of requirement (2),
it ensures that all gang pods are processed together.
The single scheduling cycle, together with blocking resources using nomination,
will address requirement (3).
Contributor

In alpha, we avoided the case where two workloads each have some of their pods bound and some not bound, for an unbounded amount of time (deadlock). However, it is possible that both workloads could keep timing out before getting scheduled (what might be called a "livelock").

In beta, IIUC, we avoid the "livelock" case by trying to schedule all the pods of one workload before starting work on another workload.

Member

+1

Member Author

Thanks for the suggestion, applied.


When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
initiates the Workload Scheduling Cycle.
Contributor

If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:

  • Hold it back, on the assumption that it is a replacement for an already scheduled pod. When minCount additional pods show up, then handle all those at once.
  • Treat it as if it is a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through Workload cycle, or just normal pod-at-a-time path?

I'm thinking about races that could happen when a workload is failing and getting pods replaced. And I'm thinking about the case where a workload wants to tolerate a small number of failures by having actual count > minCount.

Member

Taking a step back from the implementation, we should think about what we really need to do conceptually in that case.

If the pod is part of a PodGroup, then what we conceptually want is to schedule it in the context of its PodGroup instance. I think there are effectively three cases here:

  1. this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
  2. this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
  3. this PodGroup instance already satisfies its minCount even without this pod

It's worth noting at this point that, once topology-aware scheduling is introduced, a PodGroup instance could have TAS requirements too, which matters when thinking about what we should do with it.

The last point in my opinion means that we effectively should always go through the Workload cycle, because we always want to consider the pod in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think the exact criteria will evolve in the future (I can definitely imagine them depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on the final solution.

I would suggest starting with:

  • always go through the Workload cycle (to keep the whole PodGroup instance in mind as context)
  • for now, always schedule the individual pod for the sake of making progress here (we will most probably adjust it in the future, but it will be a pretty localized change and I think it's fine)

Member Author

It seems that the easiest way is to do what @wojtek-t described, i.e., go best effort (as with the basic policy): take as many pods as we have available and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good enough implementation.
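As a rough illustration of the approach converged on above (always route gang pods through the Workload cycle, scheduling the individual pod best-effort), here is a hedged sketch; `runWorkloadCycle`, `runPodCycle`, and the surrounding types are hypothetical names, not actual scheduler functions.

```go
// Illustrative sketch only; these types and helpers do not exist in
// kube-scheduler, they just mirror the behavior described above.
package main

import "fmt"

type pod struct {
	name    string
	gangRef string // name of the gang PodGroup instance, "" if none
}

type scheduler struct{}

// runWorkloadCycle is a placeholder for the Workload Scheduling Cycle:
// it considers the pod in the context of its whole PodGroup instance,
// but schedules the newly arrived pod best-effort (no extra waiting).
func (s *scheduler) runWorkloadCycle(p pod) bool {
	fmt.Printf("workload cycle for %s (gang %s)\n", p.name, p.gangRef)
	return true
}

// runPodCycle is a placeholder for the normal pod-by-pod scheduling cycle.
func (s *scheduler) runPodCycle(p pod) {
	fmt.Printf("pod-by-pod cycle for %s\n", p.name)
}

// scheduleOne sketches the routing discussed above: gang pods always enter
// the Workload cycle first, even if their group is already scheduled, and
// only pods that pass it continue to the standard cycle.
func (s *scheduler) scheduleOne(p pod) {
	if p.gangRef != "" {
		if !s.runWorkloadCycle(p) {
			return // group not placeable now; pod waits for a cluster change
		}
	}
	s.runPodCycle(p)
}

func main() {
	s := &scheduler{}
	s.scheduleOne(pod{name: "worker-7", gangRef: "training-job"})
	s.scheduleOne(pod{name: "standalone"})
}
```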

They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
Contributor

IIUC, there are some cases where a nominated pod fails to pass through the standard scheduling cycle:

  • differences in the algorithm between the workload cycle and the standard cycle
  • a refreshed snapshot changes node eligibility
  • a higher-priority pod jumps ahead in the active queue

Member Author

Yes, these are the cases, but we are okay when they occur, as long as the new placement is still valid. If it isn't anymore, scheduling of the gang will fail (at WaitOnPermit) and the gang will be retried.
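For illustration, a small sketch of the "restoring their original timestamps" step from the quoted text, under the assumption of hypothetical types (`queuedPodInfo`, `activeQueue`); the only point is that re-enqueued gang pods keep their original enqueue time so they neither lose nor gain fairness relative to other pending pods.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical queued-pod record; real kube-scheduler types differ.
type queuedPodInfo struct {
	name           string
	nominatedNode  string
	initialAddTime time.Time // original enqueue timestamp, preserved across cycles
}

type activeQueue struct{ items []queuedPodInfo }

func (q *activeQueue) add(p queuedPodInfo) { q.items = append(q.items, p) }

// pushNominatedGang re-adds gang members to the active queue after the
// Workload cycle has picked nodes for them. Keeping initialAddTime intact
// (rather than resetting it to time.Now()) is what "restoring their original
// timestamps to ensure fairness" refers to in the quoted text.
func pushNominatedGang(q *activeQueue, members []queuedPodInfo, nominations map[string]string) {
	for _, m := range members {
		m.nominatedNode = nominations[m.name]
		q.add(m) // initialAddTime deliberately left untouched
	}
}

func main() {
	q := &activeQueue{}
	t0 := time.Now().Add(-30 * time.Second)
	members := []queuedPodInfo{
		{name: "rank-0", initialAddTime: t0},
		{name: "rank-1", initialAddTime: t0},
	}
	pushNominatedGang(q, members, map[string]string{"rank-0": "node-a", "rank-1": "node-b"})
	for _, p := range q.items {
		fmt.Println(p.name, p.nominatedNode, p.initialAddTime.Equal(t0))
	}
}
```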

tackle the hardest placement problems early.

3. The scheduler iterates through the sorted sub-groups. It finds a feasible node
for each pod from a sub-group using standard filtering and scoring phases.
Contributor

Do you plan to use the exact same plugins for the Workload Cycle as for the Scheduling Cycle? Do you plan to have the SchedulerConfig options apply the same way to both cycles?

Member

Yes - we should use the exact same plugins in both cases.

Member Author

Added a sentence about that
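A hedged sketch of what "the exact same plugins" could mean mechanically: both cycles consult one plugin set built from SchedulerConfig, so configuration changes affect them identically. The `filterPlugin` and `profile` types below are illustrative, not the real scheduling framework interfaces.

```go
package main

import "fmt"

// Illustrative plugin interface; kube-scheduler's real framework differs.
type filterPlugin interface {
	Name() string
	Filter(pod, node string) bool
}

type nodeResourcesFit struct{}

func (nodeResourcesFit) Name() string                 { return "NodeResourcesFit" }
func (nodeResourcesFit) Filter(pod, node string) bool { return true }

// profile holds the single plugin set built from SchedulerConfig.
// Both the Workload cycle and the pod-by-pod cycle consult the same slice,
// which is the property being agreed on in the thread above.
type profile struct{ filters []filterPlugin }

func (p profile) feasible(pod, node string) bool {
	for _, f := range p.filters {
		if !f.Filter(pod, node) {
			return false
		}
	}
	return true
}

func (p profile) workloadCycle(pods, nodes []string) {
	for _, pod := range pods {
		for _, n := range nodes {
			if p.feasible(pod, n) {
				fmt.Printf("workload cycle: %s -> %s\n", pod, n)
				break
			}
		}
	}
}

func (p profile) podCycle(pod string, nodes []string) {
	for _, n := range nodes {
		if p.feasible(pod, n) {
			fmt.Printf("pod cycle: %s -> %s\n", pod, n)
			return
		}
	}
}

func main() {
	prof := profile{filters: []filterPlugin{nodeResourcesFit{}}}
	prof.workloadCycle([]string{"rank-0", "rank-1"}, []string{"node-a"})
	prof.podCycle("rank-0", []string{"node-a"})
}
```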

Member

@wojtek-t wojtek-t left a comment

Added a few comments, but they are pretty localized - overall this is a great proposal, pretty aligned with how I was thinking about it too.

Can report dedicated logs and metrics with less confusion to the user.
* *Cons:* Significant and non-trivial architectural change to the scheduling queue
and `scheduleOne` loop.
<<[/UNRESOLVED]>>
Member

I would value more the opinions of people who have been more hands-on with the scheduler code recently than I have, but on paper the third alternative seems preferable to me:

  1. Alternative 1 - it will be extremely hard to reason about if pods sit in different queues (backoff, active, ...) independently, and the evolution point is extremely important IMHO - we know we will be evolving this a lot, and preparing for that is super important.

  2. Alternative 2 - I share the concerns about corner cases (pod deletion is exactly what I have in mind). Once we get to rescheduling cases (new pods appear but they should trigger removing and recreating some of the existing pods), it will get even more complicated, with even harder corner cases. Given that we know that rescheduling is super important, I'm also reluctant about it.

  3. Alternative 3 - It's clearly the cleanest option. The primary drawback, the need for non-trivial changes, is IMHO justified given we know that we will be evolving it significantly in the future.

So I have a quite strong preference for (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.

Member Author

That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I'm not seeing some significant drawbacks.

For the third alternative, do we want to introduce only a queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by the workload queue a structure that will hold all the pods related to workload scheduling?

Let's say that we add a Pod that is part of a group but does not make us meet the min count. Should it land in the unschedulable queue as it does right now? Or in some additional structure?

If the pod group fails scheduling, where do we put its pods? We need to have them somewhere, so we can check them against cluster events in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.

Member Author

@macsko macsko Dec 12, 2025

I think one way might be to introduce both:

  • a queue for pod groups (let's say podGroupsQ). This will contain the pod groups that can be popped for workload scheduling
  • a data structure for unscheduled pods (let's say workloadPods) that have to wait for workload scheduling to finish to proceed with their pod-by-pod cycles.

I imagine the pod transitions (for pods that have a workloadRef) would be:

pod is created -> when PreEnqueue passes for the pod, add it to workloadPods, otherwise add it to unschedulablePods -> [for a gang pod group] if >= minCount pods for the group are in workloadPods, the group is added to podGroupsQ -> the pod group is processed and the workload scheduling cycle is executed

When the workload cycle finishes successfully:

pod gets the NNN and is moved to the activeQ -> pod is popped and goes through its pod-by-pod scheduling cycle -> ...

When the workload cycle fails:

pod is moved to unschedulablePods, where it waits for a cluster change to happen -> when the change happens for this pod or another one from its pod group, the pod(s) are moved to workloadPods -> the workload manager detects that the group should be retried and adds the group to podGroupsQ -> processing continues as before
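A minimal sketch of the two structures proposed above; the names `podGroupsQ` and `workloadPods` follow the comment, everything else (types, methods) is hypothetical. It shows how a gang group becomes poppable for a Workload Scheduling Cycle once >= minCount of its pods have passed PreEnqueue.

```go
package main

import "fmt"

// Hypothetical structures mirroring the proposal above; not actual
// kube-scheduler code.
type podGroupKey struct{ workload, group string }

type workloadQueues struct {
	// workloadPods holds unscheduled gang pods that passed PreEnqueue and are
	// waiting for their group's Workload Scheduling Cycle.
	workloadPods map[podGroupKey][]string
	// podGroupsQ holds groups that are ready to be popped for workload scheduling.
	podGroupsQ []podGroupKey
	minCount   map[podGroupKey]int
}

// addPod is called after PreEnqueue passes for a gang pod. Once the group
// reaches minCount waiting pods, the group is enqueued for a workload cycle.
func (w *workloadQueues) addPod(key podGroupKey, podName string) {
	w.workloadPods[key] = append(w.workloadPods[key], podName)
	if len(w.workloadPods[key]) == w.minCount[key] {
		w.podGroupsQ = append(w.podGroupsQ, key)
	}
}

// pop takes the next group for a Workload Scheduling Cycle together with all
// of its currently waiting pods.
func (w *workloadQueues) pop() (podGroupKey, []string, bool) {
	if len(w.podGroupsQ) == 0 {
		return podGroupKey{}, nil, false
	}
	key := w.podGroupsQ[0]
	w.podGroupsQ = w.podGroupsQ[1:]
	pods := w.workloadPods[key]
	delete(w.workloadPods, key)
	return key, pods, true
}

func main() {
	q := &workloadQueues{
		workloadPods: map[podGroupKey][]string{},
		minCount:     map[podGroupKey]int{{"job-a", "workers"}: 2},
	}
	q.addPod(podGroupKey{"job-a", "workers"}, "w-0")
	q.addPod(podGroupKey{"job-a", "workers"}, "w-1")
	key, pods, ok := q.pop()
	fmt.Println(ok, key, pods)
}
```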

* If preemption is successful, the pod is nominated on the selected node.
* If preemption fails, the pod is considered unscheduled for this cycle.

The phase can effectively stop once `minCount` pods have a placement,
Member

From the POV of optimizing locally, we should continue scheduling all pods (and not stop after minCount) - what changes is that the unschedulability of further pods shouldn't make the whole group unschedulable.

Given that we have minCount at the whole PodGroup level, the local optimization is actually what we should probably do anyway (it would be a different story if we had a separate minCount per signature, but that's not the case).
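To make that concrete, a small sketch (with hypothetical helpers) of the behavior described above: keep trying to place every pod in the group, and treat the gang as schedulable when at least minCount pods got a placement, rather than stopping early or failing the whole group because an extra pod doesn't fit.

```go
package main

import "fmt"

// placeGroup keeps trying to place every pod in the group and judges the gang
// by whether at least minCount pods got a placement, instead of stopping as
// soon as minCount is reached. findNode stands in for the filtering/scoring path.
func placeGroup(pods []string, minCount int, findNode func(string) (string, bool)) (map[string]string, bool) {
	placements := map[string]string{}
	for _, p := range pods {
		if node, ok := findNode(p); ok {
			placements[p] = node
		}
		// Note: no early return once len(placements) == minCount; the extra pods
		// are scheduled opportunistically, and their failures don't fail the gang.
	}
	return placements, len(placements) >= minCount
}

func main() {
	// Toy feasibility function: w-1 cannot be placed anywhere.
	findNode := func(pod string) (string, bool) {
		if pod == "w-1" {
			return "", false
		}
		return "node-a", true
	}
	placements, ok := placeGroup([]string{"w-0", "w-1", "w-2"}, 2, findNode)
	fmt.Println(ok, placements) // true map[w-0:node-a w-2:node-a]
}
```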

gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`).
```md
<<[UNRESOLVED Pod-by-pod cycle preemption]>>
Should gang pods be allowed to preempt anything in their pod-by-pod cycles?
Member

It makes it much harder to reason about. So while we eventually may need it, I would start with saying "no" :)


For pod groups using the `Basic` policy, the Workload Scheduling Cycle is
optional. In the `Beta` timeframe, we may opportunistically apply this cycle to
`Basic` pod groups to leverage the batching performance benefits, but the
Member

+1

@wojtek-t wojtek-t self-assigned this Dec 11, 2025
Comment on lines 423 to 424
// +oneOf=PolicySelection
Basic *BasicSchedulingPolicy
Member

IIRC, we discussed removing the Basic policy for now. Did we decide to keep it?

Member

I'm pretty convinced we need that - we want to be able to represent an arbitrary workload with the Workload structure, and we need some kind of "default" policy. My thoughts are here:
kubernetes/kubernetes#135179 (comment)

But I agree we need to explain that more clearly in the KEP.
@macsko

Member Author

I'm going to add a new section about extending the basic policy

`Basic` pod groups to leverage the batching performance benefits, but the
"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to
schedule as many pods from such PodGroup as possible.

Member

I thought I had added this comment, but apparently I forgot.

Maciek - can you please extend this subsection with how we're going to approach the usefulness of BasicPolicy - in particular the thoughts from here:
kubernetes/kubernetes#135179 (comment)

[feel free to propose an alternative approach, but we basically want to address the concerns mentioned there]

@macsko macsko force-pushed the gang_scheduling_beta branch from a63f0ef to 71ef75a Compare December 12, 2025 12:23
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: macsko
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2025
@macsko macsko force-pushed the gang_scheduling_beta branch from 71ef75a to 9ce3fc7 Compare December 12, 2025 12:23
@macsko macsko force-pushed the gang_scheduling_beta branch from 9ce3fc7 to eae3ddb Compare December 12, 2025 15:30
@k8s-ci-robot
Contributor

@macsko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-enhancements-verify
Commit: eae3ddb
Details: link
Required: true
Rerun command: /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
