Conversation

@wojtek-t (Member)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 28, 2025

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 672aa68 to ce04eca Compare December 1, 2025 08:21
@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from ce04eca to 0ff3958 Compare December 1, 2025 08:52
Comment on lines 385 to 387
1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
@44past4 (Contributor) commented on Dec 1, 2025:

Having two independent priorities for a workload (one for scheduling and one for preemption), or a single preemption priority which can be dynamically updated, can potentially lead to a cycle in the preemption.

Let's assume that we have an existing workload A with high scheduling priority and low preemption priority running in a cluster.

Now let's assume that we want to schedule a workload B which has medium scheduling priority and medium preemption priority.

Workload B will preempt workload A and will start to run because its scheduling priority > preemption priority of the workload A.

However, when workload A restarts and is rescheduled, it will preempt workload B and start to run, because its scheduling priority > preemption priority of workload B.

The same issue can happen if we have only one priority but this priority is reduced while the workload is running. After preemption, when the workload reappears with its original higher priority, it can preempt the workload which preempted it.

Contributor:

One potential solution / mitigation to the described problem could be stating that preemption priority >= scheduling priority. This way, after restarting, the preempted workload will not be able to preempt the preemptor workload.
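
For illustration, a minimal sketch (hypothetical type and field names, not the KEP's API) of how that invariant could be enforced at validation time:

```go
package main

import "fmt"

// WorkloadPriorities is a hypothetical, simplified view of a workload's two
// priorities, used only to illustrate the invariant discussed above.
type WorkloadPriorities struct {
	SchedulingPriority int32
	PreemptionPriority int32
}

// validate rejects configurations that could reintroduce the preemption cycle:
// the preemption priority must be at least the scheduling priority.
func validate(w WorkloadPriorities) error {
	if w.PreemptionPriority < w.SchedulingPriority {
		return fmt.Errorf("preemption priority (%d) must be >= scheduling priority (%d)",
			w.PreemptionPriority, w.SchedulingPriority)
	}
	return nil
}

func main() {
	// Workload A from the example: high scheduling priority, low preemption priority.
	fmt.Println(validate(WorkloadPriorities{SchedulingPriority: 1000, PreemptionPriority: 10}))
}
```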

Member Author:

Thanks for pointing that out!

Yeah - "preemption priority >= scheduling priority" is definitely desired. I don't think we have any use cases that would benefit from the reverse.

That said, I need to think a bit more about whether that is enough. I think it prevents the cycles if we assume static priorities, but it can still potentially trigger cycles if the priorities change. OTOH, if the priorities are changing, this is probably desired.

Let me think about it a bit more and I will update the KEP to reflect the thoughts later this week.

Member Author:

OK - I have added an unresolved section about that to the Workload priorities section above describing the problem, potential solution and alternatives. Let's continue the discussion there.

@sanposhiho (Member)

/assign

@erictune (Contributor) left a comment:

Great to see this, and I like how it is decoupled from the other work planned for 1.36.

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor:

Let's assume that nodes have either a high pod-per-node count or a low pod-per-node count. It's a bimodal distribution.

Let's further assume that if Gang scheduling is used, then the node is going to usually be low pod-per-node count.

So, then we can do the following:

  1. Individual Pod as preemptor - assume a high pod-per-node count, use the current algorithm, which is optimized for many pods per node, and consider all victims.
  2. Gang as preemptor - assume a low pod-per-node count in all cases, consider a maximum of e.g. 4 reprieves per node to keep compute time down, and just stop reprieving when there are more things on the node.

Member Author:

Every split in the algorithm/code path makes it harder to reason about. This is why I'm trying to avoid that whenever possible.

Additionally, while I agree with you that it will be true in the majority of cases, there are definitely use cases where people run gang workloads with many pods per node. So in my opinion the split as proposed could potentially result in decisions that are really far from the optimal ones.

In the spirit of trying to simplify and unify things as much as possible, I actually adjusted the algorithm so that we can have a single scheme that addresses all four use cases that we have. I think this is a much better option.

PTAL

Contributor:

LGTM

@github-project-automation github-project-automation bot moved this to Needs Review in SIG Scheduling Dec 2, 2025
@xigang (Member) commented on Dec 3, 2025:

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang December 3, 2025 00:55

```
<<[UNRESOLVED delayed preemption]>>
Should we leave it as part of this KEP or should this be moved to the Gang-Scheduling one?
Comment:

I believe that it should be moved to another KEP; I feel that it is completely independent of workload-aware preemption and can work just with the current preemption + gang scheduling.

Member Author:

The rationale behind having it here was that it also serves the goal of "reducing disruptions".
I think there are two primary options:

  • keep it here and reference from "workload KEP"
  • move it to "workload KEP" and reference from here

I'm happy with either option, based on the preference of the majority.

1. From remaining potential victims, we start to reprieve pods starting from the highest priority
and working down until the set of remaining victims still keeps the node feasible.

Once we compute the feasibility and list of victims for all nodes, we score that and choose the
Comment:

Nit: it's possible that we will not do that for all nodes in the cluster. We find feasible nodes only until we have max(numNodes * 0.1, 100) candidates to choose from: https://github.com/kubernetes/kubernetes/blob/ec1bf8a4f3a5f054065225dc8275c66b93310d17/pkg/scheduler/framework/preemption/preemption.go#L363-L364
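
For readers of this thread, a tiny illustrative sketch of that cap; the 10%-of-nodes and 100-node defaults mirror the comment above, while the function name is hypothetical:

```go
package main

import "fmt"

// candidateNodeLimit approximates the candidate cap described above:
// 10% of nodes, but at least 100, and never more than the cluster size.
func candidateNodeLimit(numNodes int) int {
	n := numNodes / 10
	if n < 100 {
		n = 100
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	for _, nodes := range []int{50, 500, 5000} {
		fmt.Printf("%d nodes -> evaluate at most %d candidates\n", nodes, candidateNodeLimit(nodes))
	}
}
```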

Member Author:

Good catch - updated (although I don't think it changes anything for this particular proposal).

Comment:

Probably not for the initial implementation, but it's worth keeping in mind once we look into the scalability of workload preemption.

- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W

1. If removing all the potential victims would not make the new workload W schedulable,
Comment:

I think we should point out that this depends on workload-aware scheduling, which is not yet implemented and is planned for 1.36.

1. If removing all the potential victims would not make the new workload W schedulable,
the workload is unschedulable even with preemption.

```
Comment:

Nit: you need to indent this "code block" to keep the numbering continuous.


1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Member:

What if there are a workload and an individual pod, where preempting only one of them is needed to make the new workload schedulable? Which one will be chosen?

Member Author:

I think we should choose the pod, but I don't have a super strong preference. I added a point about sorting to reflect that, but I'm happy to take any suggestions there.

Comment:

I guess if they have the same priority then: single pod > pod from a workload with gang preemptable = false > workload with gang preemptable = true?
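
A minimal sketch of that suggested tie-breaking order (hypothetical types; not the KEP's final sorting rule):

```go
package main

import (
	"fmt"
	"sort"
)

// victim is a hypothetical descriptor, only for illustrating the suggested ordering.
type victim struct {
	name           string
	priority       int32
	inWorkload     bool // part of a workload (pod group)?
	gangPreemption bool // whole gang must be preempted together?
}

// rank is the secondary sort key for equal-priority victims:
// single pod < pod from a non-gang-preemptable workload < gang-preemptable workload.
func rank(v victim) int {
	switch {
	case !v.inWorkload:
		return 0
	case !v.gangPreemption:
		return 1
	default:
		return 2
	}
}

func main() {
	victims := []victim{
		{"gang-member", 5, true, true},
		{"single-pod", 5, false, false},
		{"workload-pod", 5, true, false},
	}
	sort.Slice(victims, func(i, j int) bool {
		if victims[i].priority != victims[j].priority {
			return victims[i].priority < victims[j].priority // lower priority preempted first
		}
		return rank(victims[i]) < rank(victims[j])
	})
	fmt.Println(victims) // single-pod, workload-pod, gang-member
}
```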

Comment on lines +478 to +484
1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
`WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
`GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
already available to use. The implementation will work similarly to `WaitOnPermit` to
ensure that `GetResources` was executed for all pods from within a `PodGroup`.
Member:

How will the preemption targets be released if we end up not running the RunGetResourcesPlugins? For example, when a gang turns out to be unschedulable.

Member Author:

That's a very good question. I think we want something conceptually similar to the "Reserve/Unreserve" pattern from DRA.

So the scheduling phase will effectively serve as a "reserve" phase, and we will have a sibling "unschedule" method that will be able to re-assume the victims.

It requires some description though.
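
A rough sketch of what such a paired extension point could look like, loosely mirroring the Reserve/Unreserve pattern; the interface name, stand-in types, and signatures below are illustrative assumptions, not the KEP's final framework API:

```go
// Illustrative only: names and signatures are assumptions, not the KEP's final API.
package framework

import "context"

// Status is a stand-in for the scheduler framework's status type.
type Status struct {
	Code    int
	Message string
}

// PodInfo is a stand-in for the scheduler's pod wrapper.
type PodInfo struct{ Name string }

// GetResourcesPlugin would be invoked after WaitOnPermit and before PreBind,
// e.g. to actuate (or wait for) preemptions nominated for this pod's gang.
type GetResourcesPlugin interface {
	// GetResources acquires the resources the pod needs (e.g. triggers victim deletion).
	GetResources(ctx context.Context, pod *PodInfo) *Status
	// ReleaseResources is the "unreserve"-style counterpart: it undoes the
	// reservation (e.g. re-assumes victims) if the gang turns out unschedulable.
	ReleaseResources(ctx context.Context, pod *PodInfo)
}
```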

We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
becomes a challenge, thus we modify to the approach below.

To check if a workload W can be scheduled on a given cluster with preemption we:
Member:

Shouldn't we talk about a "gang pod group" rather than a "workload"?

Member Author:

I don't have a strong opinion here - let me change it.

@Argh4k commented on Dec 4, 2025:

Do we want to add as part of this KEP a description of how the preemption fits into workload aware scheduling (codewise)? Or do we want to have it the other way around, and have the KEP for workload aware scheduling reference this one when talking about preemption?

In the gang scheduling KEP we talk about adding a "Workload" phase where we will end up with pods from the gang having nominated node names. I assume that this preemption will be a part of that phase. The open question is what the outcome of the preemption will actually be:

  • will the workload preemption trigger the preemption, counting on delayed preemption to actuate it
  • will the workload preemption mark pods for preemption, with the trigger done by the current preemption in the pod PostFilter? This is actually my preferred option, as it will also take into consideration changes that happened in the cluster between the workload scheduling cycle and pod scheduling.
  • something else?


As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
preemptions. However, this is not true for the current gang scheduling implementation.
In the current implementation, preemption is triggered in the `PostFilter`. However, it's entirely
Comment:

So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that in this doc we could actually describe why we need it in terms of workload preemption - IIUC this is to have an option to run workload preemption as part of workload scheduling without immediately actuating the preemptions.

I also added this in the PR discussion: I think it would be beneficial to have a section on what the outcome of workload preemption will be and, if it does not actuate the preemptions, what actually will.

Member Author:

> So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that in this doc we could actually describe why we need it in terms of workload preemption - IIUC this is to have an option to run workload preemption as part of workload scheduling without immediately actuating the preemptions.

Great point - I updated this paragraph to reflect that.

> I also added this in the PR discussion: I think it would be beneficial to have a section on what the outcome of workload preemption will be and, if it does not actuate the preemptions, what actually will.

I hope that an updated gang scheduling KEP describing the workload scheduling phase will be opened pretty soon, and I will be able to just link to it here :)
@macsko ^^

1. New field in the workload object (delayed preemption will not bring much value in
case of scheduling individual pods, though there would be significant benefit from
unification, so probably this isn't ideal option).
1. Storing it in private kube-scheduler' structures (PodInfo for individual pods and
Comment:

This does not allow external schedulers to use the same concept for victims nomination.

Member Author:

I would like to keep external schedulers out of scope for now - added explicitly to the non-goals section.

@wojtek-t (Member Author) left a comment:

I tried to address most of the comments, I will try to respond/address the remaining ones later today/tomorrow.


- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
if it gets recreated, it will preempt workload C causing unnecessary cascading preemption.
This is the reason why a cost-based model was discarded.
Member Author:

@erictune - I thought a bit more about the idea of "preemption priority" vs "preemption cost" that we chatted about offline.
I acknowledge the deficiencies of the currently proposed model, but I think that switching to preemption cost and a purely scoring-based approach will not prevent cascading preemptions, which we should really try to avoid.

I tried to update the KEP to reflect that - PTAL and I'm happy to chat more about it.

@dom4ha (Member) left a comment:

Overall this is what I was thinking of as well. The major change that this approach brings is that we can no longer say which pod determines the victims, but rather which Workload/PodGroup determines them.

<<[/UNRESOLVED]>>
```

1. For remaining potential victims, using binary search across priorities find the minimal priority P
Member:

I think we need to add one more step before we could consider this feature beta.

Once we've identified the minimum priority P, we should reschedule all victims again and leave those which were in fact not affected. I can't imagine us preempting workloads that are in fact not affected. So we need to introduce a concept of "workload rescheduling", with a basic implementation that just checks whether a workload fits its current place or not.

There is another step which we could indeed consider an optimization (not a beta blocker). We can do a reversed binary search over workloads at the same priority (we need some secondary workload-importance order) and try to schedule the new workload while leaving as many existing workloads as possible.

Comment:

One note: I think in the scheduler codebase and this KEP we use the term reprieve to say that we want to keep a pod nominated for preemption in its original place. In my mind, rescheduling would mean trying to find another place for this pod.
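
For concreteness, one way to read the binary-search step under discussion, assuming feasibility is monotone in the priority cutoff; the helper name and the `fits` callback are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

// lowestSufficientCutoff finds the smallest distinct victim priority C such that
// preempting every potential victim with priority <= C makes the new workload
// schedulable. fits(C) is assumed monotone in C, which is what enables binary search.
// The second return value is false if even preempting all potential victims is not enough.
func lowestSufficientCutoff(priorities []int32, fits func(cutoff int32) bool) (int32, bool) {
	sort.Slice(priorities, func(i, j int) bool { return priorities[i] < priorities[j] })
	idx := sort.Search(len(priorities), func(i int) bool { return fits(priorities[i]) })
	if idx == len(priorities) {
		return 0, false
	}
	return priorities[idx], true
}

func main() {
	// Toy example: the workload fits once everything at priority 20 and below is gone.
	fits := func(cutoff int32) bool { return cutoff >= 20 }
	fmt.Println(lowestSufficientCutoff([]int32{5, 10, 20, 30}, fits)) // 20 true
}
```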

The following algorithm is by far not optimal, but is simple to reason about and I would suggest it as
a starting point:
- assume that all potential victims on the list are removed and schedule the new workload W
- go over the remaining potential victims starting from the highest priority and check if these can
Member:

I'd do it only once we have determined the minimum priority P, and then try to re-schedule existing workloads, but not in every iteration. See my separate comment about it.

Comment:

I believe Wojtek and I agreed that this was not an alternative but an additional enhancement to the original algorithm. After updates to the KEP, it's in the algorithm.

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 5758be0 to 4293d98 Compare December 4, 2025 14:30
@wojtek-t (Member Author) left a comment:

I have restructured the preemption algorithm here - I think that unifies and simplifies a lot of things.


@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 4293d98 to b24a962 Compare December 4, 2025 14:55
@erictune (Contributor) left a comment:

LGTM

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor:

LGTM

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erictune, wojtek-t
Once this PR has been reviewed and has the lgtm label, please ask for approval from sanposhiho. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich (Member) left a comment:

Thanks @wojtek-t, overall looks great!
I left a few questions.

- Define the scheduler changes needed to implement workload-aware preemption
- Provide full backward compatibility for all existing scheduling features

### Non-Goals
Member:

What about partial preemption of a Workload?
I would imagine that with the DependsOn API in JobSet, this is something we should talk about at some point.
E.g. supporting Argo workflows in Kueue: kubernetes-sigs/kueue#74

cc @kannon92 @tenzen-y @mimowo

Member Author:

Can you clarify what exactly you mean here?

  1. We definitely don't want to require the whole gang to always be preempted together - this should be optional, which is already reflected in the proposal.
  2. We don't yet want to allow arbitrary granularity, though I wouldn't exclude defining lower granularity units later.

But I'm not sure if either of those is actually what you're asking about with this comment.

Member:

If in the future we allow users to preempt a group of pods from a gang, how will that work? Will we introduce a new API for that?

Comment on lines 264 to 287
type GangSchedulingPolicy struct {
// Existing field(s).

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
}
Member:

Shall we try to design an API that is future-proof?
What if in the future we allow partially preempting a group of pods from a gang for elastic training?

Member Author:

Here is how I was thinking about it:

  • at some point we will introduce "PodSubGroup" as it was described in the original gang scheduling doc: https://tiny.cc/hvhs001 (whatever the name will be)
  • at this point, PodSubGroup may actually become the preemption unit
  • we will have a corresponding boolean flag at the level of PodSubGroup at this point
  • if you want PodSubGroup to be the preemption unit, you will set that field instead of setting it here

The above model will be compatible with this addition.

If that doesn't address your usecase, can you please explain your usecase in more detail?

Member:

PodSubGroup sounds great! Do we have any tentative API design for it?
Are we planning to introduce this as part of PodGroup API object?

type PodGroup struct {
    Name *string
     ...

    PodSubGroups []PodSubGroup
}

I am also curious: what if in the future someone wants to preempt multiple PodSubGroups within a single PodGroup?

Just as an idea, we could introduce a PreemptionPolicy API which can describe such groups:

type Workload struct {
	PreemptionPolicy *PreemptionPolicy
}

type PreemptionPolicy struct {
	PriorityClassName           *string
	PreemptionPriorityClassName *string
	TargetPodGroups             []PreemptionGroup
}

type PreemptionGroup struct {
	// Name of the group.
	Name string

	// Target PodGroup or PodSubGroup name to be preempted together.
	TargetPodGroup []string
}

Member Author:

> PodSubGroup sounds great! Do we have any tentative API design for it?

It was described in the original doc as future extensions:
https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?tab=t.3pkx7y4zvho2
(you're a co-author there :) )

Regarding PreemptionPolicy - the reason why I didn't go with that is that I believe it doesn't make sense if we aren't gang-scheduling (so it only makes sense in gang-scheduling mode). I can't imagine any use case where the preemption unit is larger than the scheduling unit - and without a gang policy the scheduling unit is an individual pod.

I'm happy to adjust the API, but we need to somehow take that into account, and cross-field validations are always more confusing to users.

Member:

Oh, you are right!

> I can't imagine any use case where the preemption unit is larger than the scheduling unit

That is a good point; however, how can we preempt multiple gangs within the workload together?

Let's say our Workload is a workflow that contains multiple steps; some steps use a gang policy, some don't.
Additionally, users might want to preempt only the desired steps from the workflow.

How would preemption work in that case?

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
}
Comment:

Have we considered whether something similar to preemptionPolicy: Never makes sense for Workloads? Do we know whether there are use cases for a workload that should just wait for space in the cluster without preempting other pods/workloads, while still requiring the whole gang to start at once?

Member Author:

Yes - I can definitely imagine use cases where it makes sense (CI workloads, as an example).

But this is also the power of not reinventing the concept of priority from scratch and using the existing PriorityClasses - by using them at the workload level, we effectively get all of their features roughly for free.
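
For reference, the existing PriorityClass API already expresses the non-preempting case; a small sketch using the standard scheduling/v1 types (the class name and value here are made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nonPreemptingClass builds a PriorityClass whose pods (e.g. CI jobs) wait for
// free capacity instead of preempting others, via PreemptionPolicy: Never.
func nonPreemptingClass() *schedulingv1.PriorityClass {
	never := corev1.PreemptNever
	return &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "ci-batch"},
		Value:            1000,
		Description:      "Waits for capacity; never preempts other pods.",
		PreemptionPolicy: &never,
	}
}

func main() {
	fmt.Println(nonPreemptingClass().Name)
}
```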

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from b24a962 to 873a281 Compare December 5, 2025 13:04
we want to support and is compatible with the current pod-based preemption algorithm. This means
we will be able to achieve in-place replacement with relatively localized changes.

### Delayed preemption
@sanposhiho (Member) commented on Dec 7, 2025:

Actually, it would make one big difference from our current preemption: what if the victims' deletion takes time and meanwhile other places become available for this workload?
Today, a preemptor pod doesn't wait for victim pod deletion(s) to be completed. That helps: if other places become available in the meantime, the preemptor pod can still be scheduled there.
The time to complete the deletion could be a lot longer/worse when it comes to a victim workload, because a victim workload could contain thousands of pods. (Also, it's typical for ML clusters that victim pods have to do something fancy (checkpointing etc.) at termination.)
The current proposal looks like a preemptor pod is just going to be blocked in WaitForGetResources? But is it really ideal that a high-priority preemptor workload might have to wait a long time for all victim pods to be deleted, while there might be new empty space in the cluster where the preemptor workload could actually be scheduled?

Member Author:

This is a great point - I thought about it in the past, but later completely forgot about it.

One point that I would phrase differently: I'm less concerned about a workload consisting of multiple pods. That is true, but on its own it generally will not be the primary factor for problems.
The primary factor most often will be the grace period, and that is a problem independently of whether we're preempting a whole workload or just an individual pod.

When I thought about it in the past, the only reasonable answer I came up with was:

  1. let's introduce a timeout, and if all the victims are not preempted within that timeout, we return all the pods back to the scheduling queue
  2. However, it should not result in clearing the nominatedNodeName for these - in other words, we continue to assume that they are still waiting for preemptions to happen and will be scheduled there
  3. in this case we try to schedule them again - if we can schedule them without preemption, we go for it; if preemption is still required, we actually don't change it

I think that works on paper and seems good enough conceptually, but the question is whether it doesn't break some implementation assumptions that I'm not aware of.
@sanposhiho @macsko @dom4ha @tosi3k
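
A purely illustrative sketch of the timeout idea from the numbered list above; all names are hypothetical and this is not kube-scheduler code:

```go
package preemption

import (
	"context"
	"time"
)

// waitForVictims blocks until all nominated victims are gone, or until a timeout,
// in which case the gang is returned to the scheduling queue while keeping its
// nominated nodes (so the in-flight preemptions are not forgotten and not re-triggered).
func waitForVictims(ctx context.Context, victimsGone <-chan struct{}, timeout time.Duration,
	requeue func(keepNominatedNodes bool)) {
	select {
	case <-victimsGone:
		// All victims were preempted in time; proceed towards binding.
	case <-time.After(timeout):
		// Victims still terminating (e.g. long grace periods): re-queue the gang,
		// but keep nominatedNodeName so a later scheduling attempt can either find
		// new free space or keep waiting for the same victims.
		requeue(true)
	case <-ctx.Done():
		requeue(false)
	}
}
```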

Comment on lines 271 to 272
// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
Member:

The comment is unclear: what does it mean if it's false? Can the group be preempted partially, or can it not be preempted at all?

Member Author:

The switch to enum should make it cleaner now.


// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
Member:

I feel like we should consider an enum here instead of a bool, like preemptionPolicy: gang || never, and we can add new value(s) later, e.g., partially.

Member Author:

I wouldn't introduce Never here explicitly - we can use PriorityClass for that where we are already able to represent these kinds of concepts:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/types.go#L2794-L2801

But overall, the comment about switching to an enum makes sense to me. For now we will have:
individual pod (default) and gang (better names needed), and eventually we can extend it further.

I will adjust that.

Member Author:

Switched to enum.
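
For illustration only, the enum variant might look roughly like this; the type and value names are assumptions based on the "individual pod (default) / gang" discussion above, not the final KEP API:

```go
// Illustrative sketch only; names are assumptions, not the final KEP API.
package api

// PreemptionMode controls the unit of preemption for a pod group.
type PreemptionMode string

const (
	// PreemptionModePod (the assumed default): members of the group can be
	// preempted individually, as with today's pod-level preemption.
	PreemptionModePod PreemptionMode = "Pod"
	// PreemptionModeGang: the whole group is preempted in an all-or-nothing fashion.
	PreemptionModeGang PreemptionMode = "Gang"
)

type GangSchedulingPolicy struct {
	// Existing field(s).

	// PreemptionMode defines how members of this group are preempted.
	// Defaults to PreemptionModePod when unset.
	PreemptionMode *PreemptionMode
}
```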

- it needs to preempt one of the already running workloads
- workload B has scheduling priority `med` but preemption cost `low`
- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
Member:

Flyby comment: wouldn't it be sufficient to target the lowest-priority workloads possible and use cost only as a tiebreaker?

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 873a281 to 6a2cc3f Compare December 12, 2025 13:16