
Add requiredDuringSchedulingRequiredDuringExecution to ClusterResourcePlacement affinity #715

Open
nojnhuh opened this issue Mar 7, 2024 · 12 comments

Comments

@nojnhuh (Member) commented Mar 7, 2024

In ClusterResourcePlacement's affinity definitions, adding requiredDuringSchedulingRequiredDuringExecution would enable the scheduler to react to underlying changes to a member cluster over time that affect its ability to run certain workloads.

One concrete use case might be to ensure that workloads only run on clusters that contain GPU nodes. As nodes are added to and removed from a cluster, whether or not any GPU nodes exist in a cluster may change over time. As a cluster operator detects these changes and updates some label on the member clusters to indicate whether or not GPU nodes are available, Fleet would automatically reschedule workloads that require GPU nodes onto a different member cluster.
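
To make the use case concrete, here is a minimal Go sketch of how the proposed field could sit next to the existing one in the placement API. The type and field names below are my approximation of the existing ClusterAffinity shape plus the proposed field, not the actual Fleet API definition.

```go
// Hypothetical sketch only: mirrors the shape of the existing
// requiredDuringSchedulingIgnoredDuringExecution field and adds the proposed
// requiredDuringSchedulingRequiredDuringExecution next to it.
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterSelectorTerm selects member clusters by their labels,
// e.g. matchLabels: {"gpu": "true"}.
type ClusterSelectorTerm struct {
	LabelSelector metav1.LabelSelector `json:"labelSelector"`
}

// ClusterSelector is a list of selector terms; a cluster matches if any term matches.
type ClusterSelector struct {
	ClusterSelectorTerms []ClusterSelectorTerm `json:"clusterSelectorTerms"`
}

// ClusterAffinity describes the cluster affinity scheduling rules for a placement.
type ClusterAffinity struct {
	// Existing field: enforced only at scheduling time; a placed workload stays
	// put even if the cluster's labels later stop matching.
	RequiredDuringSchedulingIgnoredDuringExecution *ClusterSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`

	// Proposed field: also enforced during execution; if the cluster's labels
	// change (e.g. the GPU label is removed when the last GPU node goes away),
	// the scheduler would eventually move the workload to another matching cluster.
	RequiredDuringSchedulingRequiredDuringExecution *ClusterSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
}
```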

@nojnhuh (Member, Author) commented Mar 8, 2024

@ryanzhang-oss I'm starting to dig into this so if you could please assign me to this issue I'd appreciate it!

@ryanzhang-oss ryanzhang-oss assigned nojnhuh and unassigned nojnhuh Mar 12, 2024
@ryanzhang-oss (Contributor) commented

@nojnhuh Even k8s does not support requiredDuringSchedulingRequiredDuringExecution, so I wonder why we would want to support it. Also, what does "requiredDuringSchedulingRequiredDuringExecution" mean semantically?

@nojnhuh (Member, Author) commented Mar 12, 2024

This would mean the same thing as the placeholder definition for the same field for placing a Pod on a Node, but for scheduling workloads onto clusters: https://github.com/kubernetes/kubernetes/blob/634fc1b4836b3a500e0d715d71633ff67690526a/staging/src/k8s.io/api/core/v1/types.go#L3449-L3456

// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to an update), the system
// will try to eventually evict the pod from its node.

This would help with the use case I outlined above, where conditions on a member cluster change such that it's no longer suitable to run certain workloads. Then Fleet can reschedule affected workloads without relying on a change to the ClusterResourcePlacement to trigger the reschedule.

@ryanzhang-oss (Contributor) commented Mar 12, 2024

Thanks @nojnhuh.

Just to clarify, there are two cases. Scheduling a workload to a cluster where GPUs have newly become available is actually a requiredDuringScheduling case, since the workload is not scheduled if no GPU cluster is available. The workload will be scheduled to a cluster automatically when we detect that GPUs have been added to the cluster. This is already supported in Fleet today.

On the flip side, when a workload is already running in a cluster, we don't evict it unless the cluster is deleted, which is the same behavior as k8s. I think there is a reason why k8s never implemented that feature. The main reason is that continuously trying to reschedule all workloads would add a huge load to our scheduler, which is the performance bottleneck. Since we haven't received any feature requests for this from our customers, we don't think the benefit outweighs the large performance hit.

We can revisit this if there are strong use cases coming from customers, and even then, I suspect we would need to scope down the semantics to ensure performance.

@jackfrancis (Member) commented

There is an active KEP in upstream k8s to solve for RequiredDuringSchedulingRequiredDuringExecution, and the intent to solve for this is longstanding.

Additionally, the widely used descheduler project implements this as well, for folks who have needed this functionality prior to its landing in k/k.

> The main reason is that continuously trying to reschedule all workloads would add a huge load to our scheduler, which is the performance bottleneck.

The above is a true statement. We wouldn't want to continuously reschedule. Rather, we would want to continuously determine "do I need to reschedule?", which would look something like (1) ensuring that the ClusterResourcePlacement status is current and reflects the underlying state of the scheduled resources, and (2) introspecting that status and engaging a reschedule trigger when the declared goal state (e.g., Running) remains unrealized beyond some configurable TTL.
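
As a rough sketch of that check (the status shape, the "running" signal, and the TTL handling below are assumptions for illustration, not the actual ClusterResourcePlacement status schema):

```go
// Sketch of the "do I need to reschedule?" determination: introspect
// already-collected per-cluster placement status and flag clusters where the
// declared goal state has been unrealized for longer than a configurable TTL.
// This never runs the scheduler itself; it only decides whether to trigger it.
package reschedule

import "time"

// PlacementStatus is an illustrative stand-in for per-cluster placement status.
type PlacementStatus struct {
	ClusterName        string
	GoalStateReached   bool      // e.g. the placed workload reports Running
	LastTransitionTime time.Time // when this status last changed
}

// clustersNeedingReschedule returns the clusters whose workloads have been out
// of goal state for longer than the TTL and therefore warrant a reschedule.
func clustersNeedingReschedule(statuses []PlacementStatus, ttl time.Duration, now time.Time) []string {
	var stale []string
	for _, s := range statuses {
		if !s.GoalStateReached && now.Sub(s.LastTransitionTime) > ttl {
			stale = append(stale, s.ClusterName)
		}
	}
	return stale
}
```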

I would like to be both a customer and implementer of this in fleet, so it makes sense to me to keep the issue open as a reference for the resultant PR.

@ryanzhang-oss ryanzhang-oss reopened this Mar 12, 2024
@ryanzhang-oss (Contributor) commented Mar 12, 2024

Thanks, Jack. I am keeping this issue open. However, I don't think there is a way to determine "do I need to reschedule" without actually scheduling it. Also, just continuously "determining" is already a huge cost.

IMO, the right way to solve this problem is with a descheduler rather than inside the scheduler. We are already planning for a descheduler.

In any case, we would like to see a design first before moving forward with any code change.

@ryanzhang-oss (Contributor) commented

> - Add `node-affinity-eviction` controller to ensure pods being evicted if the selectors are no longer met at runtime.

This is a "descheduler" to me.

@jackfrancis (Member) commented

Thx for re-opening!

> However, I don't think there is a way to determine "do I need to reschedule" without actually scheduling it.

This is the way:

  1. A discrete actor performs the descheduling on a particular cluster (either the standard descheduler does it, or in the future, if k/k itself supports it then it would do it).
  2. A multi-cluster actor (e.g., ClusterResourcePlacement) looks for scheduled workloads that are "not running" as the trigger for "do I need to reschedule?".

The multi-cluster actor does not need to actually schedule anything in order to determine whether a reschedule is needed. It simply needs to be aware of the delta between its desired goal state (this workload is operational on cluster XYZ) and the actual state (this workload is stuck Pending on cluster XYZ). When such a delta is observed, the entire E2E multi-cluster scheduling operation kicks in, with the new nuance that cluster XYZ is no longer considered as a target cluster for scheduling (we already know the workload doesn't run there).
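
A small sketch of that last nuance, assuming the check above has already produced the set of clusters where the workload is known to be stuck (the names here are illustrative, not Fleet scheduler internals):

```go
// When the reschedule trigger fires, the failing cluster is removed from the
// candidate set before the normal multi-cluster scheduling pass runs again.
package reschedule

// candidateClusters filters out clusters where the workload was already
// observed stuck, so the fresh scheduling pass does not pick them again.
func candidateClusters(allClusters []string, stuck map[string]bool) []string {
	var candidates []string
	for _, c := range allClusters {
		if !stuck[c] {
			candidates = append(candidates, c)
		}
	}
	return candidates
}
```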

@ryanzhang-oss (Contributor) commented Mar 13, 2024

So, if I understand correctly,

  1. We need a single-cluster descheduler first, which is out of the scope of this project.
  2. We need a way for the multi-cluster agent to know what "running" means for any resource.

I wonder how you would solve the second part?

In addition, the second part is actually already covered by the advanced rollout feature, as we will provide options for customers when we detect that the placed resources are not in the goal state. Currently, we don't plan to offer a "reschedule" option, but that's not hard to add.
The hard part is actually determining what "running" means for an arbitrary resource. I don't see any way other than providing a "hook" for users to tell us, but that's a fairly involved process. We are working on the UX.
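
As an illustration of why this is hard, here is a hedged sketch of the kind of per-kind availability check involved; it only knows how to interpret a couple of built-in types, which is exactly why arbitrary resources would need a user-provided hook. This is not Fleet's actual implementation.

```go
// Per-kind notion of "running": each built-in type needs its own rule, and an
// arbitrary custom resource has no obvious rule at all.
package reschedule

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable treats a Deployment as "running" once all desired
// replicas report ready.
func deploymentAvailable(d *appsv1.Deployment) bool {
	desired := int32(1) // Kubernetes defaults spec.replicas to 1 when unset
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return d.Status.ReadyReplicas >= desired
}

// serviceAvailable: a plain ClusterIP Service has no readiness signal beyond
// existing, which already shows how kind-specific (and ambiguous) "running" is.
func serviceAvailable(_ *corev1.Service) bool {
	return true
}
```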

@zhiying-lin (Contributor) commented

The current KEP-4329 has not been approved yet. I have the same question listed in https://github.com/kubernetes/enhancements/pull/4329/files#r1478023120: I'm not sure what the benefit is of adding this to node affinity instead of using the descheduler, or what the boundary is between the two. Perhaps we can hold until the SIG reaches a conclusion?

@jackfrancis (Member) commented

> In addition, the second part is actually already covered by the advanced rollout feature, as we will provide options for customers when we detect that the placed resources are not in the goal state. Currently, we don't plan to offer a "reschedule" option, but that's not hard to add.

Cool, are there PRs implementing "advanced rollout"?

@zhiying-lin (Contributor) commented

https://github.com/Azure/fleet/pull/689/files is the one where we support checking the availability of native resources. More PRs are coming.
