-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4885: Windows CPU and Memory Affinity #4888
base: master
Are you sure you want to change the base?
KEP-4885: Windows CPU and Memory Affinity #4888
Conversation
jsturtevant
commented
Sep 30, 2024
- One-line PR description: Add CPU and memory Affinity support to Windows nodes with CPUManager, MemoryManager and Topology manager
- Issue link: Windows CPU and Memory Affinity #4885
- Other comments:
593f44b
to
0ecf8b5
Compare
Signed-off-by: James Sturtevant <[email protected]>
0ecf8b5
to
1204ef3
Compare
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
c5b33b5
to
3456c1c
Compare
lgtm Thanks for opening this @jsturtevant |
Signed-off-by: James Sturtevant <[email protected]>
3456c1c
to
eef698f
Compare
/milestone v1.32 |
Signed-off-by: James Sturtevant <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for working on this, @jsturtevant
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jsturtevant, kevpar The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objections for cpumanager and topologymanager, we can use some tunings on the KEP but looks straightforward and uncontroversial. I have some concerns about memory manager, let's iterate over there
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
For sig-windows, we also see a risk to enabling a feature that has already Stable or fully featured on Linux. To mitigate this risk we have opted to create a | ||
separate KEP with a feature flag so we can communicate our status effectively. | ||
|
||
Another risk is the testing implementation for these features is mostly in e2e_node which doesn't currently support Windows. As a mitigation there was [some exploration ](https://github.com/jsturtevant/kubernetes/tree/e2e_node-windows) to see if these tests could be enabled on Windows so we can progress this feature with confidence in the testing suite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e2e tests are "just" suggested for alpha, while they are beta blockers. I have no objections about starting this work in alpha with unit tests (which all need to run and pass on windows, though, but we should be here already) and work in parallel to enable e2e testing on windows.
On Windows systems with more than 64 cores the CPU's will be split into groups, | ||
each processor is identified by its group number and its group-relative processor number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this grouping somehow related to HW topology or is completely independent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
independent, though NUMA nodes will not span across groups.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. | ||
--> | ||
|
||
Yes. Restarting of the pods will be required to remove the CPU/Memory affinity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH running workload are not expected to be affected by both kubelet restart and downgrade once the feature was enabled and then disabled (right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a kubelet restart shouldn't affect running workloads.
keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md
Outdated
Show resolved
Hide resolved
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md | ||
--> | ||
|
||
We will monitor for cpu consumption to query the CPU topology. If required we may wish to implement a caching strategy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on linux the HW topology is read only once at kublet start, but there are plans to enable dynamic node resizing. Let's keep that into account when designing the windows solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During testing I was seeing repeat calls to the cadvisor.getMachineInfo()
method, which is why I mention this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough, and worth a deeper dive
Signed-off-by: James Sturtevant <[email protected]>
|
||
## Infrastructure Needed (Optional) | ||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of nodes would you need to test this features for windows?
How are you envisioning a e2e test for these features given that windows does not have e2e_node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see sig-windows re-using the existing testing infra and adding some basic tests in https://github.com/kubernetes/kubernetes/tree/master/test/e2e/windows for alpha to ensure the features work.
|
||
##### e2e tests | ||
|
||
- e2e_node will need to be enabled for windows to add coverage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a very important point. Are you okay with blocking this feature based on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal would be to enable this for Beta. I don't see a sustainable way for sig-windows to enable enough tests to ensure quality for these features without leveraging existing tests. Duplicating all the test in e2e for windows would be a large undertaking as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it should be clear that you intend to add a e2e_node like test for this one. I read this as we are going to enable all of the tests as part of this feature..
###### How can someone using this feature know that it is working for their instance? | ||
|
||
- [x] Events | ||
- Event Reason: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What events?
On Windows systems with more than 64 cores the CPU's will be split into groups, | ||
each processor is identified by its group number and its group-relative processor number. | ||
|
||
In Cri we will add the following structure to the `WindowsContainerResources` in CRI: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Cri we will add the following structure to the `WindowsContainerResources` in CRI: | |
In CRI we will add the following structure to the `WindowsContainerResources` in CRI: |
|
||
This kep outlines how to add support for the CPU, Memory and Topology Managers in kubelet for Windows. | ||
The Managers are already available and support in kubelet on Linux and there have been requests to sig-windows | ||
to add support on Windows to help with workloads that require co-located workloads. The goal of the kep is to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KEP
This kep outlines how to add support for the CPU, Memory and Topology Managers in kubelet for Windows. | ||
The Managers are already available and support in kubelet on Linux and there have been requests to sig-windows | ||
to add support on Windows to help with workloads that require co-located workloads. The goal of the kep is to | ||
add Windows support without significant changes to the Managers logic while providing the same feature sets available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be helpful to link the actual functionality KEP and documentation for reference