diff --git a/keps/prod-readiness/sig-windows/4885.yaml b/keps/prod-readiness/sig-windows/4885.yaml new file mode 100644 index 00000000000..677242775f3 --- /dev/null +++ b/keps/prod-readiness/sig-windows/4885.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 4885 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md b/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md new file mode 100644 index 00000000000..07af5428f36 --- /dev/null +++ b/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md @@ -0,0 +1,594 @@ +# KEP-4885: Windows CPU and Memory Affinity + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Windows CPU Discovery](#windows-cpu-discovery) + - [Windows Memory considerations](#windows-memory-considerations) + - [Kubelet memory management](#kubelet-memory-management) + - [Windows Topology manager considerations](#windows-topology-manager-considerations) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Deprecation](#deprecation) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and 
Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [x] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + 
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This KEP outlines how to add support for the CPU, Memory, and Topology Managers in the kubelet for Windows.
+The Managers are already available and supported in the kubelet on Linux, and there have been requests to sig-windows
+to add support on Windows to help workloads that require co-location. The goal of this KEP is to
+add Windows support without significant changes to the Managers' logic while providing the same feature set available
+on Linux today.
+
+## Motivation
+
+Currently, co-hosting low latency workloads on the same Windows Server nodes creates noisy neighbor behaviors
+that prevent those workloads from achieving their expected performance goals.
+The CPU, Memory, and Topology Managers are needed to add the necessary isolation to accomplish both high performance and co-hosting efficiency.
+These features are enabled and available on Linux, and Windows users are asking for the same capabilities on Windows.
+
+### Goals
+
+- Enable the CPU Manager for Windows, allowing CPU affinity for configured pods
+- Enable the Memory Manager for Windows, allowing memory affinity for configured pods
+- Enable the Topology Manager for Windows, allowing coordination of memory and CPU affinity at the node level for scheduled pods
+
+### Non-Goals
+
+- Create new managers; we will instead re-use the existing Manager logic.
+- Modify or bypass any existing feature-gated behavior. Existing policy feature gates will still be used to progress specific policies related to the managers.
+
+## Proposal
+
+The proposal requires very few changes to the managers' code; instead it maps the [Windows](https://learn.microsoft.com/en-us/windows/win32/procthread/processor-groups) processor concepts into cAdvisor terms to populate the [topology structure in kubelet](https://github.com/kubernetes/kubernetes/blob/cede96336a809a67546ca08df0748e4253ec270d/pkg/kubelet/cm/cpumanager/topology/topology.go#L34-L39).
+
+There are no plans to change the core logic for selecting CPUs and NUMA nodes in the CPU/Memory/Topology Managers from the existing KEPs ([memory-manager](keps/sig-node/1769-memory-manager)/[cpu-manager](keps/sig-node/3570-cpu-manager)/[topology-manager](keps/sig-node/693-topology-manager)). The logic currently lives in platform-agnostic
+structures, so the selection process does not require changes for adoption on Windows. The Windows-specific considerations for each of the managers are covered in separate sections of this document.
+
+
+### User Stories (Optional)
+
+The user stories on Windows are similar to those on Linux:
+
+https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#user-stories-optional
+https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#user-stories
+https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#user-stories-optional
+
+### Notes/Constraints/Caveats (Optional)
+
+Windows does not have an API to constrain workloads to a specific NUMA node. This is addressed in the Memory Manager section below.
+
+### Risks and Mitigations
+
+
+The technical risks are the same as in the existing KEPs:
+ - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#risks-and-mitigations
+ - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#risks-and-mitigations
+ - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#risks-and-mitigations
+
+For sig-windows, we also see a risk in enabling a feature that is already Stable and fully featured on Linux. To mitigate this risk we have opted to create a
+separate KEP with its own feature flag so we can communicate our status effectively.
+
+Another risk is that the testing for these features lives mostly in e2e_node, which doesn't currently support Windows. As a mitigation, there was [some exploration](https://github.com/jsturtevant/kubernetes/tree/e2e_node-windows) to see if these tests could be enabled on Windows so we can progress this feature with confidence in the testing suite.
+
+## Design Details
+
+### Windows CPU Discovery
+
+The Windows kubelet provides an implementation of the [cadvisor api](https://github.com/kubernetes/kubernetes/blob/fbaf9b0353a61c146632ac195dfeb1fbaffcca1e/pkg/kubelet/cadvisor/cadvisor_windows.go#L50)
+in order to provide Windows stats to other components without modification.
+The `cadvisorapi.MachineInfo` API is already partially populated
+on the Windows client. By mapping the Windows-specific topology APIs into the
+cAdvisor API, no changes are required in the CPU Manager.
+
+The [Windows concepts](https://learn.microsoft.com/windows/win32/procthread/processor-groups) are mapped to the [Linux concepts](https://github.com/kubernetes/kubernetes/blob/cede96336a809a67546ca08df0748e4253ec270d/pkg/kubelet/cm/cpumanager/topology/topology.go#L34-L39) as follows:
+
+| Kubelet Term | Description | Cadvisor term | Windows term |
+| --- | --- | --- | --- |
+| CPU | logical CPU | thread | Logical processor |
+| Core | physical CPU | Core | Core |
+| Socket | socket | Socket | Physical Processor |
+| NUMA Node | NUMA cell | Node | NUMA node |
+
+This mapping gives the following output from the CPU Manager after the conversion into kubelet's in-memory structure:
+
+```json
+"Detected CPU topology"
+topology={"NumCPUs":8,"NumCores":4,"NumSockets":1,"NumNUMANodes":1,"CPUDetails":{
+"0":{"NUMANodeID":0,"SocketID":1,"CoreID":0},
+"1":{"NUMANodeID":0,"SocketID":1,"CoreID":0},
+"2":{"NUMANodeID":0,"SocketID":1,"CoreID":2},
+"3":{"NUMANodeID":0,"SocketID":1,"CoreID":2},
+"4":{"NUMANodeID":0,"SocketID":1,"CoreID":4},
+"5":{"NUMANodeID":0,"SocketID":1,"CoreID":4},
+"6":{"NUMANodeID":0,"SocketID":1,"CoreID":6},
+"7":{"NUMANodeID":0,"SocketID":1,"CoreID":6}}}
+```
+
+The Windows APIs used will be:
+- [getlogicalprocessorinformationex](https://learn.microsoft.com/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex)
+- [nf-winbase-getnumaavailablememorynodeex](https://learn.microsoft.com/windows/win32/api/winbase/nf-winbase-getnumaavailablememorynodeex)
+
+One difference between the Windows API and Linux is the concept of [processor groups](https://learn.microsoft.com/windows/win32/procthread/processor-groups).
+On Windows systems with more than 64 logical processors, the CPUs are split into groups;
+each processor is identified by its group number and its group-relative processor number.
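Since kubelet's topology structures key CPUs by a single flat processor ID, the group and group-relative processor number pair has to be folded into one ID space. The sketch below is illustrative only: the `winLogicalProcessor` type, its fields, and `flatten` are hypothetical stand-ins for the data returned by the Windows topology APIs, not kubelet's actual implementation.

```golang
package main

import "fmt"

// winLogicalProcessor is a hypothetical, simplified view of one logical
// processor as discovered from the Windows topology APIs.
type winLogicalProcessor struct {
	Group      int // processor group (up to 64 logical processors per group)
	GroupProc  int // processor number relative to its group
	CoreID     int // physical core
	SocketID   int // physical processor (socket)
	NUMANodeID int // NUMA node
}

// cpuDetail mirrors the shape of an entry in kubelet's topology.CPUDetails.
type cpuDetail struct {
	NUMANodeID, SocketID, CoreID int
}

// flatten assigns each logical processor a distinct ID using
// (group * 64) + group-relative processor number, so group 0 yields
// IDs 0-63, group 1 yields IDs 64-127, and so on.
func flatten(procs []winLogicalProcessor) map[int]cpuDetail {
	details := make(map[int]cpuDetail)
	for _, p := range procs {
		id := p.Group*64 + p.GroupProc
		details[id] = cpuDetail{NUMANodeID: p.NUMANodeID, SocketID: p.SocketID, CoreID: p.CoreID}
	}
	return details
}

func main() {
	procs := []winLogicalProcessor{
		{Group: 0, GroupProc: 0, CoreID: 0, SocketID: 0, NUMANodeID: 0},
		{Group: 0, GroupProc: 1, CoreID: 0, SocketID: 0, NUMANodeID: 0},
		{Group: 1, GroupProc: 0, CoreID: 32, SocketID: 1, NUMANodeID: 1},
	}
	for id, d := range flatten(procs) {
		fmt.Printf("CPU %d -> NUMANode %d, Socket %d, Core %d\n", id, d.NUMANodeID, d.SocketID, d.CoreID)
	}
}
```

With groups of at most 64 logical processors, the flattened IDs are unique by construction, which is the property kubelet's topology code relies on.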
+
+In CRI, we will add the following structure to the `WindowsContainerResources` message:
+
+```protobuf
+message WindowsCpuGroupAffinity {
+    // CPU mask relative to this CPU group.
+    uint64 cpu_mask = 1;
+    // CPU group that this CPU belongs to.
+    uint32 cpu_group = 2;
+}
+```
+
+Since the kubelet APIs expect a distinct ProcessorId, the processor IDs will be calculated by looping
+through the mask and computing `(group * 64) + processorid`, resulting in unique processor IDs of `0-63` for `group 0`,
+`64-127` for `group 1`, and so on. This translation is done only in the kubelet; the `cpu_mask` will be used when
+communicating with the container runtime.
+
+```golang
+for i := 0; i < 64; i++ {
+    if groupaffinity.Mask&(1<<i) != 0 {
+        processorIds = append(processorIds, (int(groupaffinity.Group)*64)+i)
+    }
+}
+```
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: WindowsCPUAndMemoryAffinity
+  - Components depending on the feature gate: kubelet
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+    No
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+    Yes, it uses a feature gate. The Memory and CPU Managers have a state file that requires cleanup.
+
+###### Does enabling the feature change any default behavior?
+
+No. Additional settings are required to enable the features. The default policies for the CPU/Memory Managers will be `None`, meaning they will not interact with running pods.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes. Restarting the pods will be required to remove the CPU/memory affinity.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+Impact is node local and doesn't affect the rest of the cluster.
+
+It is possible that the state file from the Memory/CPU Manager will have inconsistent data during the rollout because of the kubelet restart, but you can easily fix it by removing the Memory Manager state file and restarting the kubelet. It should not affect any running workloads.
+
+###### What specific metrics should inform a rollback?
+
+A pod may fail with an admission error because the kubelet cannot provide all of the requested resources. You can see the error messages under the pod events.
+
+There are existing metrics provided by the Managers that can be monitored:
+
+```golang
+// Metrics to track the CPU manager behavior
+CPUManagerPinningRequestsTotalKey = "cpu_manager_pinning_requests_total"
+CPUManagerPinningErrorsTotalKey = "cpu_manager_pinning_errors_total"
+CPUManagerSharedPoolSizeMilliCoresKey = "cpu_manager_shared_pool_size_millicores"
+CPUManagerExclusiveCPUsAllocationCountKey = "cpu_manager_exclusive_cpu_allocation_count"
+
+// Metrics to track the Memory manager behavior
+MemoryManagerPinningRequestsTotalKey = "memory_manager_pinning_requests_total"
+MemoryManagerPinningErrorsTotalKey = "memory_manager_pinning_errors_total"
+
+// Metrics to track the Topology manager behavior
+TopologyManagerAdmissionRequestsTotalKey = "topology_manager_admission_requests_total"
+TopologyManagerAdmissionErrorsTotalKey = "topology_manager_admission_errors_total"
+TopologyManagerAdmissionDurationKey = "topology_manager_admission_duration_ms"
+```
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+
+### Monitoring Requirements
+
+We will use the existing metrics provided by the CPU/Memory Managers:
+
+https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#monitoring-requirements
+https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#monitoring-requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+The Memory/CPU Manager assignments will be visible under the Pod Resources API, and there are proposed metrics to improve this in [kubernetes/kubernetes#127155](https://github.com/kubernetes/kubernetes/pull/127155).
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [x] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+n/a
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+These will be the same as for the CPU/Memory/Topology Managers.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+Since the CPU/Memory/Topology Managers are already implemented, most of the metrics already exist. If we find missing
+metrics on Windows, we will address them as we move to Beta/Stable.
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+This will require changes to the CRI API and to containerd's Windows support.
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+We will monitor the CPU consumption of querying the CPU topology. If required, we may implement a caching strategy while also
+supporting any future work on dynamic node resizing.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+Memory and CPUs could be exhausted, resulting in pods not being scheduled.
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+N/A
+
+###### What are other known failure modes?
+
+The failure modes for pods on the node are the same as in the CPU/Memory/Topology Managers.
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+## Infrastructure Needed (Optional)
+
diff --git a/keps/sig-windows/4885-windows-cpu-and-memory-affinity/kep.yaml b/keps/sig-windows/4885-windows-cpu-and-memory-affinity/kep.yaml
new file mode 100644
index 00000000000..72a2256fe12
--- /dev/null
+++ b/keps/sig-windows/4885-windows-cpu-and-memory-affinity/kep.yaml
@@ -0,0 +1,48 @@
+title: Windows CPU and Memory Affinity
+kep-number: 4885
+authors:
+  - "@jsturtevant"
+owning-sig: sig-windows
+participating-sigs:
+  - sig-node
+status: implementable
+creation-date: 2024-09-03
+reviewers:
+  - "@ffromani"
+  - "@aravindhp"
+  - "@kiashok"
+approvers:
+  - "@mrunalp"
+  - "@marosset"
+
+see-also:
+  - "keps/sig-node/1769-memory-manager"
+  - "keps/sig-node/3570-cpu-manager"
+  - "keps/sig-node/693-topology-manager"
+replaces:
+
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.32" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.32" + beta: "" + stable: "" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: WindowsCPUAndMemoryAffinity + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: