KEP-1769/KEP-3570/KEP-693: Adding Windows Kubelet Manager implementation details #4738

Closed
29 changes: 29 additions & 0 deletions keps/sig-node/1769-memory-manager/README.md
@@ -58,6 +58,8 @@
- [Mechanism I (pod eviction by kubelet)](#mechanism-i-pod-eviction-by-kubelet)
- [Mechanism II (Out-of-Memory (OOM) killer by kernel/OS)](#mechanism-ii-out-of-memory-oom-killer-by-kernelos)
- [Mechanism III (obey cgroup limit, by OOM killer)](#mechanism-iii-obey-cgroup-limit-by-oom-killer)
- [Windows considerations](#windows-considerations)
- [Kubelet memory management](#kubelet-memory-management)
<!-- /toc -->

## Release Signoff Checklist
@@ -778,3 +780,30 @@ The Memory Manager sets and enforces cgroup memory limit for ("on behalf of") a
[hugepage-issue]: https://github.com/kubernetes/kubernetes/issues/80716
[memory-issue]: https://github.com/kubernetes/kubernetes/issues/81009

### Windows considerations

[NUMA nodes](https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support) cannot be directly assigned or guaranteed via the Windows API, but the Windows subsystem attempts to use memory assigned to the CPU to improve performance.
It is possible to indicate to a process which NUMA node is preferred, but a limitation of the Windows APIs is that [PROC_THREAD_ATTRIBUTE_PREFERRED_NODE](https://learn.microsoft.com/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute)
does not support setting multiple NUMA nodes for a single Job object (i.e. a container), so it is not usable in the context of Windows containers, which have multiple processes.

To work around these limitations, the kubelet will query the OS to get the affinity masks associated with each of the NUMA nodes selected by the memory manager and update the CPU group affinity accordingly in the CRI field. This will result in the memory from those NUMA nodes being used. There are a couple of scenarios that need to be considered; a sketch of the resulting selection logic appears after the list:

- Memory manager is enabled, CPU manager is not: the kubelet will look up all the CPUs associated with the selected NUMA nodes and assign the CPU group affinity. For example, if NUMA node 0 is selected by the memory manager, and NUMA node 0 holds the first four CPUs of Windows CPU group 0, the result would be `cpu affinity: 0000001111, group 0`.
- Memory manager is enabled, CPU manager is enabled:
  - The CPU manager selects fewer CPUs than are associated with the NUMA nodes chosen by the memory manager, and those CPUs fall within those NUMA nodes: the kubelet will set only the CPUs selected by the CPU manager, since the memory from the selected NUMA nodes will be used by default.
  - The CPU manager selects more CPUs than are associated with the chosen NUMA nodes, and those CPUs fall within or outside them: the kubelet will set only the CPUs selected by the CPU manager.
  - The CPU manager selects fewer CPUs than are associated with the chosen NUMA nodes, and those CPUs fall outside them: the kubelet will set the CPUs selected by the CPU manager plus all the CPUs associated with the NUMA nodes.
Review comment on lines +793 to +795:

> I assume by "cpu manager selects fewer CPU's than Numa nodes" what we really mean is "cpu manager selects fewer CPU's than the CPU's associated with the Numa nodes selected by the memory manager"?

Contributor Author:

> Yes, I can update the wording to be clearer.

Using the memory manager's internal mapping, this should provide the desired behavior in most cases. Since memory affinity isn't guaranteed, it is possible that a CPU could access memory from a different NUMA
node than the one it is currently in, resulting in decreased performance. For this reason, we will add documentation, a warning log message in the kubelet, and a warning event
to help raise awareness of this possibility. If access to memory from CPUs outside the assigned NUMA node is undesirable, then the `single-numa-node`
Topology Manager policy should be configured along with the CPU manager, which would force the kubelet to select a NUMA node only if it has enough memory
and CPUs available. In the future, for workloads that span multiple NUMA nodes, it may be desirable for the Topology Manager to have a new Windows-specific policy.
This would require a separate KEP to add a new policy.

#### Kubelet memory management

Windows support for [kubelet's memory eviction](https://github.com/kubernetes/kubernetes/pull/122922) was enabled in 1.31 and follows the same patterns
as [Mechanism I](#mechanism-i-pod-eviction-by-kubelet).
Windows does not have an OOM killer, so Mechanisms II and III described in the section on
[Kubernetes node memory management](#kubernetes-nodes-memory-management-mechanisms-and-their-relation-to-the-memory-manager) are out of scope.
73 changes: 73 additions & 0 deletions keps/sig-node/3570-cpumanager/README.md
@@ -14,6 +14,7 @@
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Discovering CPU topology](#discovering-cpu-topology)
- [Windows CPU Discovery](#windows-cpu-discovery)
- [CPU Manager interfaces (sketch)](#cpu-manager-interfaces-sketch)
- [Configuring the CPU Manager](#configuring-the-cpu-manager)
- [Policy 1: &quot;none&quot; cpuset control [default]](#policy-1-none-cpuset-control-default)
@@ -208,6 +209,78 @@ Alternate options considered for discovering topology:
1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] --
potentially adding support for the hwloc file format to the Kubelet.

#### Windows CPU Discovery

The Windows kubelet provides an implementation of the [cadvisor API](https://github.com/kubernetes/kubernetes/blob/fbaf9b0353a61c146632ac195dfeb1fbaffcca1e/pkg/kubelet/cadvisor/cadvisor_windows.go#L50)
in order to provide Windows stats to other components without modification.
The `cadvisorapi.MachineInfo` API is already partially implemented
on the Windows client. By mapping the Windows-specific topology APIs to the
cadvisor API, no changes are required in the CPU Manager.

The [Windows concepts](https://learn.microsoft.com/windows/win32/procthread/processor-groups) map to the [Linux concepts](https://github.com/kubernetes/kubernetes/blob/cede96336a809a67546ca08df0748e4253ec270d/pkg/kubelet/cm/cpumanager/topology/topology.go#L34-L39) as follows:

| Kubelet term | Description | cadvisor term | Windows term |
| --- | --- | --- | --- |
| CPU | logical CPU | thread | logical processor |
| Core | physical CPU | core | core |
| Socket | socket | socket | physical processor |
| NUMA Node | NUMA cell | node | NUMA node |
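
To make the mapping concrete, here is a simplified sketch of folding Windows logical processors into the NUMA-node/core/thread shape the CPU Manager consumes. The struct and field names are stand-ins, not the real cadvisor types, and unique CPU IDs use the `(group * 64) + processorId` scheme described below:

```go
// Simplified stand-ins for the cadvisor topology types.
type core struct {
	ID      int
	Threads []int // kubelet CPU IDs (Windows logical processors)
}

type numaNode struct {
	ID    int
	Cores []core
}

// logicalProcessor is a hypothetical flattened record derived from the
// Windows topology APIs listed below.
type logicalProcessor struct {
	NUMANodeID, CoreID, Group, GroupIndex int
}

// machineTopology groups logical processors into NUMA nodes and cores.
func machineTopology(lps []logicalProcessor) []numaNode {
	threads := map[int]map[int][]int{} // NUMA node ID -> core ID -> CPU IDs
	for _, lp := range lps {
		id := lp.Group*64 + lp.GroupIndex // unique kubelet CPU ID
		if threads[lp.NUMANodeID] == nil {
			threads[lp.NUMANodeID] = map[int][]int{}
		}
		threads[lp.NUMANodeID][lp.CoreID] = append(threads[lp.NUMANodeID][lp.CoreID], id)
	}
	var nodes []numaNode
	for nid, cores := range threads {
		n := numaNode{ID: nid}
		for cid, t := range cores {
			n.Cores = append(n.Cores, core{ID: cid, Threads: t})
		}
		nodes = append(nodes, n)
	}
	return nodes
}
```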

The Windows APIs used will be:

- [GetLogicalProcessorInformationEx](https://learn.microsoft.com/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex)
- [GetNumaAvailableMemoryNodeEx](https://learn.microsoft.com/windows/win32/api/winbase/nf-winbase-getnumaavailablememorynodeex)
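
For illustration, a sketch of calling `GetLogicalProcessorInformationEx` from Go via a lazy DLL lookup. The two-call size-probe pattern and the `RelationAll` constant follow the documented Win32 convention; this is not the kubelet's actual code:

```go
//go:build windows

package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/windows"
)

func main() {
	// RelationAll (0xffff) requests every relationship type: cores,
	// NUMA nodes, caches, processor groups.
	const relationAll = 0xffff

	proc := windows.NewLazySystemDLL("kernel32.dll").NewProc("GetLogicalProcessorInformationEx")

	// First call with a nil buffer fails with ERROR_INSUFFICIENT_BUFFER
	// and writes the required byte count into size.
	var size uint32
	proc.Call(relationAll, 0, uintptr(unsafe.Pointer(&size)))
	if size == 0 {
		panic("size query failed")
	}

	buf := make([]byte, size)
	ret, _, err := proc.Call(relationAll, uintptr(unsafe.Pointer(&buf[0])), uintptr(unsafe.Pointer(&size)))
	if ret == 0 {
		panic(err)
	}
	fmt.Printf("topology blob: %d bytes\n", size)
	// Walking the variable-length SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
	// records in buf is omitted here.
}
```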

One difference between the Windows API and Linux is the concept of [processor groups](https://learn.microsoft.com/windows/win32/procthread/processor-groups).
On Windows systems with more than 64 logical processors, the CPUs are split into groups;
each processor is identified by its group number and its group-relative processor number.

We will add the following structure to `WindowsContainerResources` in the CRI:

```protobuf
message WindowsCpuGroupAffinity {
    // CPU mask relative to this CPU group.
    uint64 cpu_mask = 1;
    // CPU group that this CPU belongs to.
    uint32 cpu_group = 2;
}
```

Since the kubelet APIs expect a distinct processor ID, processor IDs will be calculated by looping
through the mask and computing each ID as `(group * 64) + processorId`, resulting in unique processor IDs of `0-63` for `group 0`,
`64-127` for `group 1`, and so on. This translation is done only in the kubelet; the `cpu_mask` is used when
communicating with the container runtime.

```go
// cpusOf converts a group-relative affinity mask into the kubelet's
// unique processor IDs using (group * 64) + bit position.
func cpusOf(group int, mask uint64) []int {
	var processors []int
	for i := 0; i < 64; i++ {
		if mask&(1<<i) != 0 {
			processors = append(processors, i+group*64)
		}
	}
	return processors
}
```

Using this logic, a CPU bit mask of `0000111` (leading zeros truncated) would result in CPUs:

- `0,1,2` in `group 0`
- `64,65,66` in `group 1`.

When converting back to the Windows group affinity, we divide the CPU number by 64 to get the group number, then
take the CPU number mod 64 to find its bit position in the group's mask:

```go
group := cpu / 64               // Windows processor group number
mask := uint64(1) << (cpu % 64) // bit position within that group's mask

groupAffinity.Mask |= mask
```

There are some scenarios where the total CPU count is greater than 64 but each group holds fewer
than 64 CPUs. For instance, you could have 2 CPU groups with 35 processors each. The unique IDs using the strategy
above would be:

- CPU group 0: 0 to 34
- CPU group 1: 64 to 98
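
As a quick check of the ID scheme, a runnable sketch reusing the `cpusOf` translation above (the mask value is illustrative):

```go
package main

import "fmt"

// cpusOf is the mask-to-ID translation sketched earlier.
func cpusOf(group int, mask uint64) []int {
	var processors []int
	for i := 0; i < 64; i++ {
		if mask&(1<<i) != 0 {
			processors = append(processors, i+group*64)
		}
	}
	return processors
}

func main() {
	mask := uint64(1)<<35 - 1 // low 35 bits set: 35 logical processors
	g0 := cpusOf(0, mask)
	g1 := cpusOf(1, mask)
	fmt.Println(g0[0], g0[len(g0)-1]) // 0 34
	fmt.Println(g1[0], g1[len(g1)-1]) // 64 98
}
```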

### CPU Manager interfaces (sketch)

9 changes: 9 additions & 0 deletions keps/sig-node/693-topology-manager/README.md
@@ -53,6 +53,7 @@
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
- [Windows considerations](#windows-considerations)
<!-- /toc -->

## Release Signoff Checklist
@@ -929,3 +930,11 @@ allocation and thread scheduling, but does not address device locality.

Multi-NUMA hardware is needed for testing of this feature. Recently, support for multi-NUMA
hardware was [added](https://github.com/kubernetes/test-infra/pull/28369) to the Kubernetes test infrastructure.

## Windows considerations

The Topology Manager is already enabled on Windows in order to support the device manager. Since there are no changes to the
Topology Manager, the answers in the [Production Readiness Review](#production-readiness-review-questionnaire) section also apply to Windows when the CPU and memory managers are
added as hint providers. The CPU manager and memory manager can be enabled or disabled independently, to support cases where a feature needs to be shut off.
In the future, a new policy (and new KEP) for the Topology Manager may be required to address the unique Windows NUMA memory management requirements described in the Windows considerations section of the Memory Manager KEP.