KEP-1769/KEP-3570/KEP-693: Adding Windows Kubelet Manager implementation details #4738

Closed
29 changes: 29 additions & 0 deletions keps/sig-node/1769-memory-manager/README.md
@@ -58,6 +58,8 @@
- [Mechanism I (pod eviction by kubelet)](#mechanism-i-pod-eviction-by-kubelet)
- [Mechanism II (Out-of-Memory (OOM) killer by kernel/OS)](#mechanism-ii-out-of-memory-oom-killer-by-kernelos)
- [Mechanism III (obey cgroup limit, by OOM killer)](#mechanism-iii-obey-cgroup-limit-by-oom-killer)
- [Windows considerations](#windows-considerations)
- [Kubelet memory management](#kubelet-memory-management)
<!-- /toc -->

## Release Signoff Checklist
@@ -778,3 +780,30 @@ The Memory Manager sets and enforces cgroup memory limit for ("on behalf of") a
[hugepage-issue]: https://github.com/kubernetes/kubernetes/issues/80716
[memory-issue]: https://github.com/kubernetes/kubernetes/issues/81009

### Windows considerations

[NUMA nodes](https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support) cannot be directly assigned or guaranteed via the Windows API, but the Windows subsystem attempts to use memory assigned to the CPU to improve performance.
It is possible to indicate to a process which NUMA node is preferred, but a limitation of the Windows APIs is that [PROC_THREAD_ATTRIBUTE_PREFERRED_NODE](https://learn.microsoft.com/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute)
does not support setting multiple NUMA nodes for a single Job object (i.e. a container), so it is not usable in the context of Windows containers, which have multiple processes.

To work around these limitations, the kubelet will query the OS to get the affinity masks associated with each of the NUMA nodes selected by the memory manager and update the CPU group affinity accordingly in the CRI field. This will result in the memory from those NUMA nodes being used. There are a couple of scenarios that need to be considered; a sketch of the resulting selection logic appears after the list:

- Memory manager is enabled, CPU manager is not: the kubelet will look up all the CPUs associated with the selected NUMA nodes and assign the CPU group affinity. For example, if NUMA node 0 is selected by the memory manager, and NUMA node 0 holds the first four CPUs of Windows CPU group 0, the result would be `cpu affinity: 0000001111, group 0`.
- Memory manager is enabled, CPU manager is enabled:
  - The CPU manager selects fewer CPUs than are associated with the NUMA nodes chosen by the memory manager, and those CPUs fall within those NUMA nodes: the kubelet will set only the CPUs selected by the CPU manager, since the memory from the selected NUMA nodes will be used by default.
  - The CPU manager selects more CPUs than are associated with the chosen NUMA nodes, and those CPUs fall within or outside them: the kubelet will set only the CPUs selected by the CPU manager.
  - The CPU manager selects fewer CPUs than are associated with the chosen NUMA nodes, and those CPUs fall outside them: the kubelet will set the CPUs selected by the CPU manager plus all the CPUs associated with the NUMA nodes.
Review comment on lines +793 to +795:

> I assume by "cpu manager selects fewer CPU's than Numa nodes" what we really mean is "cpu manager selects fewer CPU's than the CPU's associated with the Numa nodes selected by the memory manager"?

Contributor Author:

> Yes, I can update the wording to be clearer.

Using the memory manager's internal mapping, this should provide the desired behavior in most cases. Since memory affinity isn't guaranteed, it is possible that a CPU could access memory from a different NUMA
node than the one it is currently in, resulting in decreased performance. For this reason, we will add documentation, a warning log message in the kubelet, and a warning event
to help raise awareness of this possibility. If access to memory from CPUs outside the assigned NUMA node is undesirable, then the `single-numa-node`
Topology Manager policy should be configured along with the CPU manager, which would force the kubelet to select a NUMA node only if it has enough memory
and CPUs available. In the future, for workloads that span multiple NUMA nodes, it may be desirable for the Topology Manager to have a new Windows-specific policy.
This would require a separate KEP to add a new policy.

#### Kubelet memory management

Windows support for [kubelet's memory eviction](https://github.com/kubernetes/kubernetes/pull/122922) was enabled in 1.31 and follows the same patterns
as [Mechanism I](#mechanism-i-pod-eviction-by-kubelet).
Windows does not have an OOM killer, so Mechanisms II and III described in the section on
[Kubernetes node memory management](#kubernetes-nodes-memory-management-mechanisms-and-their-relation-to-the-memory-manager) are out of scope.
73 changes: 73 additions & 0 deletions keps/sig-node/3570-cpumanager/README.md
@@ -14,6 +14,7 @@
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Discovering CPU topology](#discovering-cpu-topology)
- [Windows CPU Discovery](#windows-cpu-discovery)
- [CPU Manager interfaces (sketch)](#cpu-manager-interfaces-sketch)
- [Configuring the CPU Manager](#configuring-the-cpu-manager)
- [Policy 1: &quot;none&quot; cpuset control [default]](#policy-1-none-cpuset-control-default)
@@ -208,6 +209,78 @@ Alternate options considered for discovering topology:
1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] --
potentially adding support for the hwloc file format to the Kubelet.

#### Windows CPU Discovery

The Windows kubelet provides an implementation of the [cadvisor API](https://github.com/kubernetes/kubernetes/blob/fbaf9b0353a61c146632ac195dfeb1fbaffcca1e/pkg/kubelet/cadvisor/cadvisor_windows.go#L50)
in order to provide Windows stats to other components without modification.
The `cadvisorapi.MachineInfo` API is already partially implemented
on the Windows client. By mapping the Windows-specific topology APIs to the
cadvisor API, no changes are required in the CPU Manager.

The [Windows concepts](https://learn.microsoft.com/windows/win32/procthread/processor-groups) map to the [Linux concepts](https://github.com/kubernetes/kubernetes/blob/cede96336a809a67546ca08df0748e4253ec270d/pkg/kubelet/cm/cpumanager/topology/topology.go#L34-L39) as follows:

| Kubelet term | Description | cadvisor term | Windows term |
| --- | --- | --- | --- |
| CPU | logical CPU | thread | logical processor |
| Core | physical CPU | core | core |
| Socket | socket | socket | physical processor |
| NUMA Node | NUMA cell | node | NUMA node |
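
To make the mapping concrete, here is a simplified sketch of folding Windows logical processors into the NUMA-node/core/thread shape the CPU Manager consumes. The struct and field names are stand-ins, not the real cadvisor types, and unique CPU IDs use the `(group * 64) + processorId` scheme described below:

```go
// Simplified stand-ins for the cadvisor topology types.
type core struct {
	ID      int
	Threads []int // kubelet CPU IDs (Windows logical processors)
}

type numaNode struct {
	ID    int
	Cores []core
}

// logicalProcessor is a hypothetical flattened record derived from the
// Windows topology APIs listed below.
type logicalProcessor struct {
	NUMANodeID, CoreID, Group, GroupIndex int
}

// machineTopology groups logical processors into NUMA nodes and cores.
func machineTopology(lps []logicalProcessor) []numaNode {
	threads := map[int]map[int][]int{} // NUMA node ID -> core ID -> CPU IDs
	for _, lp := range lps {
		id := lp.Group*64 + lp.GroupIndex // unique kubelet CPU ID
		if threads[lp.NUMANodeID] == nil {
			threads[lp.NUMANodeID] = map[int][]int{}
		}
		threads[lp.NUMANodeID][lp.CoreID] = append(threads[lp.NUMANodeID][lp.CoreID], id)
	}
	var nodes []numaNode
	for nid, cores := range threads {
		n := numaNode{ID: nid}
		for cid, t := range cores {
			n.Cores = append(n.Cores, core{ID: cid, Threads: t})
		}
		nodes = append(nodes, n)
	}
	return nodes
}
```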

The Windows APIs used will be:

- [GetLogicalProcessorInformationEx](https://learn.microsoft.com/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex)
- [GetNumaAvailableMemoryNodeEx](https://learn.microsoft.com/windows/win32/api/winbase/nf-winbase-getnumaavailablememorynodeex)
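
For illustration, a sketch of calling `GetLogicalProcessorInformationEx` from Go via a lazy DLL lookup. The two-call size-probe pattern and the `RelationAll` constant follow the documented Win32 convention; this is not the kubelet's actual code:

```go
//go:build windows

package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/windows"
)

func main() {
	// RelationAll (0xffff) requests every relationship type: cores,
	// NUMA nodes, caches, processor groups.
	const relationAll = 0xffff

	proc := windows.NewLazySystemDLL("kernel32.dll").NewProc("GetLogicalProcessorInformationEx")

	// First call with a nil buffer fails with ERROR_INSUFFICIENT_BUFFER
	// and writes the required byte count into size.
	var size uint32
	proc.Call(relationAll, 0, uintptr(unsafe.Pointer(&size)))
	if size == 0 {
		panic("size query failed")
	}

	buf := make([]byte, size)
	ret, _, err := proc.Call(relationAll, uintptr(unsafe.Pointer(&buf[0])), uintptr(unsafe.Pointer(&size)))
	if ret == 0 {
		panic(err)
	}
	fmt.Printf("topology blob: %d bytes\n", size)
	// Walking the variable-length SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
	// records in buf is omitted here.
}
```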

One difference between the Windows API and Linux is the concept of [processor groups](https://learn.microsoft.com/windows/win32/procthread/processor-groups).
On Windows systems with more than 64 logical processors, the CPUs are split into groups;
each processor is identified by its group number and its group-relative processor number.

We will add the following structure to `WindowsContainerResources` in the CRI:

```protobuf
message WindowsCpuGroupAffinity {
    // CPU mask relative to this CPU group.
    uint64 cpu_mask = 1;
    // CPU group that this CPU belongs to.
    uint32 cpu_group = 2;
}
```

Since the kubelet APIs expect a distinct processor ID, processor IDs will be calculated by looping
through the mask and computing each ID as `(group * 64) + processorId`, resulting in unique processor IDs of `0-63` for `group 0`,
`64-127` for `group 1`, and so on. This translation is done only in the kubelet; the `cpu_mask` is used when
communicating with the container runtime.

```go
// cpusOf converts a group-relative affinity mask into the kubelet's
// unique processor IDs using (group * 64) + bit position.
func cpusOf(group int, mask uint64) []int {
	var processors []int
	for i := 0; i < 64; i++ {
		if mask&(1<<i) != 0 {
			processors = append(processors, i+group*64)
		}
	}
	return processors
}
```

Using this logic, a CPU bit mask of `0000111` (leading zeros truncated) would result in CPUs:

- `0,1,2` in `group 0`
- `64,65,66` in `group 1`.

When converting back to the Windows group affinity, we divide the CPU number by 64 to get the group number, then
take the CPU number mod 64 to find its bit position in the group's mask:

```go
group := cpu / 64               // Windows processor group number
mask := uint64(1) << (cpu % 64) // bit position within that group's mask

groupAffinity.Mask |= mask
```

There are some scenarios where the total CPU count is greater than 64 but each group holds fewer
than 64 CPUs. For instance, you could have 2 CPU groups with 35 processors each. The unique IDs using the strategy
above would be:

- CPU group 0: 0 to 34
- CPU group 1: 64 to 98
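
As a quick check of the ID scheme, a runnable sketch reusing the `cpusOf` translation above (the mask value is illustrative):

```go
package main

import "fmt"

// cpusOf is the mask-to-ID translation sketched earlier.
func cpusOf(group int, mask uint64) []int {
	var processors []int
	for i := 0; i < 64; i++ {
		if mask&(1<<i) != 0 {
			processors = append(processors, i+group*64)
		}
	}
	return processors
}

func main() {
	mask := uint64(1)<<35 - 1 // low 35 bits set: 35 logical processors
	g0 := cpusOf(0, mask)
	g1 := cpusOf(1, mask)
	fmt.Println(g0[0], g0[len(g0)-1]) // 0 34
	fmt.Println(g1[0], g1[len(g1)-1]) // 64 98
}
```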

### CPU Manager interfaces (sketch)

9 changes: 9 additions & 0 deletions keps/sig-node/693-topology-manager/README.md
@@ -53,6 +53,7 @@
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
- [Windows considerations](#windows-considerations)
<!-- /toc -->

## Release Signoff Checklist
@@ -929,3 +930,11 @@ allocation and thread scheduling, but does not address device locality.

Multi-NUMA hardware is needed for testing of this feature. Recently, support for multi-NUMA
hardware was [added](https://github.com/kubernetes/test-infra/pull/28369) to the Kubernetes test infrastructure.

## Windows considerations

The Topology Manager is already enabled on Windows in order to support the device manager. Since there are no changes to the
Topology Manager, the answers in the [Production Readiness Review](#production-readiness-review-questionnaire) section also apply to Windows when the CPU and memory managers are
added as hint providers. The CPU manager and memory manager can be enabled or disabled independently, to support cases where a feature needs to be shut off.
In the future, a new policy (and new KEP) for the Topology Manager may be required to address the unique Windows NUMA memory management requirements described in the Windows considerations section of the Memory Manager KEP.