# KEP-1769/KEP-3570/KEP-693: Adding Windows Kubelet Manager implementation details (#4738)

**`keps/sig-node/1769-memory-manager/README.md`** (18 additions)
- [Mechanism I (pod eviction by kubelet)](#mechanism-i-pod-eviction-by-kubelet)
- [Mechanism II (Out-of-Memory (OOM) killer by kernel/OS)](#mechanism-ii-out-of-memory-oom-killer-by-kernelos)
- [Mechanism III (obey cgroup limit, by OOM killer)](#mechanism-iii-obey-cgroup-limit-by-oom-killer)
- [Windows considerations](#windows-considerations)
- [Kubelet memory management](#kubelet-memory-management)
<!-- /toc -->


### Windows considerations

NUMA node placement cannot be guaranteed via the Windows API; instead, an [ideal NUMA node](https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support#numa-support-on-systems-with-more-than-64-logical-processors) can be
configured via the [PROC_THREAD_ATTRIBUTE_PREFERRED_NODE](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute) attribute.
Using the Memory Manager's internal mapping, this should provide the desired behavior in most cases. It is still possible for a CPU to access memory from a different NUMA node than the one it is running on, resulting in decreased performance. For this reason,
we will add documentation, in addition to a warning log message in the kubelet, to help raise awareness.
If this state is undesirable, then the `single-numa-node` Topology Manager policy should be configured together with the CPU Manager,
which would force the kubelet to select a NUMA node only if it has enough memory and CPUs available. In the future, for workloads that span multiple NUMA nodes, it may be desirable for
the Topology Manager to have a new policy specific to Windows.
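As an illustration of the node-selection idea, here is a minimal sketch of deriving a single "ideal" node from a NUMA affinity bitmask hint (as the Memory Manager's internal mapping might produce) before it would be handed to `PROC_THREAD_ATTRIBUTE_PREFERRED_NODE`. `preferredNode` is a hypothetical helper for illustration, not actual kubelet code:

```go
package main

import "fmt"

// preferredNode returns the lowest-numbered NUMA node set in an
// affinity bitmask (bit i set means node i is acceptable), or -1 if
// the mask is empty. A single node like this is what the Windows
// preferred-node attribute expects, since full NUMA pinning cannot be
// guaranteed through the Windows API.
func preferredNode(affinity uint64) int {
	for i := 0; i < 64; i++ {
		if affinity&(1<<i) != 0 {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(preferredNode(0b0110)) // nodes 1 and 2 acceptable; picks 1
	fmt.Println(preferredNode(0))      // no hint: -1
}
```

Because only a preferred (not guaranteed) node can be expressed, this is exactly the case the warning log message above is meant to surface.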

#### Kubelet memory management

There is work underway to [enable the kubelet's eviction manager on Windows](https://github.com/kubernetes/kubernetes/pulls/marosset), which would follow the same patterns
as [Mechanism I](#mechanism-i-pod-eviction-by-kubelet).
Windows does not have an OOM killer, so Mechanisms II and III from the section on
[Kubernetes Node Memory Management](#kubernetes-nodes-memory-management-mechanisms-and-their-relation-to-the-memory-manager) are out of scope.
**`keps/sig-node/3570-cpumanager/README.md`** (50 additions)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Discovering CPU topology](#discovering-cpu-topology)
- [Windows CPU Discovery](#windows-cpu-discovery)
- [CPU Manager interfaces (sketch)](#cpu-manager-interfaces-sketch)
- [Configuring the CPU Manager](#configuring-the-cpu-manager)
- [Policy 1: &quot;none&quot; cpuset control [default]](#policy-1-none-cpuset-control-default)
Alternate options considered for discovering topology:
1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] --
potentially adding support for the hwloc file format to the Kubelet.

#### Windows CPU Discovery

The Windows kubelet provides an implementation of the [cadvisor API](https://github.com/kubernetes/kubernetes/blob/fbaf9b0353a61c146632ac195dfeb1fbaffcca1e/pkg/kubelet/cadvisor/cadvisor_windows.go#L50)
in order to provide Windows stats to other components without modification.
The `cadvisorapi.MachineInfo` API is already partially mapped
on the Windows client. By mapping the Windows-specific topology APIs to the
cadvisor API, no changes are required in the CPU Manager.

The [Windows concepts](https://learn.microsoft.com/en-us/windows/win32/procthread/processor-groups) are mapped to the [Linux concepts](https://github.com/kubernetes/kubernetes/blob/cede96336a809a67546ca08df0748e4253ec270d/pkg/kubelet/cm/cpumanager/topology/topology.go#L34-L39) as follows:

| Kubelet term | Description | cadvisor term | Windows term |
| --- | --- | --- | --- |
| CPU | logical CPU | thread | Logical processor |
| Core | physical CPU | Core | Core |
| Socket | socket | Socket | Physical Processor |
| NUMA Node | NUMA cell | Node | NUMA node |

The Windows APIs used will be:
- [GetLogicalProcessorInformationEx](https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex)
- [GetNumaAvailableMemoryNodeEx](https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getnumaavailablememorynodeex)

One difference between the Windows API and Linux is the concept of [processor groups](https://learn.microsoft.com/en-us/windows/win32/procthread/processor-groups).
On Windows systems with more than 64 logical processors, the CPUs are split into groups;
each processor is identified by its group number and its group-relative processor number.
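To make the group split concrete, the following sketch flattens per-group logical processor counts into the distinct kubelet-style CPU IDs discussed below (assuming the `(group * 64) + processor` numbering; `uniqueCPUIDs` is a hypothetical helper, not actual kubelet code):

```go
package main

import "fmt"

// uniqueCPUIDs flattens per-group logical processor counts into
// distinct CPU IDs using the (group*64)+processor scheme.
// counts[g] is the number of logical processors in processor group g.
func uniqueCPUIDs(counts []int) []int {
	var ids []int
	for group, n := range counts {
		for proc := 0; proc < n; proc++ {
			ids = append(ids, group*64+proc)
		}
	}
	return ids
}

func main() {
	// Two processor groups with 3 logical processors each.
	fmt.Println(uniqueCPUIDs([]int{3, 3})) // [0 1 2 64 65 66]
}
```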

We will add the following structure to `WindowsContainerResources` in the CRI:

```protobuf
message WindowsCpuGroupAffinity {
// CPU mask relative to this CPU group.
uint64 cpu_mask = 1;
// CPU group that this CPU belongs to.
uint32 cpu_group = 2;
}
```
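For illustration, the `cpu_mask` for one group could be assembled from a set of group-relative processor numbers like this (a minimal sketch; `groupAffinityMask` is a hypothetical helper, not part of the CRI):

```go
package main

import "fmt"

// groupAffinityMask builds a cpu_mask value, as carried by the proposed
// WindowsCpuGroupAffinity message, from group-relative processor
// numbers (0-63). Each processor number sets one bit in the mask;
// out-of-range values are ignored.
func groupAffinityMask(procs []uint) uint64 {
	var mask uint64
	for _, p := range procs {
		if p < 64 {
			mask |= 1 << p
		}
	}
	return mask
}

func main() {
	// Pin to group-relative processors 0, 1, and 5.
	fmt.Printf("%b\n", groupAffinityMask([]uint{0, 1, 5})) // 100011
}
```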

Since the kubelet APIs expect a distinct processor ID, the ID will be calculated as
`(group * 64) + processorId`, resulting in unique processor IDs of `0-63` for `group 0`,
`64-127` for `group 1`, and so on. When converting back to the Windows
group affinity, the group is recovered by integer division of the unique ID by 64,
and the group-relative processor number by the remainder.

There are some scenarios where the total CPU count is greater than 64 but each group holds fewer
than 64 processors. For instance, a machine could have 2 CPU groups with 35 processors each. The unique IDs using the strategy
above would be:

- CPU group 0: 0 to 34
- CPU group 1: 64 to 98
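The ID scheme and its inverse can be sketched in Go (hypothetical helpers mirroring the division/remainder arithmetic, assuming 0-based group-relative processor numbers):

```go
package main

import "fmt"

// toUniqueID converts a (group, group-relative processor) pair into a
// distinct processor ID using the (group*64)+processor scheme.
func toUniqueID(group, proc int) int {
	return group*64 + proc
}

// fromUniqueID inverts the mapping: integer division by 64 recovers
// the group, the remainder recovers the group-relative processor.
func fromUniqueID(id int) (group, proc int) {
	return id / 64, id % 64
}

func main() {
	fmt.Println(toUniqueID(1, 34)) // 98: last processor of a 35-CPU group 1
	g, p := fromUniqueID(98)
	fmt.Println(g, p) // 1 34
}
```

Round-tripping every valid (group, processor) pair through these two functions returns the original pair, which is the property the kubelet relies on when translating back to a Windows group affinity.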

**`keps/sig-node/693-topology-manager/README.md`** (9 additions)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
- [Windows considerations](#windows-considerations)
<!-- /toc -->


Multi-NUMA hardware is needed for testing of this feature. Recently, support for multi-NUMA
hardware was [added](https://github.com/kubernetes/test-infra/pull/28369) to the Kubernetes test infrastructure.

## Windows considerations

Topology Manager is already enabled on Windows in order to support the Device Manager. The same configuration options
and Production Readiness Review (PRR) answers apply to Windows, in the sense that these features can be set to `none` to disable them if a rollback is required.
The CPU Manager and Memory Manager can independently be enabled to support advanced configuration of
where affinity is applied. In the future a new policy may be required to address the unique NUMA memory management described in the
Windows section of the Memory Manager KEP.