
Commit

Update based on feedback
Signed-off-by: James Sturtevant <[email protected]>
jsturtevant committed Jun 28, 2024
1 parent d07f1a6 commit b5669c3
Showing 3 changed files with 239 additions and 182 deletions.
138 changes: 72 additions & 66 deletions keps/sig-node/1769-memory-manager/README.md
@@ -1,65 +1,66 @@
# KEP-1769: Memory Manager

<!-- toc -->
- [KEP-1769: Memory Manager](#kep-1769-memory-manager)
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Overview](#design-overview)
- [User Stories](#user-stories)
- [Story 1 : High-performance packet processing with DPDK](#story-1--high-performance-packet-processing-with-dpdk)
- [Story 2 : Databases](#story-2--databases)
- [Story 3 : KubeVirt (provided by @rmohr)](#story-3--kubevirt-provided-by-rmohr)
- [Risks and Mitigations](#risks-and-mitigations)
- [UX](#ux)
- [Design Details](#design-details)
- [How to enable the guaranteed memory allocation over many NUMA nodes?](#how-to-enable-the-guaranteed-memory-allocation-over-many-numa-nodes)
- [The Concept of Node Map and Memory Maps](#the-concept-of-node-map-and-memory-maps)
- [Memory Map](#memory-map)
- [Memory Maps at start-up (with examples)](#memory-maps-at-start-up-with-examples)
- [Memory Maps at runtime (with examples)](#memory-maps-at-runtime-with-examples)
- [Simulation - how the Memory Manager works? (by examples)](#simulation---how-the-memory-manager-works-by-examples)
- [Slide "Reject Pod2"](#slide-reject-pod2)
- [Hints Generation for Topology Manager](#hints-generation-for-topology-manager)
- [New Flags and Configuration of the Memory Manager](#new-flags-and-configuration-of-the-memory-manager)
- [Feature Gate Flag](#feature-gate-flag)
- [Memory Manager Policy Flag](#memory-manager-policy-flag)
- [Reserved Memory Flag](#reserved-memory-flag)
- [New Interfaces](#new-interfaces)
- [How this proposal affects the kubelet ecosystem?](#how-this-proposal-affects-the-kubelet-ecosystem)
- [Container Manager](#container-manager)
- [Topology Manager](#topology-manager)
- [Internal Container Lifecycle](#internal-container-lifecycle)
- [Test Plan](#test-plan)
- [Single-NUMA Systems Tests](#single-numa-systems-tests)
- [Multi-NUMA System Tests](#multi-numa-system-tests)
- [Graduation Criteria](#graduation-criteria)
- [Phase 1: Alpha (target v1.21)](#phase-1-alpha-target-v121)
- [Phase 2: Beta (target v1.22)](#phase-2-beta-target-v122)
- [GA (stable)](#ga-stable)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Appendix](#appendix)
- [Related Features](#related-features)
- [Related issues](#related-issues)
- [Kubernetes Node's Memory Management Mechanisms and their relation to the Memory Manager](#kubernetes-nodes-memory-management-mechanisms-and-their-relation-to-the-memory-manager)
- [Mechanism I (pod eviction by kubelet)](#mechanism-i-pod-eviction-by-kubelet)
- [Mechanism II (Out-of-Memory (OOM) killer by kernel/OS)](#mechanism-ii-out-of-memory-oom-killer-by-kernelos)
- [Mechanism III (obey cgroup limit, by OOM killer)](#mechanism-iii-obey-cgroup-limit-by-oom-killer)
- [Windows considerations](#windows-considerations)
- [Kubelet memory management](#kubelet-memory-management)
<!-- /toc -->

## Release Signoff Checklist
@@ -782,13 +783,18 @@ The Memory Manager sets and enforces cgroup memory limit for ("on behalf of") a

### Windows considerations

NUMA nodes cannot be directly assigned or guaranteed via the Windows API. A further limitation is that [PROC_THREAD_ATTRIBUTE_PREFERRED_NODE](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute) cannot be set on a Job object (i.e. a container) and only supports a single NUMA node.
The `PROC_THREAD_ATTRIBUTE_PREFERRED_NODE` attribute works by assigning a workload to a NUMA node via CPU affinity: the API finds all processors associated with the NUMA node and applies CPU affinity to those processors, which results in memory being allocated from that NUMA node.
To support multiple NUMA nodes and apply NUMA affinity to Job objects, the container runtime is expected to mimic the behavior of
[PROC_THREAD_ATTRIBUTE_PREFERRED_NODE](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute)
by finding the processors associated with the NUMA nodes passed via the CRI API and setting the preferred affinity on the Job object.

Combined with the Memory Manager's internal mapping, this should provide the desired behavior in most cases. It is still possible for a CPU to access memory from a
NUMA node other than the one it is running in, resulting in decreased performance. For this reason, we will add documentation, a warning log message in kubelet, and a warning event
to help raise awareness of this possibility. If access from CPUs outside the assigned NUMA node is undesirable, then the `single-numa-node` Topology Manager policy
should be configured together with the CPU Manager, which forces kubelet to select a NUMA node only if it has enough memory
and CPUs available. In the future, for workloads that span multiple NUMA nodes, it may be desirable for Topology Manager to have a new Windows-specific
policy. This would require a separate KEP to add a new policy.
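The affinity computation the runtime would perform can be sketched as follows. This is a minimal illustration, not the actual runtime implementation: the NUMA-node-to-processor mapping is hypothetical static data (on Windows it would be discovered with `GetNumaNodeProcessorMaskEx`), and `affinityMask` is a made-up helper name. The resulting bitmask is what the runtime would apply to the container's Job object (e.g. via `SetInformationJobObject` with `JOB_OBJECT_LIMIT_AFFINITY`).

```go
package main

import "fmt"

// numaProcessors maps a NUMA node ID to the logical processors on that node.
// On Windows this would be discovered with GetNumaNodeProcessorMaskEx;
// the values here are hypothetical, for illustration only.
var numaProcessors = map[int][]int{
	0: {0, 1, 2, 3},
	1: {4, 5, 6, 7},
}

// affinityMask builds the combined CPU-affinity bitmask for the NUMA nodes
// requested via the CRI, mirroring what PROC_THREAD_ATTRIBUTE_PREFERRED_NODE
// does for a single node: every processor on each requested node is included.
func affinityMask(nodes []int) (uint64, error) {
	var mask uint64
	for _, n := range nodes {
		procs, ok := numaProcessors[n]
		if !ok {
			return 0, fmt.Errorf("unknown NUMA node %d", n)
		}
		for _, p := range procs {
			mask |= 1 << uint(p)
		}
	}
	return mask, nil
}

func main() {
	// Request both nodes: processors 0-7 are unioned into one mask.
	mask, err := affinityMask([]int{0, 1})
	if err != nil {
		panic(err)
	}
	fmt.Printf("0x%X\n", mask)
}
```

A single-node request (e.g. `affinityMask([]int{1})`) yields a mask covering only that node's processors, matching the single-node behavior of `PROC_THREAD_ATTRIBUTE_PREFERRED_NODE`.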

#### Kubelet memory management

