diff --git a/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md b/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md index adab1623e03..b8d3a676641 100644 --- a/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md +++ b/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md @@ -1,4 +1,4 @@ -# KEP-3017: Pod Healthy Policy for PDBs +# KEP-3017: Unhealthy Pod Eviction Policy for PDBs - [Release Signoff Checklist](#release-signoff-checklist) @@ -10,7 +10,6 @@ - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [API](#api) - - [Changes to the disruption controller](#changes-to-the-disruption-controller) - [Changes to the eviction API](#changes-to-the-eviction-api) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) @@ -34,6 +33,10 @@ - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) +- [Abandoned Alternative Implementation](#abandoned-alternative-implementation) + - [Changes to the disruption controller](#changes-to-the-disruption-controller) + - [Changes to the definition of healthy in a PDB according to the policy used.](#changes-to-the-definition-of-healthy-in-a-pdb-according-to-the-policy-used) +- [Future Work](#future-work) ## Release Signoff Checklist @@ -76,11 +79,11 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Pod Disruption Budgets currently doesn't provide a way for users to specify how to handle -pods that are Running, but not Ready. In this KEP, we add a new field `podHealthyPolicy` that -allows users to specify whether those pods should be considered healthy and therefore -covered by the constraints of the Pod Disruption Budget, or if they should be considered as -already disrupted and thus not covered by a Pod Disruption Budget. +Pod Disruption Budgets currently don't provide a way for users to specify how to handle +pods that are Running, but not Healthy (Ready). +In this KEP, we add a new field `unhealthyPodEvictionPolicy` that allows users to specify +what should happen to these not Healthy (Ready) pods. Whether they should be always evicted or kept +in case the application guarded by a Pod Disruption Budget is not available and disrupted. ## Motivation @@ -93,7 +96,7 @@ soon-to-be evicted pod has been copied/shared/replicated to other pod(s). Both use-cases have rough edges with the current implementation. For users who only want to make sure a minimum number of pods are available, it is possible -to end up in situations where pods that are Running but not Ready can not be evicted, +to end up in situations where pods that are Running but not Healthy (Ready) can not be evicted, even when the total number of pods are higher than the threshold set in the PDB (https://github.com/kubernetes/kubernetes/issues/72320). This can block automated tooling like the cluster-autoscaler and draining of nodes. @@ -109,26 +112,36 @@ doesn't provide any alternatives solutions for this problem. ### Goals -- Prevent PDBs from deadlocking eviction due to non-Ready pods. +- Prevent PDBs from deadlocking eviction due to non-Healthy (non-Ready) pods. - Make sure users who rely on PDBs for data-safety can continue to do so. ### Non-Goals - Providing a safe solution for preventing data-loss. Not because this isn't important, but it is unclear if PDB is the right tool for this. +- Allow customization of healthiness detection for pods guarded by a PodDisruptionBudget. ## Proposal -The core issue here is whether a pod that is Running but not Ready is already disrupted, -and thus can be evicted without being constrained by a potential Pod Disruption Budget. Currently, -the disruption controller only considers Running and Ready pods as healthy, thus that is -the basis for computing `allowedDisruptions`. The disruption API on the other hand, does require -that `allowedDisruptions` is larger than 0 to evict a pod that is Running but not Ready. +The core issue here is whether a pod that is Running but not Healthy (Ready) is considered disrupted, +and thus should be evicted without being potentially constrained by a Pod Disruption Budget. + +Currently, we only allow evicting Running pods in case there are enough pods healthy +(`.status.currentHealthy` is at least equal to `.status.DesiredHealthy`). +This is to give the application best change to achieve availability and prevent data loss +by disallowing disruption of starting pods that have not become Healthy (Ready yet). -Adding a `podHealthyPolicy` field on the PDB API will allow the user to specify which -behavior that are desired, and this will be consistently handled by the disruption controller, eviction -API, and any other APIs that might use PDBs. If a `podHealthyPolicy` is not provided, the default -will be the current behavior. +We also want to allow unconditional eviction of Running pods for applications that do not have +such strict constraints. This will allow cluster administrators to evict misbehaving applications +that are guarded by a PDB and proceed with node drain. + +Adding a `unhealthyPodEvictionPolicy` field on the PDB API will allow the user to specify which +behavior is desired. This will be consistently handled by the eviction API, and any other APIs +that might use PDBs. If a `unhealthyPodEvictionPolicy` is not provided, the default will be +the current behavior. + +The behavior for pods in Pending, Succeeded or Failed phase will stay the same and such pods will +always be considered for eviction. ### Risks and Mitigations @@ -141,49 +154,51 @@ will be the current behavior. type PodDisruptionBudgetSpec struct { ... - - // PodHealthyPolicy defines the criteria for when the disruption controller - // should consider a pod to be healthy. - // If no policy is specified, the legacy behavior will be used. It means - // only pods that are Running and Ready will be considered when the disruption - // controller computes "disruptionsAllowed", but all pods in the Running phase - // will be subject to the PDB on eviction. - // +optional - PodHealthyPolicy PodHealthyPolicy `json:"podHealthyPolicy,omitempty" protobuf:"bytes,4,opt,name=podHealthyPolicy"` + // UnhealthyPodEvictionPolicy defines the criteria for when unhealthy pods + // should be considered for eviction. Current implementation considers healthy pods, + // as pods that have status.conditions item with type="Ready",status="True". + // + // Valid policies are IfHealthyBudget and AlwaysAllow. + // If no policy is specified, the default behavior will be used, + // which corresponds to the IfHealthyBudget policy. + // + // Additional policies may be added in the future. + // Clients making eviction decisions should disallow eviction of unhealthy pods + // if they encounter an unrecognized policy in this field. + // + // This field is alpha-level. The eviction API uses this field when + // the feature gate PDBUnhealthyPodEvictionPolicy is enabled (disabled by default). + // +optional + UnhealthyPodEvictionPolicy *UnhealthyPodEvictionPolicyType `json:"unhealthyPodEvictionPolicy,omitempty" protobuf:"bytes,4,opt,name=unhealthyPodEvictionPolicy"` } -// PodHealthyPolicy defines the policy when a pod are considered healthy and therefore -// covered by a PodDisruptionBudget. -type PodHealthyPolicy string +// UnhealthyPodEvictionPolicyType defines the criteria for when unhealthy pods +// should be considered for eviction. +// +enum +type UnhealthyPodEvictionPolicyType string const ( - // PodReady policy means that only pods that are both Running and Ready - // will be considered healthy by the disruption controller. Any pods that - // are not Ready are considered to already be disrupted and therefore will - // not be counted when computing "disruptionsAllowed" and can be evicted - // regardless of whether the criteria in a PDB is met. - PodReady PodHealthyPolicy = "PodReady" - - // PodRunning policy means that pods that are in the Running phase - // is considered healthy by the disruption controller, regardless of - // whether they are Ready or not. Any pods that are in the Running - // phase will be counted when computing "disruptionsAllowed" and - // will be subject to the PDB for eviction. - PodRunning PodHealthyPolicy = "PodRunning" + // IfHealthyBudget policy means that running pods (status.phase="Running"), + // but not yet healthy can be evicted only if the guarded application is not + // disrupted (status.currentHealthy is at least equal to status.desiredHealthy). + // Healthy pods will be subject to the PDB for eviction. + IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget" + + // AlwaysAllow policy means that all running pods (status.phase="Running"), + // but not yet healthy are considered disrupted and can be evicted regardless + // of whether the criteria in a PDB is met. This means perspective running + // pods of a disrupted application might not get a chance to become healthy. + // Healthy pods will be subject to the PDB for eviction. + AlwaysAllow UnhealthyPodEvictionPolicyType = "AlwaysAllow" ) ``` -### Changes to the disruption controller - -The disruption controller will be updated to use `podHealthyPolicy` to -determine how it should compute the value of `disruptionsAllowed`. - ### Changes to the eviction API -The eviction API will be updated to use `podHealthyPolicy` to determine whether -a pod which is Running but not Ready can be evicted regardless of the value of +The eviction API will be updated to use `unhealthyPodEvictionPolicy` of a PDB to determine +whether a pod which is Running but not Ready can be evicted regardless of the value of `disruptionsAllowed`. This will only be a behavioral change when users have specified -a `podHealthyPolicy`, and will not require the actual API to change. +a `unhealthyPodEvictionPolicy`, and will not require the actual API to change. ### Test Plan @@ -207,7 +222,7 @@ Based on reviewers feedback describe what additional tests need to be added prio implementing this enhancement to ensure the enhancements have also solid foundations. --> -We assess that the disruption controller has adequate test coverage for places which might be impacted by +We assess that the eviction api has adequate test coverage for places which might be impacted by this enhancement. Thus, no additional tests prior implementing this enhancement are needed. @@ -222,8 +237,7 @@ together with explanation why this is acceptable. Unit tests covering: - The current behavior stays unchanged when the policy is not specified. - - Correct behavior for both policies in both the disruption controller and - the eviction API. + - Correct behavior for both policies in the eviction API. - Feature gate disablement. The core packages (with their unit test coverage) which are going to be modified during the implementation: -- `k8s.io/kubernetes/pkg/controller/disruption`: `5 October 2022` - `77.7%` -- `k8s.io/kubernetes/pkg/apis/policy/validation`: `5 October 2022` - `93%` +- `k8s.io/kubernetes/pkg/apis/policy/validation`: `5 October 2022` - `93%` - `k8s.io/kubernetes/pkg/apis/policy/v1`: `5 October 2022` - `60%` +- `k8s.io/kubernetes/pkg/registry/policy/poddisruptionbudget`: `8 November 2022` - `62.5%` +- `k8s.io/kubernetes/pkg/registry/core/pod/storage`: `8 November 2022` - `74.2%` + +Alpha implementation: +- `k8s.io/kubernetes/pkg/apis/policy/validation`: `7 December 2022` - `93.1%` +- `k8s.io/kubernetes/pkg/apis/policy/v1`: `7 December 2022` - `60%` +- `k8s.io/kubernetes/pkg/registry/policy/poddisruptionbudget`: `7 December 2022` - `75%` +- `k8s.io/kubernetes/pkg/registry/core/pod/storage`: `7 December 2022` - `78%` + ##### Integration tests Integration tests covering: - The current behavior stays unchanged when the policy is not specified. - - Correct behavior for both policies in both the disruption controller and - the eviction API. + - Correct behavior for both policies in the eviction API. - Feature gate disablement. @@ -289,11 +310,14 @@ We expect no non-infra related flakes in the last month as a GA graduation crite - Feature gate enabled by default. - Existing E2E and conformance tests passing. +- Consider whether we want to keep the `spec.unhealthyPodEvictionPolicy` field null by default when not specified + or default it to `IfHealthyBudget` value. Both of these should preserve original behavior and handle null + values. This should be tested and documented. #### GA - Every bug report is fixed. -- The disruption controller and the eviction API ignores the feature gate. +- The eviction API ignores the feature gate. #### Deprecation @@ -314,9 +338,8 @@ This feature doesn't depend on the version for nodes. ###### How can this feature be enabled / disabled in a live cluster? - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: PodDisruptionBudgetPodHealthyPolicy + - Feature gate name: PDBUnhealthyPodEvictionPolicy - Components depending on the feature gate: - - kube-controller-manager - kube-apiserver - [ ] Other - Describe the mechanism: @@ -327,18 +350,16 @@ This feature doesn't depend on the version for nodes. ###### Does enabling the feature change any default behavior? -No, the behavior is only changed when users specify the `podHealthyPolicy` in +No, the behavior is only changed when users specify the `unhealthyPodEvictionPolicy` in the PodDisruptionBudget spec. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes, in that case the disruption controller and the eviction API will just -use the default behavior. +Yes, in that case the eviction API will just use the default behavior. ###### What happens if we reenable the feature if it was previously rolled back? -The disruption controller and the eviction API will again start using -the `PodHealthyPolicy` if provided on a PDB. +The eviction API will again start using the `unhealthyPodEvictionPolicy` if provided on a PDB. ###### Are there any tests for feature enablement/disablement? @@ -353,21 +374,20 @@ on the PDB for behavior to be affected. ###### What specific metrics should inform a rollback? -Unexpected controller-manager crashes or significant changes in the latency or depth of the disruption controller -workqueue. +Unexpected failing eviction requests. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? A manual test will be performed, as follows: -1. Create a cluster in 1.23. -2. Upgrade to 1.24. -3. Create Deployment A and PDB A targeting the pods of Deployment A using the `PodReady` PodHealthyPolicy. -4. Downgrade to 1.23. -5. Verify that the disruption controller and eviction continue to work without using the PodHealthyPolicy. +1. Create a cluster in 1.25. +2. Upgrade to 1.26. +3. Create Deployment A and PDB A targeting the pods of Deployment A using the `AlwaysAllow` UnhealthyPodEvictionPolicy. +4. Downgrade to 1.25. +5. Verify that the eviction continue to work without using the UnhealthyPodEvictionPolicy. 6. Create another StatefulSet B and PDB B targeting the pods of StatefulSet B. -7. Upgrade to 1.24. -8. Verify that eviction of pods for Deployment A uses the `PodReady` PodHealthyPolicy and eviction of pods for +7. Upgrade to 1.26. +8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for StatefulSet B uses the default behavior. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? @@ -454,8 +474,7 @@ No ###### Will enabling / using this feature result in any new API calls? -No, both the disruption controller and the eviction API already fetch the -PDB from the API server. +No, the eviction API already fetch the PDB from the API server. ###### Will enabling / using this feature result in introducing new API types? @@ -491,7 +510,7 @@ details). For now, we leave it here. ###### How does this feature react if the API server and/or etcd is unavailable? -No change from the existing behavior of the disruption controller and the eviction API. +No change from the existing behavior of the eviction API. ###### What are other known failure modes? @@ -513,6 +532,8 @@ For each of them, fill in the following information by copying the below templat ## Implementation History - 2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24. +- 2022-11-11: Initial alpha implementation merged into 1.26 +- 2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy) ## Drawbacks @@ -526,3 +547,52 @@ Changing the default behavior was considered but rejected for two reasons: * There are two separate use-cases for this feature and changing the behavior to support only one of them would create problems for other users. +## Abandoned Alternative Implementation + +There is a noticeable difference to the original KEP as some behaviours were dropped. + +### Changes to the disruption controller + +We have removed changes to the disruption controller and computation of `disruptionsAllowed`. +We have kept the scope only to the eviction API. It is better to split these changes into separate features +in order to have simpler (less confusing), and more well-defined behavior for each feature. + +You can see possible followups for customizing the definition of healthiness in [Future Work](#future-work) + +### Changes to the definition of healthy in a PDB according to the policy used. + +We have decided that eviction policy should not change the meaning of a healthy pod as a single powerful field +could introduce more confusion into how it affects the status of PodDisruptionBudget and Eviction API. + +`PodRunning` policy was measuring running pods and changing the computation of `disruptionsAllowed`, +and it was removed from the original KEP. + +```golang +const ( + // PodRunning policy means that pods that are in the Running phase + // is considered healthy by the disruption controller, regardless of + // whether they are Ready or not. Any pods that are in the Running + // phase will be counted when computing "disruptionsAllowed" and + // will be subject to the PDB for eviction. + PodRunning PodHealthyPolicy = "PodRunning" +) +``` + +## Future Work + +The current implementation considers healthy pods, as pods that have `.status.conditions` item with `type="Ready"` and `status="True"`. +These pods are tracked via `.status.currentHealthy` field in the PDB status. + +This might not be enough for all use cases. For example the user might want to specifically handle pods that have their PVC +on a specific node's local storage. The pod should block the node from being drained and going down to prevent a possible data loss, +even in all situtations when the pod is not ready ([discussion](https://github.com/kubernetes/kubernetes/pull/105296#issuecomment-929503163)) + +To support this, a new custom mechanism for defining healthiness needs to be defined to optionally replace the default implementation. +This could be achieved with the help of user defined [Pod Readiness Gates](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/580-pod-readiness-gates/README.md), +by introducing a new field in a PodDisruptionBudget that could receive either a list of condition types or a logical expression +referencing these condition types to conclude whether the pod is healthy or not. +This field and other options should be explored in an additional KEP. + +The disruption controller would update the existing fields in a PodDisruptionBudget status based on the custom healthiness. +The eviction API would react to the existing fields in the same way as it does now, +and in a combination with here proposed `PDBUnhealthyPodEvictionPolicy` feature.