Cluster validation cannot complete if metrics-server addon is enabled and there are less than 2 non-master nodes #16585

shapirus · 2024-05-21T15:39:25Z

/kind bug

1. What kops version are you running? The command kops version will display
this information.
1.28.4

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.28.10

3. Relevant cluster manifest portion:

  metricsServer:
    enabled: true
    insecure: false

9. Anything else we need to know?
If the metrics server addon is enabled and there are fewer than two non-master nodes, cluster validation fails indefinitely:

I0521 18:26:05.985979   76254 instancegroups.go:563] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "metrics-server-5c45c474f5-t8ppf" is pending.

The reason for this is that the deployment manifest specifies that there must be two replicas and topology spread constraints are defined in such a way that these two replicas must run on different nodes:

spec:
  ...
  replicas: 2
  ...
  template:
    ...
    spec:
      ...
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            k8s-app: metrics-server
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchLabels:
            k8s-app: metrics-server
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule

With only one non-master node, one of the pods stays in the Pending state forever, because the only other node is the master, which is tainted, and metrics-server has no toleration for that taint:

metrics-server-5c45c474f5-cb89k               1/1     Running   0             107m
metrics-server-5c45c474f5-t8ppf               0/1     Pending   0             107m
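
The scheduling outcome above can be reproduced with a toy model of the hostname constraint. This is only a sketch, not the real scheduler: the `Node` class and `schedule` function are hypothetical names, and it assumes that tainted nodes still count as topology domains when the skew is computed (which I believe is the Kubernetes default, `nodeTaintsPolicy: Ignore`) while remaining unschedulable for pods without a toleration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    tainted: bool = False  # True models the master taint that metrics-server does not tolerate
    pods: int = 0          # metrics-server pods already placed on this node


def schedule(nodes, replicas, max_skew=1):
    """Place replicas one at a time under a DoNotSchedule hostname constraint.

    Returns the number of replicas left Pending.
    """
    pending = 0
    for _ in range(replicas):
        # Assumption: every node is a domain for the skew calculation, tainted or not.
        global_min = min(n.pods for n in nodes)
        candidates = [
            n for n in nodes
            if not n.tainted and (n.pods + 1) - global_min <= max_skew
        ]
        if not candidates:
            pending += 1
            continue
        # Pick the least loaded eligible node.
        min(candidates, key=lambda n: n.pods).pods += 1
    return pending


one_worker = [Node("master", tainted=True), Node("worker-1")]
two_workers = [Node("master", tainted=True), Node("worker-1"), Node("worker-2")]

pending_one = schedule(one_worker, replicas=2)
pending_two = schedule(two_workers, replicas=2)
print(pending_one, pending_two)
```

Under these assumptions, the second replica on the single worker would raise the skew to 2 against the empty master, so it has nowhere to go; adding a second worker lets both replicas schedule.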

Running a cluster with just one worker node and the metrics server addon enabled is a valid use case. A manifest that prevents the cluster from validating successfully in this scenario (e.g., on kops rolling-update cluster) should be considered a bug.

An ideal solution would be to make both the number of replicas and the maxSkew parameter configurable in the cluster spec. A less ideal one would be to make only one of them configurable, or to hardcode relaxed topology spread constraints permanently and call it a day.
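
To illustrate what "relaxed topology spread constraints" could look like, here is a small sketch that rewrites the hostname constraint from the manifest above so that it becomes a preference rather than a hard requirement. The function name `relax_hostname_spread` is hypothetical, the dict only mirrors the fragment quoted in this issue, and this is not an existing kops option. `ScheduleAnyway` itself is a standard value of `whenUnsatisfiable`.

```python
import copy


def relax_hostname_spread(deployment: dict) -> dict:
    """Return a copy of the manifest whose hostname spread constraint is soft."""
    patched = copy.deepcopy(deployment)
    pod_spec = patched["spec"]["template"]["spec"]
    for constraint in pod_spec.get("topologySpreadConstraints", []):
        if constraint.get("topologyKey") == "kubernetes.io/hostname":
            # Prefer spreading across nodes, but do not block scheduling.
            constraint["whenUnsatisfiable"] = "ScheduleAnyway"
    return patched


# Only the fields quoted in the issue; the rest of the manifest is omitted.
metrics_server = {
    "spec": {
        "replicas": 2,
        "template": {
            "spec": {
                "topologySpreadConstraints": [
                    {
                        "labelSelector": {"matchLabels": {"k8s-app": "metrics-server"}},
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "ScheduleAnyway",
                    },
                    {
                        "labelSelector": {"matchLabels": {"k8s-app": "metrics-server"}},
                        "maxSkew": 1,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "DoNotSchedule",
                    },
                ]
            }
        },
    }
}

relaxed = relax_hostname_spread(metrics_server)
constraints = relaxed["spec"]["template"]["spec"]["topologySpreadConstraints"]
print([c["whenUnsatisfiable"] for c in constraints])
```

With the hostname constraint softened this way, both replicas could land on the single worker. Note that changes applied by hand to an addon managed by kops may be overwritten, which is why a setting in the cluster spec or a changed default would be preferable.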

Another approach is to stop treating metrics-server pods as system-cluster-critical, because they aren't all that critical really.


k8s-ci-robot added the kind/bug label May 21, 2024
