Cluster validation cannot complete if metrics-server addon is enabled and there are less than 2 non-master nodes #16585

Open
shapirus opened this issue May 21, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

shapirus commented May 21, 2024

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

1.28.4

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.28.10

3. Relevant cluster manifest portion:

  metricsServer:
    enabled: true
    insecure: false

9. Anything else do we need to know?
If the metrics server addon is enabled and there are fewer than two non-master nodes, cluster validation fails indefinitely:

I0521 18:26:05.985979   76254 instancegroups.go:563] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "metrics-server-5c45c474f5-t8ppf" is pending.

The reason for this is that the deployment manifest specifies that there must be two replicas and topology spread constraints are defined in such a way that these two replicas must run on different nodes:

spec:
  ...
  replicas: 2
  ...
  template:
  ...
    spec:
      ...
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            k8s-app: metrics-server
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchLabels:
            k8s-app: metrics-server
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule

If we have only one non-master node, one of the pods stays in the Pending state forever: the only other node is the master, which is tainted, and metrics-server doesn't have a matching toleration.

metrics-server-5c45c474f5-cb89k               1/1     Running   0             107m
metrics-server-5c45c474f5-t8ppf               0/1     Pending   0             107m

A cluster with just one worker node and the metrics server addon enabled is a valid use case, so a manifest that prevents the cluster from validating successfully in this scenario (e.g., during kops rolling-update cluster) should be considered a bug.

An ideal solution would be to make both the number of replicas and the maxSkew parameters configurable in the cluster spec. A less ideal option would be to make only one of them configurable, or to hardcode relaxed topology spread constraints permanently and call it a day.
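To illustrate the "relaxed constraints" option, here is a rough sketch of what the deployment could look like if the hostname constraint were made soft. This is an assumption for discussion, not the manifest kOps actually ships and not a tested patch; the only change from the fragment quoted above is the last whenUnsatisfiable value.

```yaml
# Sketch only: a possible relaxed variant of the metrics-server deployment,
# not the manifest generated by kOps.
spec:
  replicas: 2
  template:
    spec:
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            k8s-app: metrics-server
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchLabels:
            k8s-app: metrics-server
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        # Changed from DoNotSchedule: the scheduler still prefers to spread
        # replicas across nodes, but may co-locate them when no other
        # schedulable node exists.
        whenUnsatisfiable: ScheduleAnyway
```

With a soft constraint, both replicas should be able to land on the single worker node, so neither stays Pending; on clusters with several workers the scheduler would still try to spread them.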

Another approach is to stop treating metrics-server pods as system-cluster-critical, because they aren't all that critical really.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 21, 2024