
Cluster Autoscaler not backing off exhausted node group #6240

Open

elohmeier opened this issue Nov 1, 2023 · 5 comments · May be fixed by #6750
Labels
area/cluster-autoscaler, area/provider/hetzner, kind/bug

Comments

elohmeier commented Nov 1, 2023

Which component are you using?:

Cluster Autoscaler

What version of the component are you using?:

Component version: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.26.6+k3s1
Kustomize Version: v4.5.7
Server Version: v1.26.6+k3s1

What environment is this in?:

Hetzner Cloud

What did you expect to happen?:

When the cluster autoscaler is configured with the priority expander and multiple node groups of differing priorities are provided, it should back off after some time if the cloud provider fails to provision nodes in the high-priority node group due to resource unavailability, and then proceed to the lower-priority node groups.

What happened instead?:

The high-priority node group (pool1 in the log below) currently has no resources available to provision the requested nodes.
The cluster autoscaler is stuck in a loop trying to provision nodes in the high-priority group and never proceeds to pool2 (lower priority, resources available). I've also tried setting --max-node-group-backoff-duration=1m, with no effect.

W1101 05:15:05.399825       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:15.806488       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:15.806519       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:15.806525       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:15.808727       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:16.068533       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.079704       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.126786       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
W1101 05:15:16.126816       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:26.655179       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:26.655243       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:26.655257       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:26.660093       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:26.948368       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:26.981452       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:27.044150       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)

How to reproduce it (as minimally and precisely as possible):

apiVersion: v1
data:
  priorities: |
    10:
      - pool2
    20:
      - pool1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - command:
        - ./cluster-autoscaler
        - --scale-down-unneeded-time=5m
        - --cloud-provider=hetzner
        - --stderrthreshold=info
        - --nodes=0:4:CCX43:FSN1:pool1
        - --nodes=0:4:CCX43:NBG1:pool2
        - --expander=priority
        env:
        - name: HCLOUD_IMAGE
          value: debian-11
        - name: HCLOUD_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hcloud
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4
        name: cluster-autoscaler
      serviceAccountName: cluster-autoscaler

Anything else we need to know?:

elohmeier added the kind/bug label Nov 1, 2023
k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 31, 2024
elohmeier (Author) commented:

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Feb 13, 2024
towca added the area/cluster-autoscaler and area/provider/hetzner labels Mar 21, 2024
apricote (Member) commented Apr 2, 2024

Is there something we need to do on the cloud provider side to make this possible? Otherwise this looks like an area/core-autoscaler issue.

tallaxes commented:

@apricote Looking at the relevant provider code, it seems possible that on failure it logs a message but neglects to return the error from IncreaseSize. This means the core autoscaler has no indication that the scale-up failed. (More generally, this could affect any provider that neglects to report an error from IncreaseSize ...)
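
A minimal, self-contained Go sketch of that pattern for illustration (the nodeGroup and createServer names are hypothetical stand-ins, not the actual hetzner_node_group.go code): failures from the concurrent create calls are collected and returned from IncreaseSize instead of only being logged, which is the signal the core autoscaler needs to register the failed scale-up and back the group off.

// Minimal sketch of the pattern above. nodeGroup/createServer are
// hypothetical stand-ins, not the real hetzner_node_group.go code.
package main

import (
	"fmt"
	"sync"
)

type nodeGroup struct {
	id         string
	targetSize int
}

// createServer stands in for the cloud API call that can fail with
// "resource_unavailable".
func (n *nodeGroup) createServer() error {
	return fmt.Errorf("could not create server: resource_unavailable")
}

// IncreaseSize creates delta servers concurrently and returns the first
// error it encountered instead of swallowing it after logging.
func (n *nodeGroup) IncreaseSize(delta int) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for i := 0; i < delta; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := n.createServer(); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	if firstErr != nil {
		// Returning the error (rather than only logging it) is what lets the
		// core autoscaler mark the scale-up as failed and back this group off.
		return fmt.Errorf("failed to increase node group %q: %w", n.id, firstErr)
	}
	n.targetSize += delta
	return nil
}

func main() {
	ng := &nodeGroup{id: "pool1"}
	if err := ng.IncreaseSize(3); err != nil {
		fmt.Println(err) // the caller now sees the failure and can back off
	}
}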

apricote linked a pull request Apr 24, 2024 that will close this issue
apricote (Member) commented:

Thanks for the hint @tallaxes! I opened a PR to properly return encountered errors.
