Update troubleshooting to include the issue of etcd upgrade #43902

See [Enabling signed kubelet serving certificates](/docs/tasks/administer-cluste
to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.

Also see [How to run the metrics-server securely](https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-to-run-metrics-server-securely).

## Upgrade fails due to etcd hash not changing

Only applicable to upgrades from kubeadm v1.28.0, v1.28.1 or v1.28.2 to v1.28.3 or later.
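
If you are not sure which versions are involved, here is a quick check (a sketch, assuming the default `kubeadm-config` ConfigMap that kubeadm writes into the cluster):
```bash
# Version of the kubeadm binary that will perform the upgrade.
kubeadm version -o short

# Kubernetes version currently recorded by kubeadm in the cluster.
kubectl -n kube-system get configmap kubeadm-config \
  -o jsonpath='{.data.ClusterConfiguration}' | grep kubernetesVersion
```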

Here is a snippet of the error message that you may see:
```
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
I0907 10:10:09.109104 3704 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.6:2379/ https://172.17.0.4:2379/ https://172.17.0.3:2379/]) are available 1/10
[upgrade/etcd] Etcd was rolled back and is now available
static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
cmd/kubeadm/app/phases/upgrade/staticpods.go:525
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
cmd/kubeadm/app/phases/upgrade/staticpods.go:254
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
cmd/kubeadm/app/phases/upgrade/staticpods.go:338
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.StaticPodControlPlane
cmd/kubeadm/app/phases/upgrade/staticpods.go:465
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.PerformStaticPodUpgrade
cmd/kubeadm/app/phases/upgrade/staticpods.go:617
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.PerformControlPlaneUpgrade
cmd/kubeadm/app/cmd/upgrade/apply.go:216
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.runApply
cmd/kubeadm/app/cmd/upgrade/apply.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.newCmdApply.func1
cmd/kubeadm/app/cmd/upgrade/apply.go:74
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
...
```
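
As the log states, kubeadm rolls etcd back, so the original etcd Pods should still be running. You can verify this with a quick check (assuming the default `component=etcd` label that kubeadm puts on its etcd static Pods):
```bash
# List the etcd static Pods managed by kubeadm and confirm they are Running.
kubectl -n kube-system get pods -l component=etcd -o wide
```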

The reason for this error is that [PR 118867](https://github.com/kubernetes/kubernetes/pull/118867) accidentally imported the package `k8s.io/kubernetes/pkg/apis/core/v1`, which makes the kubeadm patch versions containing that change generate the etcd Pod manifest with defaulted attributes. This produces a diff during the manifest comparison, so kubeadm expects the Pod hash to change, while the kubelet never updates the hash.
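
If you want to observe this yourself, the hash in question is exposed through the mirror Pod that the kubelet creates for the etcd static Pod. A hedged sketch, using the node name from the example log above (replace it with your own control plane node name):
```bash
# Print the kubelet-reported config hash annotation of the etcd mirror Pod.
# During a normal upgrade this value changes; in this failure mode it does not.
NODE_NAME=kinder-upgrade-control-plane-1   # example node name; use your own
kubectl -n kube-system get pod "etcd-${NODE_NAME}" -o yaml | grep 'kubernetes.io/config.hash'
```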

The affected kubeadm versions are v1.28.0, v1.28.1 and v1.28.2. The fix is included in v1.28.3, see [PR 120605](https://github.com/kubernetes/kubernetes/pull/120605), so the same issue should not be seen when upgrading between versions v1.28.3 and later.

There are two ways to work around this issue if you see the same problem in your cluster.
- Skip the etcd upgrade. In this case etcd does not actually need to be upgraded, because nothing changes apart from the defaulted attributes, which should be ignored. You can skip the etcd upgrade like this:
```bash
kubeadm upgrade apply [version] --etcd-upgrade=false
```
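
  For example, if the target version were v1.28.3, the command would be `kubeadm upgrade apply v1.28.3 --etcd-upgrade=false` (the version here is only an illustration).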

- Patch the problematic etcd manifest to remove the defaulted attributes. The diff below shows the fields to remove:

```patch
diff --git a/etc/kubernetes/manifests/etcd_defaults.yaml b/etc/kubernetes/manifests/etcd_origin.yaml
index d807ccbe0aa..46b35f00e15 100644
--- a/etc/kubernetes/manifests/etcd_defaults.yaml
+++ b/etc/kubernetes/manifests/etcd_origin.yaml
@@ -43,7 +43,6 @@ spec:
         scheme: HTTP
       initialDelaySeconds: 10
       periodSeconds: 10
-      successThreshold: 1
       timeoutSeconds: 15
     name: etcd
     resources:
@@ -59,26 +58,18 @@ spec:
         scheme: HTTP
       initialDelaySeconds: 10
       periodSeconds: 10
-      successThreshold: 1
       timeoutSeconds: 15
-    terminationMessagePath: /dev/termination-log
-    terminationMessagePolicy: File
     volumeMounts:
     - mountPath: /var/lib/etcd
       name: etcd-data
     - mountPath: /etc/kubernetes/pki/etcd
       name: etcd-certs
-  dnsPolicy: ClusterFirst
-  enableServiceLinks: true
   hostNetwork: true
   priority: 2000001000
   priorityClassName: system-node-critical
-  restartPolicy: Always
-  schedulerName: default-scheduler
   securityContext:
     seccompProfile:
       type: RuntimeDefault
-  terminationGracePeriodSeconds: 30
   volumes:
   - hostPath:
       path: /etc/kubernetes/pki/etcd
```
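
A minimal sketch of applying this workaround on a control plane node, assuming the default kubeadm manifest path `/etc/kubernetes/manifests/etcd.yaml`; back the file up before editing it:
```bash
# Keep a backup of the current etcd static Pod manifest.
sudo cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak

# Edit the manifest and delete the defaulted fields shown in the diff above,
# then retry the upgrade.
sudo vim /etc/kubernetes/manifests/etcd.yaml
```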

The relevant issue is [kubeadm issue 2927](https://github.com/kubernetes/kubeadm/issues/2927); you can find more details there.