
Pods not syncing quickly: pod-syncer Error updating pod: context deadline exceeded #1765

Open
alextricity25 opened this issue May 10, 2024 · 7 comments


alextricity25 commented May 10, 2024

What happened?

I'm experiencing a rather strange bug where some pods appear stuck in the "Init: 0/1" status, but only when connected to the vcluster context. When connected to the host cluster context, the pod status reports correctly; in the vcluster context, however, pods occasionally get stuck in either "PendingCreate" or "Init: 0/1", which causes downstream issues with my Helm chart installation flow. The pod events while connected to the vcluster context show the following:

Warning  SyncError  13m   pod-syncer         Error updating pod: context deadline exceeded

The image below shows the bug in action with two panes. The left pane is K9s connected to the vcluster context, and the right is connected to the host cluster context. As you can see, the pods in the host cluster are "Running", but the same pods in the vcluster context are stuck in the "Init:0/1" status.
[screenshot: K9s in two panes — vcluster context (left) vs. host cluster context (right)]

Looking at the pod events, I see the following:
[screenshot: pod events showing the pod-syncer SyncError]

The pod-syncer error only appears when connected to the vcluster.

The only "error" that I noticed in the vcluster syncer logs is:

filters/wrap.go:54	timeout or abort while handling: method=GET URI="/api/v1/namespaces/xrdm/pods/xxxx-portal-worker-66bc6969b-r4qqr/log?container=xxxx-portal-worker&follow=true&tailLines=100&timestamps=true" audit-ID="1d310b7a-7010-4c0e-a116-7b4127d94193"

What did you expect to happen?

I expect the pod status seen while connected to the vcluster context to reflect the correct status.

How can we reproduce it (as minimally and precisely as possible)?

  1. Install vcluster version v0.20.0-beta.5 using the Helm chart.
  2. Connect to the vcluster and create a batch of pods.
  3. Observe that the pod-syncer is not updating pod status as intended (a rough sketch of these steps follows this list).
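A minimal sketch of those steps, assuming the loft.sh Helm repository and placeholder names (my-vcluster, vcluster-test, and stress-test are made up for illustration):

helm repo add loft https://charts.loft.sh
helm upgrade --install my-vcluster loft/vcluster \
  --namespace vcluster-test --create-namespace \
  --version 0.20.0-beta.5

# vcluster connect switches the current kube-context to the virtual cluster
vcluster connect my-vcluster -n vcluster-test

# Create a batch of pods to exercise the pod-syncer, then watch their status
kubectl create deployment stress-test --image=nginx --replicas=30
kubectl get pods -w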

Anything else we need to know?

The pods, while connected to the vcluster context, will eventually report the correct status, but sometimes it takes 5 or 10 minutes before they do.
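A crude way to measure that lag, assuming two kubeconfig contexts with placeholder names host-ctx and vcluster-ctx, is to timestamp the watch output from both sides and compare when each side reports the transition:

kubectl --context host-ctx get pods -w --no-headers | while read line; do echo "$(date +%T) host:     $line"; done &
kubectl --context vcluster-ctx get pods -w --no-headers | while read line; do echo "$(date +%T) vcluster: $line"; done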

Host cluster Kubernetes version

$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0

Host cluster Kubernetes distribution

GKE-1.29.0

vcluster version

v0.20.0-beta.5

vcluster Kubernetes distribution (k3s (default), k8s, k0s)

k8s

OS and Arch

OS:  GKE containerd image
Arch:
@heiko-braun
Contributor

Let me summarise: the pods get scheduled on the host, but the vcluster api server doesn’t reflect the status correctly?

Do you observe significant load (i.e. request latency, total requests increased) on the host api server when this happens?
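A rough way to spot-check that, assuming permission to read the host API server's /metrics endpoint (host-ctx is a placeholder context name):

# Request latency and volume as seen by the host API server
kubectl --context host-ctx get --raw /metrics | grep -E 'apiserver_request_duration_seconds_(sum|count)' | head

# Overall node headroom on the host (requires metrics-server on the host)
kubectl --context host-ctx top nodes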

@everflux

Looks similar to #1589 to me

@alextricity25
Author

@heiko-braun

Let me summarise: the pods get scheduled on the host, but the vcluster api server doesn’t reflect the status correctly?

That's correct!

Do you observe significant load (i.e. request latency, total requests increased) on the host api server when this happens?

I do not. The vcluster pods on the host cluster are given a good amount of resources: 3 vCPUs and 4Gi of memory. Usually the node this vcluster pod is on is nowhere near these limits.

@everflux

Looks similar to #1589 to me

Yes, indeed! I suppose this issue can be marked as a duplicate. Thanks for catching that!

@FabianKramm
Member

FabianKramm commented May 14, 2024

@alextricity25 would you mind trying virtual Kubernetes version v1.29.4 as there was a pretty significant bug in v1.29.0 that caused issues (kubernetes/kubernetes#123448), which could be the problem for this

@alextricity25
Author

@alextricity25 would you mind trying virtual Kubernetes version v1.29.4 as there was a pretty significant bug in v1.29.0 that caused issues (kubernetes/kubernetes#123448), which could be the problem for this

@FabianKramm The default is v1.29.0. I'll override this to 1.29.4 to see if that does anything. It's difficult to iterate and test whether a given change fixes this issue, because it doesn't happen all the time. I'll let you know!
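For reference, a sketch of how that override might look with the v0.20 chart; the exact values key has changed between releases, so treat the path below as an assumption and verify it against the chart's vcluster.yaml reference:

helm upgrade --install my-vcluster loft/vcluster \
  --namespace vcluster-test \
  --version 0.20.0-beta.5 \
  --set controlPlane.distro.k8s.version=v1.29.4  # key path is an assumption; check the chart values

# Confirm the virtual API server version from inside the vcluster
vcluster connect my-vcluster -n vcluster-test -- kubectl version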

@everflux

@FabianKramm I tested 1.29.4 (Server Version: v1.29.4+k3s1, I think embedded db/sqlite) with helm chart v0.20.0-beta.5 and still observed the issue once the metrics server was present in the host cluster. (host is 1.18.2)
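For anyone correlating this with the metrics server, a quick way to check whether metrics-server is active on the host (it typically lives in kube-system, but the deployment name and namespace can vary by distribution; host-ctx is a placeholder):

kubectl --context host-ctx -n kube-system get deployment metrics-server
kubectl --context host-ctx get apiservices v1beta1.metrics.k8s.io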

@alextricity25
Author

@FabianKramm I also observed the issue again on 1.29.5. Screenshot below. The pods in the vCluster context eventually did report the correct status, but it took about 1-2 minutes before they did.
[screenshot: the issue reproduced on 1.29.5]
