[BUG] When node is lost, its pods can't recover #5830
Comments
Theoretically, draining the node should help.
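For reference, a drain of the affected node would look roughly like this (a sketch; the node name is a placeholder and the exact flags depend on your workloads):

```sh
# Evict the pods from the lost node so they can be rescheduled elsewhere.
# --ignore-daemonsets skips DaemonSet-managed pods, which cannot be evicted.
# --delete-emptydir-data allows evicting pods that use emptyDir volumes.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```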
It's interesting that gitlab is not in the list of "problematic" pods - it doesn't have local storage. But it is still not migrated / not restarted.
Hi @SlavikCA, did the gitlab pod run on the Harvester cluster or on a downstream cluster? Could you generate a support bundle for investigation?
Everything is running on the Harvester cluster; I don't have any downstream cluster. Every PVC is ReadWriteOnce, defined similarly to this:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitlab-data-pvc
  namespace: gitlab
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd-2r
  resources:
    requests:
      storage: 50Gi
```

`ssd-2r` is defined as 2 replicas on SSD disks. You're correct that the issue is that …
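For context, the `ssd-2r` class described above is presumably a Longhorn StorageClass with two replicas per volume. A minimal sketch of what such a class might look like (the `diskSelector` value is an assumption about how the SSD disks are tagged in Longhorn):

```yaml
# Hypothetical definition of a 2-replica Longhorn StorageClass like ssd-2r.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-2r
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"      # keep two replicas of every volume
  staleReplicaTimeout: "30"
  diskSelector: "ssd"        # assumes the SSD disks carry an "ssd" tag in Longhorn
```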
Hi @SlavikCA, …
@Vicente-Cheng Can you please clarify: which specific feature is experimental? |
To Reproduce
Steps to reproduce the behavior:
Shut down one of the nodes (`shutdown now` in the CLI).

Expected behavior
I expect that pods which were running on T7820i will now run on the other node: T7920.
Actually what happens
Pods are not running on the other node.
For example, I had a Gitlab deployment running on T7820i.
The pod uses storage: a few PVCs on Longhorn. Every PV has 2 replicas:
And when the node was shut down, here is what I see:
It is stuck in the `Terminating` state. At the same time, on the other node I see this:
So, I have 2 nodes and replicated storage, but when a node is lost (shut down), the app is down. What can I do to make the app (Gitlab) resilient in case of node failure?
In the steps above, I cordoned one node. But the promise of Kubernetes is that even in the case of an unexpected node failure the app would continue to work. Why is that not the case? Is the problem in the storage layer (Longhorn)?
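For what it's worth, a pod stuck in `Terminating` on a lost node is expected Kubernetes behavior: the node's kubelet can no longer confirm the deletion, so the pod object lingers until it is removed forcibly (or the node object is deleted). A minimal sketch of the manual workaround, with placeholder names:

```sh
# Inspect where the gitlab pods are scheduled and which one is stuck.
kubectl -n gitlab get pods -o wide

# Force-remove the stuck pod so the Deployment creates a replacement on the
# surviving node. The pod name is a placeholder; whether the replacement can
# start also depends on Longhorn detaching the ReadWriteOnce volume from the
# lost node.
kubectl -n gitlab delete pod gitlab-xxxxxxxxxx-xxxxx --grace-period=0 --force
```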