What you would like to accomplish: Increase the GCSFuse sidecar's default memory from 100Mi to 200Mi, instead of having to use a mutating webhook to do so.
How this should work:
The increase would either be applied automatically, or there would be a native option to raise the sidecar container's memory.
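For reference, a minimal sketch of the per-pod override that exists today, assuming the gke-gcsfuse/memory-limit annotation read by the driver's injection webhook (my-workload is a placeholder Deployment name):
# Sketch: raise the injected sidecar's memory limit via a pod-template annotation.
# The annotation change triggers a rollout, and the webhook injects the sidecar
# with the new limit on the replacement pods.
kubectl patch deployment my-workload --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"gke-gcsfuse/memory-limit":"200Mi"}}}}}'
The request here is to not need such per-workload overrides (or a custom cluster-wide mutating webhook) just to move past the 100Mi default.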
Explanation of the problem:
The gcs-fuse-csi-driver sidecar container repeatedly restarts and shows as OOMKilled, even though the node does not appear to be running out of resources or under memory pressure. The container did eventually restart, but in this case that took about 3 hours (instead of a few minutes), during which the workload was unresponsive and remained stuck until the pod was eventually evicted.
Several error messages surfaced after a deeper investigation:
MountVolume.SetUp failed for volume "<VOLUME_NAME>" : kubernetes.io/csi: mounter.SetUpAt failed to determine if the node service has VOLUME_MOUNT_GROUP capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins/gcsfuse.csi.storage.gke.io/csi.sock: connect: connection refused"
(combined from similar events): Memory cgroup out of memory: Killed process 2960932 (gcs-fuse-csi-dr) total-vm:1340248kB, anon-rss:100796kB, file-rss:26844kB, shmem-rss:0kB, UID:0 pgtables:372kB oom_score_adj:-997
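For completeness, a sketch of how these sidecar kills can be confirmed from the pod itself, assuming the injected sidecar keeps its usual gke-gcsfuse-sidecar name (<WORKLOAD_POD> is a placeholder):
kubectl get pod <WORKLOAD_POD> -o wide   # the RESTARTS column shows the sidecar restart loop
kubectl describe pod <WORKLOAD_POD>      # look for "Last State: Terminated" with "Reason: OOMKilled" on the sidecar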
Since this sidecar is injected automatically and managed by default, problems like these can cause serious downtime for the workload.
The connection refused error tends to indicate that the gcs-fuse-csi-driver itself is unavailable, which makes me suspect something deeper is going on. Could you provide the following?
Are you using GKE Autopilot?
Are you using the managed driver on GKE?
Could you verify the gcsfuse-node-* pods are healthy on your nodes? (See the sketch after these questions.)
Could you share how many gcsfuse-backed pods you are running per VM/Node?
Could you share the Cluster ID with me? You can get the id by running gcloud container clusters describe <cluster-name> --location <cluster-location> | grep id:
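For the pod-health question, something along these lines should be enough (a sketch; the exact namespace and pod names depend on whether the managed GKE driver or a self-deployed install is in use, hence the loose grep):
kubectl get pods -A -o wide | grep -i gcsfuse                   # locate the driver's node pods and check READY/RESTARTS
kubectl describe pod -n <DRIVER_NAMESPACE> <GCSFUSE_NODE_POD>   # inspect events and container state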