Kubernetes API access issues from syncer container on vcluster v0.19.0 #1589
Comments
@hinewin Thanks for raising the issue! Regarding the 403 on certain resources in named groups (e.g. the …). Regarding the timeouts: would you mind providing us with more information about the utilization of the Kubernetes API server / host cluster? Could it be that it is/was under heavy load?
Hello @johannesfrey, thanks for the quick reply.
Some of these errors from the log that I provided include endpoints that are already enabled by default, such as … I am not exactly getting a timeout error on the …
Sure 🙂. I just saw that the host cluster's Kubernetes version is v1.25, which is no longer supported with vcluster v0.19.x. I don't know if it's feasible for you to either upgrade the host cluster or to use vcluster v0.18.x (where v1.25 is supported), just to rule out any incompatibility issues beforehand.
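As a quick sanity check before upgrading, the host's server version can be read directly; a minimal sketch (run against the host cluster's kubeconfig; jq is just one way to extract the field):

# Print the host cluster's API server version to compare against
# the vcluster compatibility matrix.
kubectl version -o json | jq -r '.serverVersion.gitVersion'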
Hi @johannesfrey, I've updated the host cluster to a supported version (v1.26.14) but unfortunately am still facing the same issue.
Same issue here:
with k8s 1.28.4 and vcluster 0.19.5
@hinewin @SCLogo would you mind providing the config values you used to create the vclusters, so I can try to reproduce as closely as possible? Just to see if there are similarities that might help track down the reason. @SCLogo is your host k8s cluster also Rancher-based?
Attached is a file that contains all of the values I use for the vCluster Helm chart. Regarding …
Sorry, I'm not quite sure what exactly you mean by this, but none of the control planes restarted on their own. However, I've recently done a rolling restart on each control plane to upgrade its resources as part of troubleshooting this vCluster delay issue. Other than that, the nodes have always been stable.
Thanks for the info. Yeah, just trying to connect the dots and rule out whether there have been any overload (you ruled this out already) or connection issues to the underlying API server of the host, because the logs look like the watches that the syncer opens to this API server time out or are being aborted. Also, in your initial post you mentioned the vCluster distro to be k8s, but in the values you attached you set the …
@johannesfrey my mistake, the vCluster Kubernetes distribution is k3s.
Sorry for the late response:

api:
  image: registry.k8s.io/kube-apiserver:v1.28.2
controller:
  image: registry.k8s.io/kube-controller-manager:v1.28.2
coredns:
  image: coredns/coredns:1.11.1
etcd:
  storage:
    className: default
ingress:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: example.site.host.com
  enabled: true
  host: example.site.host.com
  ingressClassName: internal-ingress
mapServices:
  fromHost:
    - from: default/app
      to: default/app
proxy:
  metricsServer:
    nodes:
      enabled: true
    pods:
      enabled: true
sync:
  ingresses:
    enabled: true
  networkpolicies:
    enabled: true
  nodes:
    enabled: true
  persistentvolumeclaims:
    enabled: true
  persistentvolumes:
    enabled: true
  poddisruptionbudgets:
    enabled: true
  storageclasses:
    enabled: true
syncer:
  extraArgs:
    - --tls-san=example.site.host.com
    - --sync-labels="app.kubernetes.io/instance"

The apiserver on the host restarted 2 times since the cluster has been running. It is not a Rancher-based host; we create host clusters with …
@hinewin @SCLogo Thanks for the info and sorry for the late response. I assume that the issue still occurs? If so, could you test disabling the Metrics Server Proxy and see if this changes the situation, essentially removing this part from your values:

proxy:
  metricsServer:
    nodes:
      enabled: true
    pods:
      enabled: true
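For anyone following along: after removing (or disabling) that block in the values file, the change still has to be rolled out to the running release. A minimal sketch, with release name and namespace as placeholders (the chart name differs per distro, e.g. vcluster vs. vcluster-k8s):

# Roll out the edited values (metricsServer proxy removed) to the release.
helm upgrade my-vcluster vcluster --repo https://charts.loft.sh \
  -n vcluster-ns -f values.yaml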
I have a similar issue; I did not yet try to disable the metrics server.
Syncer image: ghcr.io/loft-sh/vcluster:0.19.5
It looks like one can reproducibly force this error message (…). E.g.: …
Will check if this is kind of "expected" (in which case it shouldn't log an ERROR) or not. Apart from that, is your virtual cluster running fine, or do you perceive other slowness/delays? (@everflux @SCLogo)
I experience multi-10-second latency when interacting with the API server in the guest cluster. Another curiosity that might be related: I installed metrics-server in the guest.
Could you try disabling the proxy metrics server? Also, would you mind sharing your values?
I will give it a try once the clusters are no longer in use.
Unfortunately, even after disabling the Metrics Server Proxy, the issue persists. This deployment took 2 minutes to go through, and I received this error: …
Are those the logs from the syncer? This looks different to the initial error log, which was …
I re-tested leaving out the metrics proxy and could not reproduce the problem with that setup; even quickly spawning 100 pods and deleting them did not show any errors in the syncer log.
My apologies, the log message I sent above was from the CLI. Here is the syncer log, as I am still experiencing the same issue with the metrics proxy disabled. The issue is quite erratic; I am using a script to create & delete resources and am still facing the same delay issue.
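For reference, a churn script in the spirit of the one mentioned above might look like this (a hypothetical sketch run against the vcluster kubeconfig, not the reporter's actual script):

#!/bin/bash
# Create and delete deployments in a loop, timing each create so the
# multi-second stalls described in this thread become visible.
for i in $(seq 1 50); do
  time kubectl create deployment "churn-$i" --image=nginx
done
for i in $(seq 1 50); do
  kubectl delete deployment "churn-$i" --wait=false
done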
Unfortunately, leaving out metrics does not seem to solve the problem completely.
The following error could be observed, but the (roughly) 30-second complete hang of the API server not responding did not happen.
This is my YAML for that case: …
If I disable the fakeKubeletIPs, I observe the timeouts occurring more often, though only as a subjective measurement.
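For readers trying to reproduce that toggle: in the 0.19.x charts I believe fake kubelet IPs are controlled via a syncer flag (the flag name is my assumption from the docs; verify against your chart version):

# values.yaml (vcluster 0.19.x) - turn off fake kubelet IPs (assumed flag name)
syncer:
  extraArgs:
    - --fake-kubelet-ips=false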
Thanks for testing this out @everflux and for providing additional logs @hinewin. So I guess this problem is twofold: …
In your current setup you don't experience any delays right now, @everflux?
I am not sure what triggers the unresponsive API server, so I have no clue how to force or reproduce the issue. Any ideas to try are appreciated.
I re-tested with 0.20.0-beta1 and could observe the hang of about 20 seconds again.
I've noticed a similar limitation (vcluster-k8s 0.19.3). Reverting to the old distro (0.15.3) …
I observe the same issue with 0.20.0-beta2.
@johannesfrey Is there any specific input you are waiting for, as this issue is marked as waiting for details?
I noticed this …
which led me to this issue: kubernetes/kubernetes#56430
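For what it's worth, my understanding (not verified against that issue) is that the API server deliberately closes long-running watch requests once its request timeout elapses, and clients are expected to reconnect; a log line like the "timeout or abort while handling ... watch=true" message above can therefore also appear during normal watch expiry. A quick way to observe a bounded watch ending:

# Open a watch and let the client-side timeout close it after 30s; the
# server similarly closes watches on its own timeout, after which
# well-behaved clients (e.g. the syncer's informers) simply reconnect.
kubectl get endpoints --all-namespaces --watch --request-timeout=30s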
@everflux Sorry for the delay. I'm currently occupied with other tasks, but I'll take a look as soon as possible.
I'm also seeing the behavior described in this issue. I made an issue before realizing that this one existed. I'll leave it open for now, as it has some details relevant to my configuration. I am running vcluster version …
@Interaze Did that fix the issue?
@alextricity25 Would it be possible for you to also share your config values for your virtual cluster?
This issue could be related to kubernetes/kubernetes#123448, which was fixed in Kubernetes versions v1.27.13 and v1.28.9.
@johannesfrey Sure thing! Here is my config:

controlPlane:
  distro:
    k8s:
      enabled: true
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
  proxy:
    extraSANs:
      - example.dev
  statefulSet:
    resources:
      limits:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 4Gi
      requests:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 4Gi
exportKubeConfig:
  server: https://example.dev
sync:
  toHost:
    serviceAccounts:
      enabled: true
I'm running version v1.29.1-gke.1589018 on the host cluster, and my virtual cluster version is v1.29.0.
What happened?
We are currently running vcluster v0.19.0 and have encountered issues accessing the Kubernetes API from the syncer container.
We have been observing a series of timeout errors while making GET requests to various API endpoints from within the syncer container. The logs include messages such as:
2024-03-08 19:50:10 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=17782778&watch=true" audit-ID="aff4ede2-a83c-4077-8a97-1bb29628aa2a" {"component": "vcluster"}
The majority of the errors are timeouts or aborts while making requests to watch Kubernetes resources. Additionally, we are observing a delay when deploying any Kubernetes resources: the
kubectl apply -f
command stalls for a long period before the operation actually gets executed. We suspect that these delays and the API access issues might be related. As part of our investigation, we exec'd into the syncer container and directly accessed the mentioned endpoints. Endpoints under the core "/api" path returned a 200 status without any issues, while endpoints under "/apis" (named API groups) returned 403.
"API" endpoint tested:
${APISERVER}/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=17782782&watch=true
"APIS" Endpoint tested:
apis/flowcontrol.apiserver.k8s.io/v1/flowschemas?allowWatchBookmarks=true&resourceVersion=17782783& watch=true
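For anyone wanting to repeat that check, a minimal sketch of the in-cluster request (pod name, namespace, and container are placeholders, and this assumes a shell plus curl are available in the syncer image; the token path is the standard service-account mount):

# Exec into the syncer container and query the host API server directly,
# authenticating with the pod's mounted service-account token.
kubectl exec -n vcluster-ns my-vcluster-0 -c syncer -- sh -c '
  APISERVER=https://kubernetes.default.svc
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  # core ("/api") paths returned 200; some named ("/apis") groups returned 403
  curl -s -o /dev/null -w "%{http_code}\n" --cacert "$CACERT" \
    -H "Authorization: Bearer $TOKEN" \
    "$APISERVER/apis/flowcontrol.apiserver.k8s.io/v1/flowschemas"
'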
syncer.log
What did you expect to happen?
We would expect the API requests from within the syncer container to be handled without any issues or timeouts. The various Kubernetes endpoints should be accessible and respond in a timely manner, with the correct HTTP response codes.
On deploying YAML files,
kubectl apply -f
should execute seamlessly and promptly, without any noticeable delays.
How can we reproduce it (as minimally and precisely as possible)?
Set up vcluster v0.19.0 and attempt to access the Kubernetes API endpoints manually.
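A minimal reproduction sketch with the vcluster CLI (cluster name, namespace, and values file are placeholders; flag names as I recall them from the 0.19 CLI):

# Create a v0.19.0 virtual cluster and run a request through it.
vcluster create repro -n vcluster-repro --chart-version 0.19.0 -f values.yaml
vcluster connect repro -n vcluster-repro -- kubectl get pods -A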
Anything else we need to know?
No response
Host cluster Kubernetes version
Host cluster Kubernetes distribution
vcluster version
vcluster Kubernetes distribution (k3s (default), k8s, k0s)
OS and Arch