[bug]: k8s.KubernetesException: too old resource version: xxxx (xxxx) on re-connect #793

Open
rpearsondev opened this issue Jul 17, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@rpearsondev
Contributor

rpearsondev commented Jul 17, 2024

Describe the bug

When the operator has been running for a while and the watch of a resource fails, it correctly tries to re-connect. However, when it restarts the watch of the entity, it can fail with the error k8s.KubernetesException: too old resource version: xxxx (xxxx). This happens because the operator has been in a re-connecting state long enough that the resource version it holds in memory has fallen too far behind the current resource version on the API server.
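
For illustration, here is a minimal sketch of that stuck loop (this is not KubeOps code; `watchOnce` is a hypothetical stand-in for the real watch call):

```csharp
// Minimal sketch of the stuck reconnect loop. `watchOnce` is a hypothetical
// stand-in for the real watch call, not a KubeOps API.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StaleWatchLoop
{
    public static async Task RunAsync(
        Func<string?, CancellationToken, Task> watchOnce,
        CancellationToken token)
    {
        // Last resourceVersion observed from events/bookmarks, held only in memory.
        string? lastSeen = "68358497";

        while (!token.IsCancellationRequested)
        {
            try
            {
                // Every reconnect resumes from the stored version. Once the API
                // server has compacted its history past that version, this call
                // fails with "too old resource version" (HTTP 410 Gone).
                await watchOnce(lastSeen, token);
            }
            catch (Exception)
            {
                // lastSeen is never reset, so the next attempt fails the same
                // way and the watcher backs off forever, as in the logs below.
                await Task.Delay(TimeSpan.FromSeconds(2), token);
            }
        }
    }
}
```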

To reproduce

It is difficult to reproduce because it relies on a transient error to initiate it; however, here is the profile of the issue.

There has to be an initial failure in the watch of a resource. In my case it was "The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)".

Then, on each successive re-connection attempt, it fails with "too old resource version: 68358497 (68368462)".

Here are the logs, filtered to only what is relevant:

timestamp [UTC] message error
7/16/2024, 1:05:49.052 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358100.  
7/16/2024, 1:06:02.432 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358497.  
7/16/2024, 1:06:44.608 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68357572.  
7/16/2024, 1:06:49.407 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358100.  
7/16/2024, 1:07:02.387 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358497.  
7/16/2024, 1:07:43.839 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68368670.  
7/16/2024, 1:07:47.938 PM Watcher for Hidden was terminated and is reconnecting.  
7/16/2024, 1:07:47.943 PM Watcher for Hidden was terminated and is reconnecting.  
7/16/2024, 1:07:47.943 PM Watcher for Hidden was terminated and is reconnecting.  
7/16/2024, 1:07:47.949 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)
7/16/2024, 1:07:47.949 PM There were 1 errors / retries in the watcher. Wait 2.456s before next attempt to connect.  
7/16/2024, 1:07:48.030 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:07:48.030 PM There were 1 errors / retries in the watcher. Wait 2.197s before next attempt to connect.  
7/16/2024, 1:07:48.097 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:07:48.097 PM There were 1 errors / retries in the watcher. Wait 2.823s before next attempt to connect.  
7/16/2024, 1:07:50.230 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:07:50.232 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:07:50.232 PM There were 2 errors / retries in the watcher. Wait 4.481s before next attempt to connect.  
7/16/2024, 1:07:50.403 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:07:50.405 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:07:50.405 PM There were 2 errors / retries in the watcher. Wait 4.085s before next attempt to connect.  
7/16/2024, 1:07:50.918 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:07:50.920 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:07:50.920 PM There were 2 errors / retries in the watcher. Wait 4.945s before next attempt to connect.  
7/16/2024, 1:07:54.490 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:07:54.493 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:07:54.493 PM There were 3 errors / retries in the watcher. Wait 8.04s before next attempt to connect.  
7/16/2024, 1:07:54.714 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:07:54.716 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:07:54.716 PM There were 3 errors / retries in the watcher. Wait 8.351s before next attempt to connect.  
7/16/2024, 1:07:55.866 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:07:55.869 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:07:55.869 PM There were 3 errors / retries in the watcher. Wait 8.888s before next attempt to connect.  
7/16/2024, 1:08:02.532 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:08:02.534 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:08:02.534 PM There were 4 errors / retries in the watcher. Wait 16.606s before next attempt to connect.  
7/16/2024, 1:08:03.066 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:08:03.068 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:08:03.068 PM There were 4 errors / retries in the watcher. Wait 16.1s before next attempt to connect.  
7/16/2024, 1:08:04.756 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:08:04.758 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:08:04.758 PM There were 4 errors / retries in the watcher. Wait 16.796s before next attempt to connect.  
7/16/2024, 1:08:19.143 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:08:19.148 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:08:19.148 PM There were 5 errors / retries in the watcher. Wait 32.544s before next attempt to connect.  
7/16/2024, 1:08:19.168 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:08:19.169 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:08:19.169 PM There were 5 errors / retries in the watcher. Wait 32.427s before next attempt to connect.  
7/16/2024, 1:08:21.555 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:08:21.557 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:08:21.557 PM There were 5 errors / retries in the watcher. Wait 32.521s before next attempt to connect.  
7/16/2024, 1:08:51.596 PM Watcher for V1EntityPM was terminated and is reconnecting.  

It continues in this way forever, until the pod is restarted.

Expected behavior

On error, the watcher should start again (regardless of whether the currentVersion is too old) so that the operator doesn't stop reconciling CRs.
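
A minimal sketch of that recovery, under the same assumption of a hypothetical watch delegate as above (the actual change is proposed in #792): when the stored version is rejected as too old, drop it so the next attempt starts a fresh watch instead of replaying a dead version.

```csharp
// Sketch of the expected recovery; `watchOnce` is again a hypothetical
// stand-in for the real watch call, not the KubeOps API.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RecoveringWatchLoop
{
    public static async Task RunAsync(
        Func<string?, CancellationToken, Task> watchOnce,
        CancellationToken token)
    {
        // Updated from received events/bookmarks in a real watcher.
        string? lastSeen = null;

        while (!token.IsCancellationRequested)
        {
            try
            {
                await watchOnce(lastSeen, token);
            }
            catch (Exception ex) when (ex.Message.Contains("too old resource version"))
            {
                // The stored version has been compacted away; clear it so the
                // next attempt starts fresh instead of retrying a dead version.
                lastSeen = null;
            }
            catch (Exception)
            {
                // Other transient errors keep the last observed version and back off.
                await Task.Delay(TimeSpan.FromSeconds(2), token);
            }
        }
    }
}
```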

Screenshots

No response

Additional Context

We are using the KubeOps.Operator 9.1.2 NuGet package.

I have created a PR with a suggested fix: #792

@rpearsondev rpearsondev added the bug Something isn't working label Jul 17, 2024
@stevefan1999-personal
Contributor

Yes, it did retry, but the version is too far off, and you won't catch up with the missed events:

k8s.KubernetesException: too old resource version: 8271646 (8271663)
   at KubeOps.KubernetesClient.KubernetesClient.<>c__26`1.<WatchAsync>b__26_0(Exception ex)
   at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
   at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+MoveNext()
   at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+MoveNext()
   at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
   at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
warn: KubeOps.Operator.Watcher.ResourceWatcher[0]
      There were 163 errors / retries in the watcher. Wait 32.255s before next attempt to connect.
KubeOps.Operator.Watcher.ResourceWatcher: Warning: There were 163 errors / retries in the watcher. Wait 32.255s before next attempt to connect.
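
One way to handle the missed events is the usual Kubernetes list-then-watch pattern: after a 410, re-list everything, reconcile the snapshot, and resume watching from the list's resourceVersion. A rough sketch, again with hypothetical list/watch delegates rather than real KubeOps APIs:

```csharp
// Sketch of list-then-watch recovery after "too old resource version".
// listAll and watchFrom are hypothetical stand-ins, not KubeOps APIs.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class RelistRecovery
{
    public sealed record Snapshot(IReadOnlyList<object> Items, string ResourceVersion);

    public static async Task RecoverAsync(
        Func<CancellationToken, Task<Snapshot>> listAll,
        Func<string, CancellationToken, Task> watchFrom,
        Action<object> reconcile,
        CancellationToken token)
    {
        // 1. Re-list: the current state of every object stands in for the
        //    events that were missed while the watch was down.
        Snapshot snapshot = await listAll(token);
        foreach (object item in snapshot.Items)
        {
            reconcile(item);
        }

        // 2. Resume watching from the resourceVersion returned by the list,
        //    which is fresh at this point, so no events are skipped in between.
        await watchFrom(snapshot.ResourceVersion, token);
    }
}
```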

@exextatic
Contributor

I've also encountered this issue in my environment.

@aptandon

@buehler we have also encountered this.

@jhested

jhested commented Sep 19, 2024

+1

@oskrSaxo

@buehler Encountering the same - seems to build up over time - resolved for us by restarting the pods, which is obviously not ideal.

@kristianbjensen

We are also seeing this issue in all our environments.
