[bug]: k8s.KubernetesException: too old resource version: xxxx (xxxx) on re-connect #793

Open
rpearsondev opened this issue Jul 17, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@rpearsondev
Contributor

rpearsondev commented Jul 17, 2024

Describe the bug

When the operator has been running for a while and the watch of a resource fails, it correctly tries to re-connect. However, when it restarts the watch of the entity, it can fail with the error k8s.KubernetesException: too old resource version: xxxx (xxxx). This happens because the operator has been in a re-connecting state long enough that the resource version it holds in memory has fallen too far behind the current resource version on the API server.
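
For illustration, here is a minimal sketch of that stuck loop (this is not KubeOps code; `watchOnce` is a hypothetical stand-in for the real watch call):

```csharp
// Minimal sketch of the stuck reconnect loop. `watchOnce` is a hypothetical
// stand-in for the real watch call, not a KubeOps API.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StaleWatchLoop
{
    public static async Task RunAsync(
        Func<string?, CancellationToken, Task> watchOnce,
        CancellationToken token)
    {
        // Last resourceVersion observed from events/bookmarks, held only in memory.
        string? lastSeen = "68358497";

        while (!token.IsCancellationRequested)
        {
            try
            {
                // Every reconnect resumes from the stored version. Once the API
                // server has compacted its history past that version, this call
                // fails with "too old resource version" (HTTP 410 Gone).
                await watchOnce(lastSeen, token);
            }
            catch (Exception)
            {
                // lastSeen is never reset, so the next attempt fails the same
                // way and the watcher backs off forever, as in the logs below.
                await Task.Delay(TimeSpan.FromSeconds(2), token);
            }
        }
    }
}
```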

To reproduce

It is difficult to reproduce because it relies on a transient error to initiate it; however, here is the profile of the issue.

There has to be an initial failure in the watch of a resource. In my case it was "The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)".

Then, on each successive re-connection attempt, it fails with "too old resource version: 68358497 (68368462)".

Here are the logs, filtered to only what is relevant:

timestamp [UTC] message error
7/16/2024, 1:05:49.052 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358100.  
7/16/2024, 1:06:02.432 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358497.  
7/16/2024, 1:06:44.608 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68357572.  
7/16/2024, 1:06:49.407 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358100.  
7/16/2024, 1:07:02.387 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358497.  
7/16/2024, 1:07:43.839 PM Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68368670.  
7/16/2024, 1:07:47.938 PM Watcher for Hidden was terminated and is reconnecting.  
7/16/2024, 1:07:47.943 PM Watcher for Hidden was terminated and is reconnecting.  
7/16/2024, 1:07:47.943 PM Watcher for Hidden was terminated and is reconnecting.  
7/16/2024, 1:07:47.949 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)
7/16/2024, 1:07:47.949 PM There were 1 errors / retries in the watcher. Wait 2.456s before next attempt to connect.  
7/16/2024, 1:07:48.030 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:07:48.030 PM There were 1 errors / retries in the watcher. Wait 2.197s before next attempt to connect.  
7/16/2024, 1:07:48.097 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:07:48.097 PM There were 1 errors / retries in the watcher. Wait 2.823s before next attempt to connect.  
7/16/2024, 1:07:50.230 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:07:50.232 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:07:50.232 PM There were 2 errors / retries in the watcher. Wait 4.481s before next attempt to connect.  
7/16/2024, 1:07:50.403 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:07:50.405 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:07:50.405 PM There were 2 errors / retries in the watcher. Wait 4.085s before next attempt to connect.  
7/16/2024, 1:07:50.918 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:07:50.920 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:07:50.920 PM There were 2 errors / retries in the watcher. Wait 4.945s before next attempt to connect.  
7/16/2024, 1:07:54.490 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:07:54.493 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:07:54.493 PM There were 3 errors / retries in the watcher. Wait 8.04s before next attempt to connect.  
7/16/2024, 1:07:54.714 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:07:54.716 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:07:54.716 PM There were 3 errors / retries in the watcher. Wait 8.351s before next attempt to connect.  
7/16/2024, 1:07:55.866 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:07:55.869 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:07:55.869 PM There were 3 errors / retries in the watcher. Wait 8.888s before next attempt to connect.  
7/16/2024, 1:08:02.532 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:08:02.534 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:08:02.534 PM There were 4 errors / retries in the watcher. Wait 16.606s before next attempt to connect.  
7/16/2024, 1:08:03.066 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:08:03.068 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:08:03.068 PM There were 4 errors / retries in the watcher. Wait 16.1s before next attempt to connect.  
7/16/2024, 1:08:04.756 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:08:04.758 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:08:04.758 PM There were 4 errors / retries in the watcher. Wait 16.796s before next attempt to connect.  
7/16/2024, 1:08:19.143 PM Watcher for V1EntityR was terminated and is reconnecting.  
7/16/2024, 1:08:19.148 PM There was an error while watching the resource "ObscureNamespace.V1EntityR". too old resource version: 68358100 (68368631)
7/16/2024, 1:08:19.148 PM There were 5 errors / retries in the watcher. Wait 32.544s before next attempt to connect.  
7/16/2024, 1:08:19.168 PM Watcher for V1EntityPM was terminated and is reconnecting.  
7/16/2024, 1:08:19.169 PM There was an error while watching the resource "ObscureNamespace.V1EntityPM". too old resource version: 68358497 (68368462)
7/16/2024, 1:08:19.169 PM There were 5 errors / retries in the watcher. Wait 32.427s before next attempt to connect.  
7/16/2024, 1:08:21.555 PM Watcher for V1EntitySRB was terminated and is reconnecting.  
7/16/2024, 1:08:21.557 PM There was an error while watching the resource "ObscureNamespace.V1EntitySRB". too old resource version: 68358844 (68369473)
7/16/2024, 1:08:21.557 PM There were 5 errors / retries in the watcher. Wait 32.521s before next attempt to connect.  
7/16/2024, 1:08:51.596 PM Watcher for V1EntityPM was terminated and is reconnecting.  

It continues in this way forever, until the pod is restarted.

Expected behavior

On error, the watcher should start again (regardless of whether the currentVersion is too old) so that the operator doesn't stop reconciling CRs.
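
A minimal sketch of that recovery, under the same assumption of a hypothetical watch delegate as above (the actual change is proposed in #792): when the stored version is rejected as too old, drop it so the next attempt starts a fresh watch instead of replaying a dead version.

```csharp
// Sketch of the expected recovery; `watchOnce` is again a hypothetical
// stand-in for the real watch call, not the KubeOps API.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RecoveringWatchLoop
{
    public static async Task RunAsync(
        Func<string?, CancellationToken, Task> watchOnce,
        CancellationToken token)
    {
        // Updated from received events/bookmarks in a real watcher.
        string? lastSeen = null;

        while (!token.IsCancellationRequested)
        {
            try
            {
                await watchOnce(lastSeen, token);
            }
            catch (Exception ex) when (ex.Message.Contains("too old resource version"))
            {
                // The stored version has been compacted away; clear it so the
                // next attempt starts fresh instead of retrying a dead version.
                lastSeen = null;
            }
            catch (Exception)
            {
                // Other transient errors keep the last observed version and back off.
                await Task.Delay(TimeSpan.FromSeconds(2), token);
            }
        }
    }
}
```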

Screenshots

No response

Additional Context

We are using the KubeOps.Operator 9.1.2 NuGet package.

I have created a PR with a suggested fix: #792

@rpearsondev rpearsondev added the bug Something isn't working label Jul 17, 2024
@stevefan1999-personal
Contributor

Yes, it did retry, but the version is too far off, and you won't catch up with the missed events:

k8s.KubernetesException: too old resource version: 8271646 (8271663)
   at KubeOps.KubernetesClient.KubernetesClient.<>c__26`1.<WatchAsync>b__26_0(Exception ex)
   at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
   at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+MoveNext()
   at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+MoveNext()
   at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
   at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
warn: KubeOps.Operator.Watcher.ResourceWatcher[0]
      There were 163 errors / retries in the watcher. Wait 32.255s before next attempt to connect.
KubeOps.Operator.Watcher.ResourceWatcher: Warning: There were 163 errors / retries in the watcher. Wait 32.255s before next attempt to connect.
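
One way to handle the missed events is the usual Kubernetes list-then-watch pattern: after a 410, re-list everything, reconcile the snapshot, and resume watching from the list's resourceVersion. A rough sketch, again with hypothetical list/watch delegates rather than real KubeOps APIs:

```csharp
// Sketch of list-then-watch recovery after "too old resource version".
// listAll and watchFrom are hypothetical stand-ins, not KubeOps APIs.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class RelistRecovery
{
    public sealed record Snapshot(IReadOnlyList<object> Items, string ResourceVersion);

    public static async Task RecoverAsync(
        Func<CancellationToken, Task<Snapshot>> listAll,
        Func<string, CancellationToken, Task> watchFrom,
        Action<object> reconcile,
        CancellationToken token)
    {
        // 1. Re-list: the current state of every object stands in for the
        //    events that were missed while the watch was down.
        Snapshot snapshot = await listAll(token);
        foreach (object item in snapshot.Items)
        {
            reconcile(item);
        }

        // 2. Resume watching from the resourceVersion returned by the list,
        //    which is fresh at this point, so no events are skipped in between.
        await watchFrom(snapshot.ResourceVersion, token);
    }
}
```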

@exextatic
Contributor

I've also encountered this issue in my environment.

@aptandon

@buehler we have also encountered this.

@jhested

jhested commented Sep 19, 2024

+1

@oskrSaxo

@buehler Encountering the same - seems to build up over time - resolved for us by restarting the pods, which is obviously not ideal.

@kristianbjensen

We are also seeing this issue in all our environments.
