[bug]: Re-triggering reconcile does not work when status is updated #554
Following up on our discussion to get more context on this:
My understanding is that replacing "Switch" with "Merge" should fix this issue. I want to be sure that this won't cause any unintended side-effects.

dotnet-operator-sdk/src/KubeOps/Operator/Controller/EventQueue.cs
Lines 44 to 55 in 3f42bbc
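For context, the difference matters because Rx's `Switch` subscribes only to the most recent inner observable and unsubscribes from earlier ones, while `Merge` keeps all inner subscriptions alive. A minimal, illustrative sketch with System.Reactive (the variable names are hypothetical and not taken from EventQueue.cs):

```csharp
using System;
using System.Reactive.Linq;

// Three delayed inner sequences, standing in for delayed requeue timers.
var requeues = Observable.Range(1, 3)
    .Select(i => Observable.Return($"requeue {i}")
        .Delay(TimeSpan.FromMilliseconds(100 * i)));

// Switch: each new inner observable cancels the previous one, so only
// "requeue 3" ever fires - a pending timed requeue is silently dropped
// when a newer event (e.g. a status update) arrives.
requeues.Switch().Subscribe(Console.WriteLine);

// Merge: all inner sequences stay subscribed, so all three fire.
requeues.Merge().Subscribe(Console.WriteLine);
```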
Thank you. |
@buehler I think we're also being hit by this bug - I couldn't debug our operator yet. The issue is that after some reconcile loops, most of the resources are never reconciled again, even if the resource was updated in Kubernetes (not the status). Strangely, this does not happen for all resources. After restarting the operator, all resources are reconciled again for some time, then reconciliation stops working for most resources again. I will open a new issue as it seems unrelated. |
We encountered the same issue last week. We perform reconciliation for custom resources every minute, during which we update the status of each custom resource. There are seven resources undergoing reconciliation every minute. However, at random intervals, such as after 2 days or 5 days, we've noticed that the reconciliation process stops for some of these seven resources. We did not observe any errors or exceptions in the logs. We are using EKS 1.24 and the KubeOps 7.6.1 library. Is there any fix/workaround available? |
@tomitesh Just wanted to double-check that you are not returning `null` for these objects (for which the reconciliation process is stopping) as part of the reconcile loop. |
Thanks @Karthik2893 for your reply. We aim to reconcile every minute, opting for option 1 as outlined in the readme to achieve periodic reconciliation. However, after a random interval of 2-5 days, reconciliation ceases for certain resources. While utilizing option 2, reconciliation occurs only once during startup (unless I am missing some configuration). Note: we also update the status as part of "// reconcile logic".

Option 1:

```csharp
public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
{
    _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
    await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);
    // reconcile logic
    return ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(60));
}
```

Option 2:

```csharp
public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
{
    _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
    await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);
    // reconcile logic
    return null;
}
``` |
@tomitesh Do you happen to know what your WatcherHttpTimeout is set to? Setting it to a higher value (>120 min or so) resulted in failing to reconcile after a certain period of time. But if that were the case, it should fail to reconcile for all entities, not just a fraction of them. For your case, I suspect one of your code paths might be returning `null` for the entities that are failing to reconcile? If not, then we will have to ask others to look into it, and it would be helpful if we have code snippets :) |
I can confirm, I'm experiencing the same issue on 8.0.0-pre.29 and 8.0.0-pre.34 (didn't try other versions) - during reconcile I update the status and publish events (not sure if related), then requeue. |
Hmm. Good question. Events should not impact the watcher, since they are completely different objects in Kubernetes. However, status updates could impact the requeue cache. But when you update the status and then requeue the resource, it should retrigger the reconcile. I'll conduct a test of my own :) |
[Checked on v8.0.0-pre.38] If, in ReconcileAsync, one simply updates the status and then requeues the entity, then due to the asynchronous nature of the system, even if the entity status is changed first, the "Modified" event for it might arrive AFTER the requeue. You can check the attached screenshot.
|
It's even more obvious that the requeue is skipped if you look at the log below (after the KubernetesClient reconnects):

```
[14:38:51.216 - DBG - WebhookOperator.Controller.UscSystemEntityController - requeueing...
[14:38:51.216 - VRB - KubeOps.Abstractions.Queue.EntityRequeue - Requeue entity "X" in 5000ms.
[14:38:51.216 - DBG - WebhookOperator.Controller.XEntityController - requeued
[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - The watcher was closed
[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Create watcher for entity of type "WebhookOperator.Entities.XEntity".
[14:38:52.888 - VRB - KubeOps.Operator.Watcher.ResourceWatcher - Received watch event "Added" for "X/alpha".
[14:38:52.888 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Entity "X/alpha" modification did not modify generation. Skip event.
```

NOTHING HAPPENS ANYMORE HERE, EVEN THOUGH THE THIRD LINE REQUEUED THE ENTITY!
|
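For context on the "did not modify generation" line: in Kubernetes, status-only updates do not bump `metadata.generation` (only `.spec` changes do), so a watcher that deduplicates events by comparing generations will drop the watch event that follows a status update. A simplified, hypothetical sketch of that kind of check (not the actual ResourceWatcher code; names are illustrative):

```csharp
// Hypothetical event filter; KubeOps' real implementation differs.
private readonly Dictionary<string, long> _seenGenerations = new();

private bool ShouldReconcile(V1DemoEntity entity)
{
    var key = $"{entity.Namespace()}/{entity.Name()}";
    var generation = entity.Generation() ?? 0;

    // A status-only update leaves metadata.generation unchanged, so the
    // event is skipped here - including the watch event that was supposed
    // to carry the requeued entity back into the reconcile loop.
    if (_seenGenerations.TryGetValue(key, out var seen) && seen >= generation)
    {
        return false;
    }

    _seenGenerations[key] = generation;
    return true;
}
```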
This should be overhauled in v8. |
Unfortunately, my previous two comments were for v8.0.0-pre.38. |
Hey @sicavz, so it does not work? When using the framework as documented, I could not reproduce the error. You need to use the returned entities from the client so that you have the updated resource versions and so on. And status update does not update the resource version, which should be fine. |
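The "use the returned entities" point can be sketched like this; this is an assumption-laden illustration (the `UpdateStatusAsync` name and the `Status.CurrentState` property are hypothetical and may differ between KubeOps versions):

```csharp
public async Task ReconcileAsync(V1DemoEntity entity)
{
    entity.Status.CurrentState = "Reconciling";

    // Assumption: the client returns the entity as stored by the API
    // server. Keep working with the returned object so subsequent calls
    // carry the fresh resourceVersion instead of the stale one.
    entity = await _client.UpdateStatusAsync(entity);

    // ... reconcile logic using the refreshed `entity` ...
}
```

Reusing the stale object for later updates can cause conflict errors or, depending on caching, silently inconsistent behavior.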
I'm struggling with the same issue in v8. When running locally in VS, everything runs perfectly fine. When deployed to Kubernetes, I get maybe 2 or 3 reconcile calls before it stops.
Describe the bug
As discussed in #551, the timed reconcile return value is ignored when a status update is performed during the reconciliation loop.
To reproduce
Expected behavior
No response
Screenshots
No response
Additional Context
dotnet-operator-sdk/src/KubeOps/Operator/Controller/EventQueue.cs
Lines 44 to 55 in 3f42bbc