[DPE-4307] HA process interrupt tests #114
base: 2/edge
Conversation
-        return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"]) and bool(
-            self.dashboards.services[self.SNAP_EXPORTER_SERVICE]["active"]
-        )
+        return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"])
We had an ugly bug here. (@phvalguima is my witness :-) )

Reminder: OSD is a service that doesn't even come up (at all!) except under perfect conditions, meaning a functional connection to Opensearch, etc. No, we do NOT even have a Status Page. All we get is:

The Prometheus Exporter process is very similar. It can't come up as long as there is no functional OSD available; otherwise it fails right after a restart.

This resulted in an ugly, flaky behavior whenever OSD was considered active on the snap level, yet the service was not running (for example: due to a missing Opensearch connection). Meaning that the first part of the condition above was True, while the 2nd part was True/False, depending on the moment.

Debugging the underlying data structures (representing snap service state):

The conclusion is this: we shouldn't check the Prometheus Exporter state here. One proposal could be to do it in a separate check. (I can propose a solution in parallel on a separate branch. If we want to address the matter, it would be much nicer in a separate task.)

In my opinion, as long as we are not providing similar safety measures on Opensearch (where monitoring is WAY more crucial), it may be best to keep complexity low and trust automatic snap restarts to cover Prometheus Exporter failures.
@juditnovak yes, I think we need to separate the concepts: alive() here means the main service is up and running.
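To make that distinction concrete, here is a minimal sketch of keeping the two checks apart. Only the SNAP_* constants and the self.dashboards.services mapping come from the diff above; the method names and docstrings are illustrative, not the charm's actual API:

    def alive(self) -> bool:
        # Snap-level view only: True when the main OSD service is running.
        return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"])

    def exporter_alive(self) -> bool:
        # Separate, optional check for the Prometheus Exporter service,
        # kept out of alive() so a flapping exporter can't make the main check flaky.
        return bool(self.dashboards.services[self.SNAP_EXPORTER_SERVICE]["active"])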
Thanks Judit - I have a minor comment regarding the check on the status as part of the charm logic.
There is a potential loop here:

The reconcile method may end up calling for a restart. And, in the _restart, we end up calling reconcile again before the actual restart.

A charm leader in a deployment with no lock assigned would end up in a continuous loop where the charm calls:
reconcile() > detects unit is down > calls acquire_lock() > which emits a relation_changed > assigns itself the lock > calls the _restart > re-calls the reconcile
and restarts the loop.

My recommendation is to restart the workload right away in the _restart method.
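To illustrate that recommendation, a minimal sketch of a _restart handler that restarts the snap service directly rather than re-entering reconcile. The self.workload.restart() helper is an assumption for illustration, not the charm's actual API:

    def _restart(self, event) -> None:
        # Restart the workload right away; do NOT call reconcile() from here.
        # Otherwise reconcile -> acquire_lock -> relation_changed -> _restart
        # -> reconcile keeps looping on a leader with no lock assigned.
        self.workload.restart()  # assumed helper wrapping a snap restart of the OSD service
        # Health is then verified by the next update-status instead of an immediate reconcile.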
@phvalguima You are right. Restart attempt on service down is removed from
    clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
    self.on.update_status.emit()
Avoid a loop here, as described on: #114 (review)

There is no point in speeding up an update-status, it will happen anyway. In this case, it is better to leave some time between this _restart call and the next update-status for the system to settle.
Looking at this code again, it may be worth structuring it instead as follows:
start_time = time.time()
unit_healthy, _ = self.health_manager.unit_healthy()
while not unit_healthy and time.time() - start_time < SERVICE_AVAILABLE_TIMEOUT:
time.sleep(5)
unit_healthy, _ = self.health_manager.unit_healthy()
if unit_healthy:
# We clean the status and finish
clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
else:
# this call will potentially retrigger a restart
self.on.update_status.emit()
FYI, the proposal above reintroduces the risk of a continuous restart loop if the unit is failing because of some other problem...
This change unites 2 tasks:
- A process interrupt (SIGSTOP to the node process) was not handled by the charm update-status.
- update-status was clearing the Service unavailable status as soon as the process (i.e. on snap level) was considered up.
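For reference, a minimal sketch of what a process-interrupt check could look like at the test level, using only the juju CLI and standard Python. The unit name, process name, wait time and helper are assumptions for illustration, not the actual test suite's helpers:

    import json
    import subprocess
    import time

    UNIT = "opensearch-dashboards/0"   # hypothetical unit name
    PROCESS = "node"                   # the OSD node process named in the description
    INTERRUPT_WAIT = 120               # seconds to give update-status a chance to run

    def unit_status_message(unit: str) -> str:
        # Read the unit's workload status message from `juju status --format=json`.
        out = subprocess.check_output(["juju", "status", "--format=json"])
        status = json.loads(out)
        app = unit.split("/")[0]
        return status["applications"][app]["units"][unit]["workload-status"].get("message", "")

    def test_sigstop_is_reported():
        # Pause the node process; SIGSTOP leaves the snap service looking "active".
        subprocess.check_call(["juju", "ssh", UNIT, "sudo", "pkill", "--signal", "SIGSTOP", PROCESS])
        try:
            time.sleep(INTERRUPT_WAIT)
            # The charm is expected to surface a "Service unavailable"-style status.
            assert "unavailable" in unit_status_message(UNIT).lower()
        finally:
            # Resume the process so the deployment can recover.
            subprocess.check_call(["juju", "ssh", UNIT, "sudo", "pkill", "--signal", "SIGCONT", PROCESS])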