[DPE-4307] HA process interrupt tests #114

Open
juditnovak wants to merge 13 commits into 2/edge from DPE-4307/HA_process_interrupt

Conversation

@juditnovak (Contributor) commented Sep 24, 2024

This change combines 2 tasks:

  • DPE-4307 [OpenSearch Dashboards][VM] - Testing - HA process interrupt tests
    • The HA tests cover both OSD process interrupts and OpenSearch process interrupts.
    • An issue was discovered and fixed:
      • OSD "hanging" (simulated by sending SIGSTOP to the node process) was not handled by the charm.
      • The health check is extended, and a restart mechanism is added for the case when the application doesn't respond (a rough sketch of the idea follows below the list).
  • DPE-5516 [OpenSearch Dashboards][VM][BUGFIX] Service unavailable until first update-status
    • Reason:
      • I believe the issue here was that we did not leave time for the service to establish itself, but triggered an update-status as soon as the process (i.e. at the snap level) was considered up.
    • Fix:
      • A "grace period" was added, giving the application a chance to come back up after a restart.
    • Proof of functionality:
      • See an earlier pipeline (not including the fix) with OSD showing a regular Service unavailable status.
      • Gone on current 🟢 pipelines.
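
A minimal sketch of the two mechanisms described above (an HTTP-level health check plus a restart guarded by a grace period). The names DashboardsHealth, OSD_URL, RESTART_GRACE_PERIOD and workload.restart() are hypothetical placeholders, not taken from this PR:

    import time

    import requests


    class DashboardsHealth:
        """Hypothetical helper: checks that OSD answers HTTP, not only that the snap service runs."""

        OSD_URL = "http://localhost:5601/api/status"  # assumed local OSD endpoint
        RESTART_GRACE_PERIOD = 300  # seconds to leave the app alone after a restart

        def __init__(self, workload):
            self.workload = workload  # assumed to expose a restart() method
            self.last_restart = 0.0

        def responsive(self) -> bool:
            """True if the application answers HTTP requests at all."""
            try:
                return requests.get(self.OSD_URL, timeout=5).status_code < 500
            except requests.RequestException:
                return False

        def check_and_maybe_restart(self) -> bool:
            """Restart the workload only if it is unresponsive and the grace period has passed."""
            if self.responsive():
                return True
            if time.time() - self.last_restart < self.RESTART_GRACE_PERIOD:
                # Recently restarted: give the service time to settle instead of
                # restarting it again right away (the "grace period").
                return False
            self.workload.restart()
            self.last_restart = time.time()
            return False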

@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch 2 times, most recently from f453c0a to f519b99 on September 24, 2024 23:05
@juditnovak changed the title from [DPE-4307] HA process interrupt to [DPE-4307] HA process interrupt tests on Sep 24, 2024
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch from 7c2e78f to 7cd0176 on September 24, 2024 23:57
-    return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"]) and bool(
-        self.dashboards.services[self.SNAP_EXPORTER_SERVICE]["active"]
-    )
+    return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"])
@juditnovak (Contributor, Author) commented Sep 25, 2024

We had an ugly bug here. (@phvalguima is my witness :-) )

Reminder: OSD is a service that doesn't even come up (at all!) except under perfect conditions, meaning a functional connection to OpenSearch, etc. No, we do NOT even have a Status Page. All we get is the response shown in the screenshot:
[image]

The Prometheus Exporter process is very similar: it can't come up as long as there is no functional OSD available, and otherwise it fails right after a restart.

This resulted in ugly, flaky behavior whenever OSD was considered active at the snap level while the service itself was not running (for example, due to a missing OpenSearch connection): the first part of the condition above was True, while the second part was True or False depending on the moment.

[Screenshot from 2024-09-25 13-25-42]

Debugging underlying data structures (representing snap service state):

[Screenshot from 2024-09-25 13-25-54]

The conclusion is this: we shouldn't check the Prometheus Exporter state here.

One proposal could be to do it in a separate check. (I can propose a solution in parallel on a separate branch; if we want to address the matter, it would be much nicer as a separate task.)
In my opinion, as long as we are not providing similar safety measures on OpenSearch (where monitoring is way more crucial), it may be best to keep complexity low and trust automatic snap restarts to cover Prometheus Exporter failures.
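
As a rough illustration of the split discussed here: the service names below are reused from the snippet above, while the class wrapper, the exporter_alive property and the exact service strings are assumptions, not the code merged in this PR:

    class Workload:
        SNAP_APP_SERVICE = "opensearch-dashboards"  # assumed snap service name
        SNAP_EXPORTER_SERVICE = "opensearch-dashboards-exporter"  # assumed snap service name

        def __init__(self, dashboards):
            self.dashboards = dashboards  # snap client exposing a .services mapping

        @property
        def alive(self) -> bool:
            """Only the main OSD snap service decides whether the unit is alive."""
            return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"])

        @property
        def exporter_alive(self) -> bool:
            """Exporter state kept out of alive(); snap auto-restart covers its failures."""
            return bool(self.dashboards.services[self.SNAP_EXPORTER_SERVICE]["active"])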

Contributor:

@juditnovak yes, I think we need to separate the concepts: alive() here means the main service is up and running.

@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch 5 times, most recently from 20f8758 to b597fb7 on September 26, 2024 09:58
@juditnovak marked this pull request as ready for review September 26, 2024 16:04
@phvalguima self-requested a review October 11, 2024 07:47
@Mehdi-Bendriss (Contributor) left a comment:

Thanks Judit - I have a minor comment regarding the check on the status as part of the charm logic.

src/charm.py (review thread: outdated, resolved)
@phvalguima (Contributor) left a comment:

There is a potential loop here:

The reconcile method may end up calling for a restart.

And in _restart, we end up calling reconcile again before the actual restart.

A charm leader in a deployment with no lock assigned would end up in a continuous loop where the charm calls:
reconcile() > detects the unit is down > calls acquire_lock() > which emits a relation_changed > assigns itself the lock > calls _restart > re-calls reconcile and restarts the loop

My recommendation is to restart the workload right away in the _restart method.
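
A minimal sketch of that recommendation, with placeholder names (self.workload, the _restart signature); this is not the charm's actual code:

    def _restart(self, event) -> None:
        """Restart the workload directly once this unit holds the restart lock."""
        # Calling reconcile() here would detect the unit as down, re-acquire the lock
        # and re-emit _restart, closing the loop described above, so restart right away.
        self.workload.restart()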

src/charm.py (review thread: outdated, resolved)
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch 2 times, most recently from 6b3e1ca to 83de696 on October 21, 2024 00:55
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch from 83de696 to 9dc892b on October 21, 2024 00:59
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch from cedd2a6 to 1322a51 on October 22, 2024 00:09
@juditnovak (Contributor, Author):

@phvalguima You are right. Restart attempt on service down is removed from reconcile(). The check is not needed.

clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
self.on.update_status.emit()
Contributor:

Avoid a loop here, as described in #114 (review).

There is no point in speeding up an update-status; it will happen anyway. In this case, it is better to leave some time between this _restart call and the next update-status so the system can settle.
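
One possible reading of this comment, applied to the hunk above (names reused from that snippet; this is a sketch, not the merged code), is to keep the status cleanup but drop the immediate emit:

    clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
    # No self.on.update_status.emit() here: the regular update-status interval will
    # re-evaluate health later, after the restarted service has had time to settle.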

Contributor:

Looking at this code again, it may be worth structuring it instead as follows:

        start_time = time.time()
        unit_healthy, _ = self.health_manager.unit_healthy()
        while not unit_healthy and time.time() - start_time < SERVICE_AVAILABLE_TIMEOUT:
            time.sleep(5)
            unit_healthy, _ = self.health_manager.unit_healthy()
        if unit_healthy:
            # We clean the status and finish
            clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
        else:
            # this call will potentially retrigger a restart
            self.on.update_status.emit()

Contributor:

FYI, the proposal above reintroduces the risk of a continuous restart loop if the unit is failing because of some other problem...
