[DPE-4307] HA process interrupt tests #114

Open
juditnovak wants to merge 13 commits into 2/edge from DPE-4307/HA_process_interrupt

Conversation

@juditnovak (Contributor) commented Sep 24, 2024

This change combines 2 tasks:

  • DPE-4307 [OpenSearch Dashboards][VM] - Testing - HA process interrupt tests
    • The HA tests cover both OSD process interrupts and OpenSearch process interrupts.
    • An issue was discovered and fixed:
      • OSD "hanging" (simulated by sending SIGSTOP to the node process) was not handled by the charm.
      • The health check is extended, and a restart mechanism is added for the case when the application doesn't respond (a rough sketch of the idea follows below the list).
  • DPE-5516 [OpenSearch Dashboards][VM][BUGFIX] Service unavailable until first update-status
    • Reason:
      • I believe the issue here was that we did not leave time for the service to establish itself, but triggered an update-status as soon as the process (i.e. at the snap level) was considered up.
    • Fix:
      • A "grace period" was added, giving the application a chance to come back up after a restart.
    • Proof of functionality:
      • See an earlier pipeline (not including the fix) with OSD showing a regular Service unavailable status.
      • Gone on current 🟢 pipelines.
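
A minimal sketch of the two mechanisms described above (an HTTP-level health check plus a restart guarded by a grace period). The names DashboardsHealth, OSD_URL, RESTART_GRACE_PERIOD and workload.restart() are hypothetical placeholders, not taken from this PR:

    import time

    import requests


    class DashboardsHealth:
        """Hypothetical helper: checks that OSD answers HTTP, not only that the snap service runs."""

        OSD_URL = "http://localhost:5601/api/status"  # assumed local OSD endpoint
        RESTART_GRACE_PERIOD = 300  # seconds to leave the app alone after a restart

        def __init__(self, workload):
            self.workload = workload  # assumed to expose a restart() method
            self.last_restart = 0.0

        def responsive(self) -> bool:
            """True if the application answers HTTP requests at all."""
            try:
                return requests.get(self.OSD_URL, timeout=5).status_code < 500
            except requests.RequestException:
                return False

        def check_and_maybe_restart(self) -> bool:
            """Restart the workload only if it is unresponsive and the grace period has passed."""
            if self.responsive():
                return True
            if time.time() - self.last_restart < self.RESTART_GRACE_PERIOD:
                # Recently restarted: give the service time to settle instead of
                # restarting it again right away (the "grace period").
                return False
            self.workload.restart()
            self.last_restart = time.time()
            return False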

@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch 2 times, most recently from f453c0a to f519b99 on September 24, 2024 23:05
@juditnovak changed the title from [DPE-4307] HA process interrupt to [DPE-4307] HA process interrupt tests on Sep 24, 2024
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch from 7c2e78f to 7cd0176 on September 24, 2024 23:57
-    return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"]) and bool(
-        self.dashboards.services[self.SNAP_EXPORTER_SERVICE]["active"]
-    )
+    return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"])
@juditnovak (Contributor, Author) commented Sep 25, 2024

We had an ugly bug here. (@phvalguima is my witness :-) )

Reminder: OSD is a service that doesn't even come up (at all!) except under perfect conditions, meaning a functional connection to OpenSearch, etc. No, we do NOT even have a Status Page. All we get is the response shown in the screenshot:
[image]

The Prometheus Exporter process is very similar: it can't come up as long as there is no functional OSD available, and otherwise it fails right after a restart.

This resulted in ugly, flaky behavior whenever OSD was considered active at the snap level while the service itself was not running (for example, due to a missing OpenSearch connection): the first part of the condition above was True, while the second part was True or False depending on the moment.

[Screenshot from 2024-09-25 13-25-42]

Debugging underlying data structures (representing snap service state):

[Screenshot from 2024-09-25 13-25-54]

The conclusion is this: we shouldn't check the Prometheus Exporter state here.

One proposal could be to do it in a separate check. (I can propose a solution in parallel on a separate branch; if we want to address the matter, it would be much nicer as a separate task.)
In my opinion, as long as we are not providing similar safety measures on OpenSearch (where monitoring is way more crucial), it may be best to keep complexity low and trust automatic snap restarts to cover Prometheus Exporter failures.
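
As a rough illustration of the split discussed here: the service names below are reused from the snippet above, while the class wrapper, the exporter_alive property and the exact service strings are assumptions, not the code merged in this PR:

    class Workload:
        SNAP_APP_SERVICE = "opensearch-dashboards"  # assumed snap service name
        SNAP_EXPORTER_SERVICE = "opensearch-dashboards-exporter"  # assumed snap service name

        def __init__(self, dashboards):
            self.dashboards = dashboards  # snap client exposing a .services mapping

        @property
        def alive(self) -> bool:
            """Only the main OSD snap service decides whether the unit is alive."""
            return bool(self.dashboards.services[self.SNAP_APP_SERVICE]["active"])

        @property
        def exporter_alive(self) -> bool:
            """Exporter state kept out of alive(); snap auto-restart covers its failures."""
            return bool(self.dashboards.services[self.SNAP_EXPORTER_SERVICE]["active"])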

Contributor:

@juditnovak yes, I think we need to separate the concepts: alive() here means the main service is up and running.

@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch 5 times, most recently from 20f8758 to b597fb7 on September 26, 2024 09:58
@juditnovak marked this pull request as ready for review September 26, 2024 16:04
@phvalguima self-requested a review October 11, 2024 07:47
@Mehdi-Bendriss (Contributor) left a comment:

Thanks Judit - I have a minor comment regarding the check on the status as part of the charm logic.

src/charm.py (review thread: outdated, resolved)
@phvalguima (Contributor) left a comment:

There is a potential loop here:

The reconcile method may end up calling for a restart.

And in _restart, we end up calling reconcile again before the actual restart.

A charm leader in a deployment with no lock assigned would end up in a continuous loop where the charm calls:
reconcile() > detects the unit is down > calls acquire_lock() > which emits a relation_changed > assigns itself the lock > calls _restart > re-calls reconcile and restarts the loop

My recommendation is to restart the workload right away in the _restart method.
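
A minimal sketch of that recommendation, with placeholder names (self.workload, the _restart signature); this is not the charm's actual code:

    def _restart(self, event) -> None:
        """Restart the workload directly once this unit holds the restart lock."""
        # Calling reconcile() here would detect the unit as down, re-acquire the lock
        # and re-emit _restart, closing the loop described above, so restart right away.
        self.workload.restart()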

src/charm.py (review thread: outdated, resolved)
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch 2 times, most recently from 6b3e1ca to 83de696 on October 21, 2024 00:55
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch from 83de696 to 9dc892b on October 21, 2024 00:59
@juditnovak force-pushed the DPE-4307/HA_process_interrupt branch from cedd2a6 to 1322a51 on October 22, 2024 00:09
@juditnovak (Contributor, Author):

@phvalguima You are right. Restart attempt on service down is removed from reconcile(). The check is not needed.

clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
self.on.update_status.emit()
Contributor:

Avoid a loop here, as described in #114 (review).

There is no point in speeding up an update-status; it will happen anyway. In this case, it is better to leave some time between this _restart call and the next update-status so the system can settle.
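
One possible reading of this comment, applied to the hunk above (names reused from that snippet; this is a sketch, not the merged code), is to keep the status cleanup but drop the immediate emit:

    clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
    # No self.on.update_status.emit() here: the regular update-status interval will
    # re-evaluate health later, after the restarted service has had time to settle.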

Contributor:

Looking at this code again, it may be worth structuring it instead as follows:

        start_time = time.time()
        unit_healthy, _ = self.health_manager.unit_healthy()
        while not unit_healthy and time.time() - start_time < SERVICE_AVAILABLE_TIMEOUT:
            time.sleep(5)
            unit_healthy, _ = self.health_manager.unit_healthy()
        if unit_healthy:
            # We clean the status and finish
            clear_status(self.unit, [MSG_STARTING, MSG_STARTING_SERVER])
        else:
            # this call will potentially retrigger a restart
            self.on.update_status.emit()

Contributor:

FYI, the proposal above reintroduces the risk of a continuous restart loop if the unit is failing because of some other problem...
