Harden against monitor db failures #1545

Merged
merged 2 commits into from May 6, 2024

Conversation

byucesoy
Member

@byucesoy byucesoy commented May 3, 2024

Do not crash if the monitor database is not available
While the monitoring database is important, it isn't critical for the entire system
to function. Even the monitor itself can continue to work in degraded mode. So,
if the monitoring database is not available, we should not crash the entire
system. This commit catches Sequel::DatabaseConnectionError that can be raised
while trying to connect to the monitoring database and logs an error message.
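
A minimal sketch of the rescue-and-log pattern described above, not the code from this PR. `with_monitor_db` and its `logger:` parameter are hypothetical names, and the `Sequel` stub exists only so the snippet runs without the gem installed; in a real application `Sequel::DatabaseConnectionError` comes from `require "sequel"`.

```ruby
require "logger"

# Stand-in so the sketch runs without the sequel gem (assumption for
# illustration only); the real class is provided by `require "sequel"`.
unless defined?(Sequel)
  module Sequel
    class DatabaseConnectionError < StandardError; end
  end
end

# Hypothetical helper: runs a block that touches the monitoring database.
# Returns the block's result, or nil after logging if the connection fails,
# so the caller keeps running in degraded mode instead of crashing.
def with_monitor_db(logger: Logger.new($stderr))
  yield
rescue Sequel::DatabaseConnectionError => e
  logger.error("monitoring database unavailable: #{e.message}")
  nil
end
```

Only the connection error is rescued here, so unrelated bugs inside the block still surface as exceptions.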

Ignore errors while trying to save last_known_lsn
Even if the last_known_lsn cannot be saved (potentially due to unavailability of
the monitoring database), the monitor should still be able to record pulses.
Otherwise, pulse checking would stop for all PostgreSQL databases when the
monitoring database is down. This commit ensures that we properly handle the
exceptions that can be raised when trying to save the last_known_lsn.

Of course, we shouldn't perform failovers if the last_known_lsn is unknown.
That is still the case, because the last_known_lsn is only used at the
time of failover to determine the failover target.
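
The idea can be sketched roughly as follows: record the pulse first, then treat persisting the last_known_lsn as best-effort. This is an illustrative sketch, not the PR's implementation; `PulseRecorder`, `lsn_store`, and `save` are hypothetical names, and the LSN is represented as a plain integer for simplicity.

```ruby
# Hypothetical monitor step: the pulse is always recorded, while a failure
# to persist last_known_lsn is logged and counted instead of propagating.
class PulseRecorder
  attr_reader :pulses, :save_errors

  # lsn_store is a hypothetical object responding to #save(server_id, lsn).
  def initialize(lsn_store)
    @lsn_store = lsn_store
    @pulses = []
    @save_errors = 0
  end

  def check(server_id, lsn)
    # Record the pulse before touching the monitoring database.
    @pulses << [server_id, :up]
    begin
      @lsn_store.save(server_id, lsn)
    rescue StandardError => e
      # Best-effort: a failed save must not stop pulse checking.
      @save_errors += 1
      warn "could not save last_known_lsn for #{server_id}: #{e.message}"
    end
  end
end
```

Because the rescue wraps only the save call, an outage of the monitoring database leaves pulse recording for every server unaffected.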

@byucesoy byucesoy requested a review from a team May 3, 2024 19:48
@byucesoy byucesoy changed the title from "Harden against monitor db failure" to "Harden against monitor db failures" May 3, 2024
@byucesoy byucesoy self-assigned this May 3, 2024
@byucesoy byucesoy merged commit 8e080ab into main May 6, 2024
6 checks passed
@byucesoy byucesoy deleted the harden-against-monitor-db-failure branch May 6, 2024 11:07
@github-actions github-actions bot locked and limited conversation to collaborators May 6, 2024