Rework check_univention_replication nagios check script #20

Open
wants to merge 1 commit into
base: 5.0-0


@s3lph s3lph commented Aug 31, 2021

Thank you for providing a pull request!

Please make sure you considered the following things

Link to the issue in Bugzilla

https://forge.univention.org/bugzilla/show_bug.cgi?id=53730

Description of the changes

We reworked the check_univention_replication nagios plugin to better fit our requirements in a large customer's environment. The primary motivation was that we process a lot of LDAP changes each night, and our on-call team was being woken up in the middle of the night even though everything was all right and the replication was merely taking some time. We've been using this reworked check in production for 2 months now. As discussed with Dirk Ahrnke, we're now contributing this back to Univention:

The most significant change is the changed alerting behavior:

  • This check will report CRITICAL if:
    • Replication has failed (failed.ldif exists) OR
    • The listener id is behind that of the notifier AND the listener id has not changed since the last invocation falling inside the considered timeframe (between --min-age and --max-age)
  • This check will report WARNING if:
    • The notifier id couldn't be fetched (a stopped notifier should trigger an alert on the affected host, but not on every single replica node) OR
    • No invocation history is present OR
    • The listener id is FAR behind the primary's notifier id (by the warning threshold or more), but is progressing
  • The following cases will report OK:
    • The listener is in sync with the notifier OR
    • The listener id is behind the primary's notifier id by less than the warning threshold, but is progressing
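
The rules above can be sketched as a small decision function. This is only an illustration, not the code from this pull request: the function names, the `(timestamp, listener_id)` history format, and the order in which the conditions are checked are all assumptions. In particular, the description does not say whether "in sync" or "no history" takes precedence; the sketch lets "in sync" win.

```python
from typing import List, Optional, Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def reference_listener_id(history: List[Tuple[float, int]], now: float,
                          min_age: float, max_age: float) -> Optional[int]:
    """Return the listener id of the most recent history entry whose age
    lies between min_age and max_age, or None if there is no such entry.
    The (timestamp, listener_id) tuple format is hypothetical."""
    in_window = [e for e in history if min_age <= now - e[0] <= max_age]
    if not in_window:
        return None
    return max(in_window, key=lambda e: e[0])[1]


def evaluate(failed_ldif_exists: bool, notifier_id: Optional[int],
             listener_id: int, previous_listener_id: Optional[int],
             warning_threshold: int) -> int:
    """Map the observed replication state to a Nagios status code."""
    if failed_ldif_exists:
        return CRITICAL                      # replication has failed
    if notifier_id is None:
        return WARNING                       # notifier id could not be fetched
    lag = notifier_id - listener_id
    if lag <= 0:
        return OK                            # listener is in sync
    if previous_listener_id is None:
        return WARNING                       # behind, but no usable history
    if listener_id == previous_listener_id:
        return CRITICAL                      # behind and not progressing
    if lag >= warning_threshold:
        return WARNING                       # progressing, but far behind
    return OK                                # progressing, slightly behind
```

Separating the history lookup from the decision keeps the second function free of file access, which makes each alerting case easy to check in isolation.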

In addition, we introduced the following changes:

  • Use Python 3, it's 2021 after all
  • Use Python's argparse module for somewhat human-readable argument parsing, rather than getopt
  • Add perfdata output containing the listener id, notifier id and their difference
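
For illustration, perfdata in the usual Nagios plugin format (`text | label=value;warn`) could be produced roughly as follows. The labels `listener`, `notifier` and `diff`, as well as the function name, are assumptions; the actual labels used by the script are not stated here.

```python
from typing import Optional


def format_output(status: str, message: str, listener_id: int,
                  notifier_id: int, warning: Optional[int] = None) -> str:
    """Build a one-line plugin result with perfdata after the '|' separator.
    Label names are hypothetical."""
    diff = notifier_id - listener_id
    diff_field = f"diff={diff}"
    if warning is not None:
        diff_field += f";{warning}"  # attach the warning threshold to the metric
    perfdata = f"listener={listener_id} notifier={notifier_id} {diff_field}"
    return f"{status}: {message} | {perfdata}"


print(format_output("WARNING", "listener is behind", 90, 100, warning=5))
```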

Note that this check is NOT A DROP-IN REPLACEMENT for the existing check_univention_replication. It uses different command line arguments, the history file uses a different format in a different place, and there are probably some other breaking changes:

usage: check_univention_replication [-h] [--version] [-v] [-r] [-w cnt] [-M seconds] [-m seconds] [-f file]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v, --verbose         Verbose debug output
  -r, --readonly        Do not modify the history file
  -w cnt, --warning cnt
                        WARNING if difference of transaction IDs is >= <cnt>
  -M seconds, --max-age seconds
                        Disregard and remove all history entries older than <seconds>
  -m seconds, --min-age seconds
                        Disregard all history entries younger than <seconds>
  -f file, --hist-file file, --history-file file
                        Path to the history file
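
An argparse setup that yields a usage message like the one above might look roughly like this. It is a reconstruction from the help text only: the integer types, the absence of defaults, the `history_file` destination name, and the version string are assumptions, not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="check_univention_replication")
    # The version string is a placeholder; the real one is not given here.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version unknown)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose debug output")
    parser.add_argument("-r", "--readonly", action="store_true",
                        help="Do not modify the history file")
    parser.add_argument("-w", "--warning", metavar="cnt", type=int,
                        help="WARNING if difference of transaction IDs is >= <cnt>")
    parser.add_argument("-M", "--max-age", metavar="seconds", type=int,
                        help="Disregard and remove all history entries older than <seconds>")
    parser.add_argument("-m", "--min-age", metavar="seconds", type=int,
                        help="Disregard all history entries younger than <seconds>")
    parser.add_argument("-f", "--hist-file", "--history-file", metavar="file",
                        dest="history_file", help="Path to the history file")
    return parser


args = build_parser().parse_args(["-w", "10", "-m", "60", "-M", "3600", "-r"])
print(args.warning, args.min_age, args.max_age, args.readonly)
```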

@CLAassistant

CLAassistant commented Aug 31, 2021

CLA assistant check
All committers have signed the CLA.

@spaceone
Member

Thanks, I created a bugzilla entry: https://forge.univention.org/bugzilla/show_bug.cgi?id=53730

pmhahn pushed a commit that referenced this pull request Dec 10, 2022