check-process

Go-based tooling used to monitor processes.

Project home

See our GitHub repo for the latest code, to file an issue or submit improvements for review and potential inclusion into the project.

Overview

This repo is intended to provide various tools used to monitor processes.

| Tool Name | Overall Status | Description |
| --- | --- | --- |
| check_process | Alpha | Nagios plugin used to monitor processes for problematic states. |
| lsps | Alpha | Small CLI tool to list processes with known problematic states. |

check_process

Performance Data

Initial support has been added for emitting Performance Data / Metrics, but refinement suggestions are welcome.

Consult the table below for the metrics implemented thus far.

Please add to an existing Discussion thread (if applicable) or open a new one with any feedback that you may have. Thanks in advance!

| Emitted Performance Data / Metric | Meaning |
| --- | --- |
| time | Runtime for plugin |
| problem_processes | Number of overall "problem" processes |
| running | Number of running processes |
| sleeping | Number of sleeping processes |
| uninterruptible_disk_sleep | Number of (uninterruptible) disk sleep processes |
| stopped | Number of stopped processes |
| zombie | Number of zombie processes |
| dead | Number of dead processes |
| tracing_stop | Number of tracing stop processes |
| wakekill | Number of wakekill processes |
| waking | Number of waking processes |
| idle | Number of idle processes |
| parked | Number of parked processes |

NOTE: Not all process types are available for all kernel versions. Consult the Known states section for more information.
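
The metrics above are emitted in the standard Nagios performance data format (`'label'=value[UOM];warn;crit;min;max`). The following is a minimal sketch of how such a line could be assembled; the `perfDatum` type and `formatPerfData` function are hypothetical names used for illustration and are not taken from this project's source code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// perfDatum is a hypothetical type for this sketch; it is not part of the
// project's code.
type perfDatum struct {
	Label string
	Value int
	Unit  string // unit of measurement suffix, e.g. "ms"; empty for plain counts
}

// formatPerfData renders metrics as 'label'=value[UOM];warn;crit;min;max.
// The threshold and range fields are left empty, and labels are sorted
// alphabetically, mirroring the example output shown in this README.
func formatPerfData(data []perfDatum) string {
	sorted := make([]perfDatum, len(data))
	copy(sorted, data)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Label < sorted[j].Label })

	parts := make([]string, 0, len(sorted))
	for _, d := range sorted {
		parts = append(parts, fmt.Sprintf("'%s'=%d%s;;;;", d.Label, d.Value, d.Unit))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(formatPerfData([]perfDatum{
		{Label: "zombie", Value: 0},
		{Label: "time", Value: 18, Unit: "ms"},
		{Label: "running", Value: 1},
	}))
	// prints: 'running'=1;;;; 'time'=18ms;;;; 'zombie'=0;;;;
}
```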

Features

check_process plugin

Nagios plugin (check_process) used to monitor for problematic process states on Linux distros.

NOTE: The intent is to support multiple operating systems, but as of this writing Linux is the only supported OS.

  • Optional branding "signature"

    • used to indicate what Nagios plugin (and what version) is responsible for the service check result
  • Optional, leveled logging using rs/zerolog package

    • logfmt format output (to stderr)
    • choice of disabled, panic, fatal, error, warn, info (the default), debug or trace

NOTE: This tool ignores its own process entry when reporting running processes.
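
As a rough illustration of the note above, a tool can skip its own entry by comparing each process ID against the value returned by `os.Getpid()`. This is only a sketch of the idea; `excludeSelf` is a hypothetical helper and the project's actual approach may differ.

```go
package main

import (
	"fmt"
	"os"
)

// excludeSelf returns pids without the entry matching self. It is a
// hypothetical helper for this sketch, not the project's implementation.
func excludeSelf(pids []int, self int) []int {
	kept := make([]int, 0, len(pids))
	for _, pid := range pids {
		if pid != self {
			kept = append(kept, pid)
		}
	}
	return kept
}

func main() {
	self := os.Getpid()

	// Illustrative PID list; a real tool would gather these from /proc.
	pids := []int{1, self, 4242}

	fmt.Println("evaluating:", excludeSelf(pids, self))
}
```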

lsps CLI tool

Small CLI tool to list processes with known problematic states.

  • Optional expanded or "all" listing of processes grouped by process state

    • NOTE: This may produce a LOT of output
  • Optional branding "signature"

    • used to indicate what tool (and what version) is responsible for the output
  • Optional, leveled logging using rs/zerolog package

    • logfmt format output (to stderr)
    • choice of disabled, panic, fatal, error, warn, info (the default), debug or trace

NOTE: This tool ignores its own process entry when reporting running processes.

Changelog

See the CHANGELOG.md file for the changes associated with each release of this application. Changes that have been merged to master, but not yet included in an official release, may also be noted in the file under the Unreleased section. A helpful link to the Git commit history since the last official release is also provided for further review.

Requirements

The following is a loose guideline. Other combinations of Go and operating systems for building and running tools from this repo may work, but have not been tested.

Building source code

  • Go
    • see this project's go.mod file for preferred version
    • this project tests against officially supported Go releases
      • the most recent stable release (aka, "stable")
      • the prior, but still supported release (aka, "oldstable")
  • GCC
    • if building with custom options (as the provided Makefile does)
  • make
    • if using the provided Makefile

Running

  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 8
  • Ubuntu 20.04

Installation

From source

  1. Download Go
  2. Install Go
  3. Clone the repo
    1. cd /tmp
    2. git clone https://github.com/atc0005/check-process
    3. cd check-process
  4. Install dependencies (optional)
    • for Ubuntu Linux
      • sudo apt-get install make gcc
    • for CentOS Linux
      • sudo yum install make gcc
  5. Build
    • manually, explicitly specifying target OS and architecture
      • GOOS=linux GOARCH=amd64 go build -mod=vendor ./cmd/check_process/
      • GOOS=linux GOARCH=amd64 go build -mod=vendor ./cmd/lsps/
        • most likely this is what you want (if building manually)
        • substitute amd64 with the appropriate architecture if using different hardware (e.g., arm64 or 386)
    • using Makefile linux recipe
      • make linux
        • generates x86 and x64 binaries
    • using Makefile release-build recipe
      • make release-build
        • generates the same release assets as provided by this project's releases
  6. Locate generated binaries
    • if using Makefile
      • look in /tmp/check-process/release_assets/check_process/
      • look in /tmp/check-process/release_assets/lsps/
    • if using go build
      • look in /tmp/check-process/
  7. Copy the applicable binaries to whichever systems need to run them so that they can be deployed

NOTE: Depending on which Makefile recipe you use, the generated binary may be compressed and have a .xz extension. If so, decompress the binary before deploying it (e.g., xz -d check_process-linux-amd64.xz).

Using release binaries

  1. Download the latest release binaries
  2. Decompress binaries
    • e.g., xz -d check_process-linux-amd64.xz
  3. Copy the applicable binaries to whichever systems need to run them so that they can be deployed

NOTE: DEB and RPM packages are provided as an alternative to manually deploying binaries.

Deployment

  1. Place check_process in a location where it can be executed by the monitoring agent
    • Usually the same place as other Nagios plugins
    • For example, on a default Red Hat Enterprise Linux system using check_nrpe the check_process plugin would be deployed to /usr/lib64/nagios/plugins/check_process or /usr/local/nagios/libexec/check_process
  2. Place lsps in a location where it can be easily accessed
    • Usually the same place as other custom tools installed outside of your package manager's control
    • e.g., /usr/local/bin/lsps

NOTE: DEB and RPM packages are provided as an alternative to manually deploying binaries.

Configuration

Command-line arguments

  • Use the -h or --help flag to display current usage information.
  • Flags marked as required must be set via CLI flag.
  • Flags not marked as required are for settings where a useful default is already defined, but may be overridden if desired.

check_process

| Flag | Required | Default | Repeat | Possible | Description |
| --- | --- | --- | --- | --- | --- |
| branding | No | false | No | branding | Toggles emission of branding details with plugin status details. This output is disabled by default. |
| h, help | No | false | No | h, help | Show Help text along with the list of supported flags. |
| version | No | false | No | version | Whether to display application version and then immediately exit application. |
| ll, log-level | No | info | No | disabled, panic, fatal, error, warn, info, debug, trace | Log message priority filter. Log messages with a lower level are ignored. |
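
To illustrate how flags like these behave, here is a minimal sketch using Go's standard library `flag` package. This is an assumption made for illustration: the project may use a different flag handling approach, and the `config` type and `parseArgs` function are hypothetical names. The flag names, defaults, and accepted log levels are taken from the table above.

```go
package main

import (
	"flag"
	"fmt"
)

// config is a hypothetical container for the settings in this sketch.
type config struct {
	Branding    bool
	ShowVersion bool
	LogLevel    string
}

// validLogLevels lists the log levels documented in this README.
var validLogLevels = map[string]bool{
	"disabled": true, "panic": true, "fatal": true, "error": true,
	"warn": true, "info": true, "debug": true, "trace": true,
}

// parseArgs parses CLI arguments and validates the chosen log level.
func parseArgs(args []string) (config, error) {
	var cfg config
	fs := flag.NewFlagSet("check_process", flag.ContinueOnError)
	fs.BoolVar(&cfg.Branding, "branding", false, "emit branding details")
	fs.BoolVar(&cfg.ShowVersion, "version", false, "display version and exit")
	// Both the long and short names write to the same field.
	fs.StringVar(&cfg.LogLevel, "log-level", "info", "log message priority filter")
	fs.StringVar(&cfg.LogLevel, "ll", "info", "shorthand for log-level")

	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if !validLogLevels[cfg.LogLevel] {
		return cfg, fmt.Errorf("invalid log level %q", cfg.LogLevel)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseArgs([]string{"--ll", "debug", "--branding"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```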

lsps

| Flag | Required | Default | Repeat | Possible | Description |
| --- | --- | --- | --- | --- | --- |
| branding | No | false | No | branding | Toggles emission of branding details with plugin status details. This output is disabled by default. |
| h, help | No | false | No | h, help | Show Help text along with the list of supported flags. |
| version | No | false | No | version | Whether to display application version and then immediately exit application. |
| show-all | No | false | No | show-all | Toggles listing of all processes. WARNING: This may produce a LOT of output. Disabled by default. |
| ll, log-level | No | info | No | disabled, panic, fatal, error, warn, info, debug, trace | Log message priority filter. Log messages with a lower level are ignored. |

Process states

Red Hat Enterprise Linux 6 running a 2.6.32 version kernel is the baseline test environment for this project.

The valid process states for a 2.6.32 kernel differ from the process states for a 3.10 kernel (RHEL 7), which in turn differ from those of a 4.18 (RHEL 8) and newer kernel. This project attempts to evaluate processes in all supported states. In an effort to simplify use, some assumptions are made regarding which process states map to which monitoring plugin state.
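
On Linux, the kernel reports these state strings in the `State:` field of each `/proc/<pid>/status` file. The sketch below shows one way to extract that field and tally processes per state; `parseState` and `tallyStates` are hypothetical helpers, the sample input is made up for illustration, and none of this is taken from the project's source code.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseState extracts the value of the "State:" field from the text of a
// /proc/<pid>/status file, e.g. "S (sleeping)". The second return value is
// false when no such field is present.
func parseState(status string) (string, bool) {
	scanner := bufio.NewScanner(strings.NewReader(status))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "State:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "State:")), true
		}
	}
	return "", false
}

// tallyStates counts how many status texts report each state.
func tallyStates(statuses []string) map[string]int {
	counts := make(map[string]int)
	for _, s := range statuses {
		if state, ok := parseState(s); ok {
			counts[state]++
		}
	}
	return counts
}

func main() {
	// Made-up sample contents; a real tool would read /proc/<pid>/status.
	samples := []string{
		"Name:\tsshd\nState:\tS (sleeping)\nPid:\t900\n",
		"Name:\trsync\nState:\tD (disk sleep)\nPid:\t16431\n",
		"Name:\tcron\nState:\tS (sleeping)\nPid:\t1200\n",
	}
	counts := tallyStates(samples)
	fmt.Println(counts["S (sleeping)"], counts["D (disk sleep)"]) // prints: 2 1
}
```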

Known states

The state details in this section were pulled directly from the source code for each of the upstream kernel versions for RHEL releases that this project was tested against. See the References section for additional details.

v2.6.32 (RHEL 6)

/*
 * The task state array is a strange "bitmap" of
 * reasons to sleep. Thus "running" is zero, and
 * you can test for combinations of others with
 * simple bit tests.
 */
static const char *task_state_array[] = {
  "R (running)",         /*  0 */
  "S (sleeping)",        /*  1 */
  "D (disk sleep)",      /*  2 */
  "T (stopped)",         /*  4 */
  "T (tracing stop)",    /*  8 */
  "Z (zombie)",          /* 16 */
  "X (dead)"             /* 32 */
};

v3.10 (RHEL 7)

/*
 * The task state array is a strange "bitmap" of
 * reasons to sleep. Thus "running" is zero, and
 * you can test for combinations of others with
 * simple bit tests.
 */
static const char * const task_state_array[] = {
  "R (running)",         /*   0 */
  "S (sleeping)",        /*   1 */
  "D (disk sleep)",      /*   2 */
  "T (stopped)",         /*   4 */
  "t (tracing stop)",    /*   8 */
  "Z (zombie)",          /*  16 */
  "X (dead)",            /*  32 */
  "x (dead)",            /*  64 */
  "K (wakekill)",        /* 128 */
  "W (waking)",          /* 256 */
  "P (parked)",          /* 512 */
};

v4.18 (RHEL 8)

/*
 * The task state array is a strange "bitmap" of
 * reasons to sleep. Thus "running" is zero, and
 * you can test for combinations of others with
 * simple bit tests.
 */
static const char * const task_state_array[] = {

  /* states in TASK_REPORT: */
  "R (running)",         /* 0x00 */
  "S (sleeping)",        /* 0x01 */
  "D (disk sleep)",      /* 0x02 */
  "T (stopped)",         /* 0x04 */
  "t (tracing stop)",    /* 0x08 */
  "X (dead)",            /* 0x10 */
  "Z (zombie)",          /* 0x20 */
  "P (parked)",          /* 0x40 */

  /* states beyond TASK_REPORT: */
  "I (idle)",            /* 0x80 */
};

v5.14 (RHEL 9)

/*
 * The task state array is a strange "bitmap" of
 * reasons to sleep. Thus "running" is zero, and
 * you can test for combinations of others with
 * simple bit tests.
 */
static const char * const task_state_array[] = {

  /* states in TASK_REPORT: */
  "R (running)",         /* 0x00 */
  "S (sleeping)",        /* 0x01 */
  "D (disk sleep)",      /* 0x02 */
  "T (stopped)",         /* 0x04 */
  "t (tracing stop)",    /* 0x08 */
  "X (dead)",            /* 0x10 */
  "Z (zombie)",          /* 0x20 */
  "P (parked)",          /* 0x40 */

  /* states beyond TASK_REPORT: */
  "I (idle)",            /* 0x80 */
};

Summary

// kernel 2.6.32 (RHEL 6)
  "R (running)"
  "S (sleeping)"
  "D (disk sleep)"
  "T (stopped)"
  "T (tracing stop)"
  "Z (zombie)"
  "X (dead)"

// kernel 3.10 (RHEL 7)
  "R (running)"
  "S (sleeping)"
  "D (disk sleep)"
  "T (stopped)"
  "t (tracing stop)"
  "Z (zombie)"
  "X (dead)"
  "x (dead)"
  "K (wakekill)"
  "W (waking)"
  "P (parked)"

// kernel 4.18/5.14 (RHEL 8/9)
  "R (running)"
  "S (sleeping)"
  "D (disk sleep)"
  "T (stopped)"
  "t (tracing stop)"
  "X (dead)"
  "Z (zombie)"
  "P (parked)"
  "I (idle)"

Process state to plugin state mappings

| Process State | Monitoring State |
| --- | --- |
| D (disk sleep) | CRITICAL |
| Z (zombie) | WARNING |
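
The mapping above can be sketched as a small function that turns a tally of process states into a Nagios plugin exit code (0 for OK, 1 for WARNING, 2 for CRITICAL). The `pluginState` function is a hypothetical name, and the assumption that CRITICAL takes precedence when both D and Z state processes are present is ours; the source does not state how the plugin resolves that case.

```go
package main

import "fmt"

// Standard Nagios plugin exit codes.
const (
	stateOK       = 0
	stateWarning  = 1
	stateCritical = 2
)

// pluginState returns the exit code and label for a tally of process states
// keyed by single-letter state code (e.g. "D", "Z").
// Assumption: the most severe state wins when several are present.
func pluginState(counts map[string]int) (int, string) {
	switch {
	case counts["D"] > 0:
		return stateCritical, "CRITICAL"
	case counts["Z"] > 0:
		return stateWarning, "WARNING"
	default:
		return stateOK, "OK"
	}
}

func main() {
	code, label := pluginState(map[string]int{"D": 2, "R": 7, "S": 368})
	fmt.Printf("%s (exit code %d)\n", label, code) // prints: CRITICAL (exit code 2)
}
```

A real plugin would pass the resulting code to `os.Exit` after printing its status output.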

Examples

OK result

This output is emitted by the plugin when no problematic processes are found.

$ ./check_process
OK: No problematic processes found (364 evaluated)

Process Summary:

  - R (running) [1]
  - S (sleeping) [363]


--------------------------------------------------


Problems:

  - None

 | 'dead'=0;;;; 'idle'=0;;;; 'parked'=0;;;; 'problem_processes'=0;;;; 'running'=1;;;; 'sleeping'=363;;;; 'stopped'=0;;;; 'time'=18ms;;;; 'tracing_stop'=0;;;; 'uninterruptible_disk_sleep'=0;;;; 'wakekill'=0;;;; 'waking'=0;;;; 'zombie'=0;;;;

Regarding the output:

  • The last line, beginning with a space and the | symbol, contains the performance data metrics emitted by the plugin. Depending on your monitoring system, these metrics may be collected and exposed as graphs/charts.
  • This output was captured on a Red Hat Enterprise Linux 6 system (baseline OS for testing). The output is comparable to other Linux distros.
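
For readers curious how a consumer might read the performance data line, the following is a simplified sketch that splits it into label/value pairs. The `parsePerfData` function is a hypothetical helper; it assumes labels contain no spaces (true for the metrics in this README) and is not a complete parser for the Nagios performance data format.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePerfData returns the metrics found after the | symbol as a map of
// label to raw value (including any unit suffix such as "ms").
// Assumption: labels contain no spaces, so whitespace separates metrics.
func parsePerfData(line string) map[string]string {
	metrics := make(map[string]string)
	_, perf, found := strings.Cut(line, "|")
	if !found {
		return metrics
	}
	for _, token := range strings.Fields(perf) {
		label, rest, ok := strings.Cut(token, "=")
		if !ok {
			continue
		}
		// Keep only the value; drop the warn/crit/min/max fields.
		value, _, _ := strings.Cut(rest, ";")
		metrics[strings.Trim(label, "'")] = value
	}
	return metrics
}

func main() {
	line := " | 'running'=1;;;; 'sleeping'=363;;;; 'time'=18ms;;;; 'zombie'=0;;;;"
	metrics := parsePerfData(line)
	fmt.Println(metrics["sleeping"], metrics["time"]) // prints: 363 18ms
}
```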

WARNING result

This output is emitted by the plugin when problematic processes of a WARNING state are found.

TODO: Provide example output when this scenario is encountered.

CRITICAL result

This output is emitted by the plugin when problematic processes of a CRITICAL state are found.

In the case of the rsync entries below, the activity is fairly normal for this system (daily, early AM backups). To work around this, you can either modify the timeperiod used for notifications to exclude this scenario (until D state processes are found outside of that window) or increase the number of retries so that an alert is not raised until after all retry attempts have been exceeded.

$ ./check_process
CRITICAL: 2 problematic processes found (D (disk sleep) [2], R (running) [7], S (sleeping) [368], evaluated [377])

Process Summary:

- D (disk sleep) [2]
- R (running) [7]
- S (sleeping) [368]


--------------------------------------------------


Problems:
- Name: rsync [Parent: backup.sh (6761), State: D (disk sleep), Pid: 16431, PPid: 6761, Threads: 1]
- Name: rsync [Parent: backup.sh (6761), State: D (disk sleep), Pid: 18321, PPid: 6761, Threads: 1]

 | 'dead'=0;;;; 'idle'=0;;;; 'parked'=0;;;; 'problem_processes'=0;;;; 'running'=7;;;; 'sleeping'=368;;;; 'stopped'=0;;;; 'time'=18ms;;;; 'tracing_stop'=0;;;; 'uninterruptible_disk_sleep'=2;;;; 'wakekill'=0;;;; 'waking'=0;;;; 'zombie'=0;;;;

License

See the LICENSE file for details.

References