digoal
2013-06-05
PostgreSQL , 监控 , nagios , check_postgres
前面介绍了nagios的安装和配置, 本文要进入主题了, 介绍一下PostgreSQL的监控.
这里要用到的是Bucardo提供的PostgreSQL监控脚本, check_postgres.pl.
它兼容nagios, cacti, mrtg等监控平台软件. 使用perl语言编写, 支持监控的项目较多.
check_postgres依赖以下perl模块 :
Access to a working version of psql, and the following very standard Perl modules:
Cwd
Getopt::Long
File::Basename
File::Temp
Time::HiRes (if $opt{showtime} is set to true, which is the default)
The settings_checksum action requires the Digest::MD5 module.
The checkpoint action requires the Date::Parse module.
Some actions require access to external programs. If psql is not explicitly specified, the command which is used to find it. The program /bin/df is needed by the disk_space action.
1. 安装perl
yum install -y perl
2. 查看依赖的perl模块是否已安装 :
pg92@db-172-16-3-39-> perl -MCwd -e 1
pg92@db-172-16-3-39-> echo $?
0
pg92@db-172-16-3-39-> perl -MGetopt::Long -e 1
pg92@db-172-16-3-39-> perl -MFile::Basename -e 1
pg92@db-172-16-3-39-> perl -MFile::Temp -e 1
pg92@db-172-16-3-39-> perl -MTime::HiRes -e 1
pg92@db-172-16-3-39-> perl -MDigest::MD5 -e 1
pg92@db-172-16-3-39-> perl -MDate::Parse -e 1
Can't locate Date/Parse.pm in @INC (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .).
BEGIN failed--compilation aborted.
3. 安装依赖的perl模块
从以上检测可以看出Date::Parse未安装.
方法1,使用cpan安装, 需要联网.
[root@db-172-16-3-39 nrpe-2.14]# cpan Date::Parse
方法2,使用源码安装略
4. 下载check_postgres
wget http://bucardo.org/downloads/check_postgres.tar.gz
解压
tar -zxvf check_postgres.tar.gz
5. check_postgres用法
5.1 命令行参数
输出格式参数, check_postgres支持nagios, mrtg, simple(cacti)格式, 本例选择nagios格式.
--output=nagios
连接数据库的参数 :
-H NAME or --host=NAME
-p PORT or --port=PORT
-db NAME or --dbname=NAME
-u USERNAME or --dbuser=USERNAME
--dbpass=PASSWORD
--dbservice=NAME
使用--dbservice时需要配置~/.pg_service.conf, 最好不要使用密码, 配置本地的pg_hba.conf 127.0.0.1为trust即可.
如果一定要配置密码的话, 那最好配置在~/.pgpass密码文件中, 并且使用400权限.
psql路径参数 :
--PGBINDIR
使用PGBINDIR前需要修改check_postgres.pl脚本中的 our $NO_PSQL_OPTION = 0;
创建符号链接的参数, 可以省去每次输入--action, 如果要以nagios为默认输出格式, 那么路径名中需要包含nagios.
[root@db-172-16-3-33 check_postgres-2.20.0]# mkdir /opt/nagios
[root@db-172-16-3-33 check_postgres-2.20.0]# cd /opt/nagios
[root@db-172-16-3-33 nagios]# ll
total 0
[root@db-172-16-3-33 nagios]# /opt/soft_bak/check_postgres-2.20.0/check_postgres.pl --symlinks
Created "check_postgres_archive_ready"
Created "check_postgres_autovac_freeze"
Created "check_postgres_backends"
Created "check_postgres_bloat"
Created "check_postgres_checkpoint"
Created "check_postgres_cluster_id"
Created "check_postgres_commitratio"
Created "check_postgres_connection"
Created "check_postgres_custom_query"
Created "check_postgres_database_size"
Created "check_postgres_dbstats"
Created "check_postgres_disabled_triggers"
Created "check_postgres_disk_space"
Created "check_postgres_fsm_pages"
Created "check_postgres_fsm_relations"
Created "check_postgres_hitratio"
Created "check_postgres_hot_standby_delay"
Created "check_postgres_index_size"
Created "check_postgres_last_analyze"
Created "check_postgres_last_autoanalyze"
Created "check_postgres_last_autovacuum"
Created "check_postgres_last_vacuum"
Created "check_postgres_listener"
Created "check_postgres_locks"
Created "check_postgres_logfile"
Created "check_postgres_new_version_bc"
Created "check_postgres_new_version_box"
Created "check_postgres_new_version_cp"
Created "check_postgres_new_version_pg"
Created "check_postgres_new_version_tnm"
Created "check_postgres_pgagent_jobs"
Created "check_postgres_pgb_pool_cl_active"
Created "check_postgres_pgb_pool_cl_waiting"
Created "check_postgres_pgb_pool_maxwait"
Created "check_postgres_pgb_pool_sv_active"
Created "check_postgres_pgb_pool_sv_idle"
Created "check_postgres_pgb_pool_sv_login"
Created "check_postgres_pgb_pool_sv_tested"
Created "check_postgres_pgb_pool_sv_used"
Created "check_postgres_pgbouncer_backends"
Created "check_postgres_pgbouncer_checksum"
Created "check_postgres_prepared_txns"
Created "check_postgres_query_runtime"
Created "check_postgres_query_time"
Created "check_postgres_relation_size"
Created "check_postgres_replicate_row"
Created "check_postgres_same_schema"
Created "check_postgres_sequence"
Created "check_postgres_settings_checksum"
Created "check_postgres_slony_status"
Created "check_postgres_table_size"
Created "check_postgres_timesync"
Created "check_postgres_txn_idle"
Created "check_postgres_txn_time"
Created "check_postgres_txn_wraparound"
Created "check_postgres_version"
Created "check_postgres_wal_files"
5.2 环境变量和相关文件 :
PATH : check_postgres依赖psql命令, 因此需要将$PGHOME/bin配置到PATH. 除此之外也可以使用--PGBINDIR参数指定psql的bin目录.
~/.pg_service.conf : 这个文件用于--dbservice参数.
~/.pg_service.conf格式举例 :
[root@db-172-16-3-33 nagios]# cat ~/.pg_service.conf
[mydb1]
hostaddr=127.0.0.1
port=1999
user=postgres
dbname=postgres
password=postgres
[mydb2]
hostaddr=172.16.3.39
port=1919
user=postgres
dbname=postgres
password=postgres
设置了密码, 最好设置一下权限
chmod 400 ~/.pg_service.conf
5.3 阈值参数
--warning=VAL or -w VAL
Sets the threshold at which a warning alert is fired. The valid options for this option depends on the action used.
--critical=VAL or -c VAL
Sets the threshold at which a critical alert is fired. The valid options for this option depends on the action used.
-t VAL or --timeout=VAL
Sets the timeout in seconds after which the script will abort whatever it is doing and return an UNKNOWN status. The timeout is per Postgres cluster, not for the entire script. The default value is 10; the units are always in seconds.
5.4 check_postgres的返回值对应nagios状态 :
0 : OK
1 : WARNING
2 : CRITICAL
3 : UNKNOWN
6. 使用举例
6.1 将check_postgres目录移动到合适位置, 赋予postgres用户可执行权限. 本例将使用postgres数据库启动用户调用这些命令.
[root@db-172-16-3-39 ~]# cd /opt
[root@db-172-16-3-39 opt]# mv /opt/soft_bak/check_postgres-2.20.0 /opt/
[root@db-172-16-3-39 opt]# chown -R pg92:pg92 /opt/check_postgres-2.20.0
[root@db-172-16-3-39 opt]# chmod 755 /opt/check_postgres-2.20.0/check_postgres.pl
6.2 配置pg92用户.bash_profile, 确保psql在路径中.
[root@db-172-16-3-39 opt]# su - pg92
pg92@db-172-16-3-39-> cat .bash_profile
export PGHOME=/opt/pgsql92
export PATH=$PGHOME/bin:$PATH:.
6.3 配置数据库pg_hba.conf, 允许本地trust连接.
pg92@db-172-16-3-39-> vi $PGDATA/pg_hba.conf
host all all 127.0.0.1/32 trust
重载
pg92@db-172-16-3-39-> pg_ctl reload -D $PGDATA
server signaled
6.4 监控项详解, 抽取部分check_postgres常用的监控项详细讲解.
6.4.1 监控归档 :
Checks how many WAL files with extension .ready exist in the pg_xlog/archive_status directory, which is found off of your data_directory.
This action must be run as a superuser, in order to access the contents of the pg_xlog/archive_status directory.
The --warning and --critical options are simply the number of .ready files in the pg_xlog/archive_status directory.
Usually, these values should be low, turning on the archive mechanism, we usually want it to archive WAL files as fast as possible.
If the archive command fail, number of WAL in your pg_xlog directory will grow until exhausting all the disk space and force PostgreSQL to stop immediately.
需要超级用户, 因为需要查看归档状态目录, ready结尾文件越少越好, 太大说明归档速度太慢. 测试监控如下, 10个警告, 20个严重. 如果这个值过大, DBA要考虑提高一下归档目录的写性能了.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=archive_ready -w 10 -c 20
POSTGRES_ARCHIVE_READY OK: DB "postgres" (host:127.0.0.1) (port=1919) WAL ".ready" files found: 0 | time=0.01s files=0;10;20
6.4.2 freeze监控 :
The --warning and --critical options should be expressed as percentages.
The 'age' of the transactions in each database is compared to the autovacuum_freeze_max_age setting (200 million by default) to generate a rounded percentage.
The default values are 90% for the warning and 95% for the critical.
Databases can be filtered by use of the --include and --exclude options.
因为当表的age达到一定值后, 数据库必须对这个表执行vacuum freeze操作, 确保数据不会因为mvcc机制而"消失", 不是真正的消失.
例如 :
postgres=# select age(datfrozenxid),datname from pg_database;
age | datname
----------+-----------
41050570 | template1
41050570 | template0
41050570 | postgres
41050570 | digoal_01
41050570 | digoal_02
41050570 | digoal
(6 rows)
postgres=# show autovacuum_freeze_max_age;
autovacuum_freeze_max_age
---------------------------
200000000
(1 row)
postgres=# select 41050570/200000000.0;
?column?
------------------------
0.20525285000000000000
(1 row)
使用方法, 可以过滤一些不想检查的数据库. 如果告警了, 说明数据库快要被freeze了, 这个会带来巨大的io开销, 建议在自动触发前安排一下在数据库空闲时段手工执行vacuumdb --freeze操作. 或者找出age最大的表(age(pg_class.relfrozenxid))手工执行vacuum freeze. 总之尽量不要让数据库自动触发, 因为自动触发的时间无法控制, 可能发生在业务最繁忙的时候.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=autovac_freeze -w 90% -c 95% --exclude=template1,template0,postgres
POSTGRES_AUTOVAC_FREEZE OK: DB "postgres" (host:127.0.0.1) (port=1919) digoal=21%;90;95 digoal_01=21%;90;95 digoal_02=21%;90;95 | time=0.01s digoal=21%;90;95 digoal_01=21%;90;95 digoal_02=21%;90;95
6.4.3 连接数监控
阈值可以配置正数, 负数, 百分比.
正数表示已使用连接数
负数表示剩余可用连接数, 需要用到max_connections参数.
百分比表示已使用连接数占max_connections的百分比.
正数
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=backends -w=120 -c=150
POSTGRES_BACKENDS OK: DB "postgres" (host:127.0.0.1) (port=1919) 1 of 1000 connections (1%) | time=0.01s digoal=0;120;150;0;1000 digoal_01=0;120;150;0;1000 digoal_02=0;120;150;0;1000 postgres=1;120;150;0;1000 template0=0;120;150;0;1000 template1=0;120;150;0;1000
负数
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=backends -w=-120 -c=-100
POSTGRES_BACKENDS OK: DB "postgres" (host:127.0.0.1) (port=1919) 1 of 1000 connections (1%) | time=0.01s digoal=0;880;900;0;1000 digoal_01=0;880;900;0;1000 digoal_02=0;880;900;0;1000 postgres=1;880;900;0;1000 template0=0;880;900;0;1000 template1=0;880;900;0;1000
百分比
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=backends -w=80% -c=90%
POSTGRES_BACKENDS OK: DB "postgres" (host:127.0.0.1) (port=1919) 1 of 1000 connections (1%) | time=0.01s digoal=0;800;900;0;1000 digoal_01=0;800;900;0;1000 digoal_02=0;800;900;0;1000 postgres=1;800;900;0;1000 template0=0;800;900;0;1000 template1=0;800;900;0;1000
6.4.4 表和索引的数据膨胀监控
Checks the amount of bloat in tables and indexes.
(Bloat is generally the amount of dead unused space taken up in a table or index. This space is usually reclaimed by use of the VACUUM command.)
This action requires that stats collection be enabled on the target databases, and requires that ANALYZE is run frequently.
The --include and --exclude options can be used to filter out which tables to look at.
阈值可以使用数字或者百分比.
数字表示膨胀多少容量, 单位可以是bytes, kilobytes, megabytes, gigabytes, terabytes, exabytes, petabytes, and zettabytes. 可以使用两位缩写.
百分比表示膨胀的百分比.
--perflimit这个参数用来限制输出的对象个数. 如果膨胀超过阈值的比较多, 可以使用这个参数限制一下输出.
注意膨胀系数和以下参数有关 :
autovacuum_vacuum_scale_factor = 0.2, 这样的话基本上DML频繁的表都会膨胀20%左右. 所以告警的阈值最好设置在20%以上.
举例 :
百分比
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=bloat -w=30% -c=40% --perflimit=5
POSTGRES_BLOAT CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) (db postgres) table public.test rows:1000000 pages:4425 shouldbe:4406 (1.0X) wasted size:155648 (152 kB) * (db postgres) index idx_test_id rows:? pages:2745 shouldbe:1958 (1.4X) wasted bytes:6447104 (6 MB) * (db postgres) table public.batch rows:9163560 pages:76363 shouldbe:76251 (1.0X) wasted size:917504 (896 kB) | idx_test_id=6447104B public.batch=917504B public.test=155648B pg_catalog.pg_depend_depender_index=0B pg_toast.pg_toast_2618=0B
百分比和容量
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=bloat -w="30% and 100M" -c="40% and 200M" --perflimit=5
POSTGRES_BLOAT OK: DB "postgres" (host:127.0.0.1) (port=1919) (db postgres) index idx_test_id rows:? pages:2745 shouldbe:1958 (1.4X) wasted bytes:6447104 (6 MB) | idx_test_id=6447104B public.batch=917504B public.test=155648B pg_catalog.pg_depend_depender_index=0B pg_toast.pg_toast_2618=0B
百分比或容量
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=bloat -w="30% or 100M" -c="40% or 200M" --perflimit=5
POSTGRES_BLOAT CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) (db postgres) table public.test rows:1000000 pages:4425 shouldbe:4406 (1.0X) wasted size:155648 (152 kB) * (db postgres) index idx_test_id rows:? pages:2745 shouldbe:1958 (1.4X) wasted bytes:6447104 (6 MB) * (db postgres) table public.batch rows:9163560 pages:76363 shouldbe:76251 (1.0X) wasted size:917504 (896 kB) | idx_test_id=6447104B public.batch=917504B public.test=155648B pg_catalog.pg_depend_depender_index=0B pg_toast.pg_toast_2618=0B
如果发现膨胀太大, 可以使用vacuum full 回收空间. 百分比不准, 建议使用容量.
如果要监控多个数据库需要在参数中指定, 例如 :
--db postgres,digoal_01,digoal_02
6.4.5 checkpoint监控
检查最后一次checkpoint到现在已经过去多长时间了, 超过阈值则告警. 但是需要注意LC_TIME变量需要与数据库一致.
checkpoint检测必须在本机进行, 不需要连接参数, 但是需要提供$PGDATA环境变量或目录, 同时pg_controldata要在路径中.
pg92@db-172-16-3-39-> export LC_TIME=C
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=$PGDATA --action=checkpoint -w 10s -c 200000s
POSTGRES_CHECKPOINT WARNING: Last checkpoint was 1420 seconds ago | age=1420;10;200000
或者写$PGDATA路径
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=/pgdata1919 --action=checkpoint -w 10s -c 200000s
POSTGRES_CHECKPOINT WARNING: Last checkpoint was 1531 seconds ago | age=1531;10;200000
把LC_TIME写到环境变量中.
pg92@db-172-16-3-39-> vi ~/.bash_profile
export LC_TIME=C
6.4.6 检查数据库集群的ID是否发生变化
类似checkpoint, 只能用于本地监控. 首先使用-c 0 找出ID, 当然使用pg_controldata也可以找到这个id.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=/pgdata1919 --action=cluster_id -c 0
POSTGRES_CLUSTER_ID UNKNOWN: cluster_id: 5742948541169127399
然后正式监控使用这个id,
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=/pgdata1919 --action=cluster_id -c 5742948541169127399
POSTGRES_CLUSTER_ID OK: cluster_id: 5742948541169127399
如果ID发生变化, 告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=/pgdata1919 --action=cluster_id -c 5742948541169127398
POSTGRES_CLUSTER_ID CRITICAL: cluster_id: 5742948541169127399
6.4.7 监控数据库事务提交的百分比.
反过来想, 其实是监控回滚的百分比.
postgres=# select * from pg_stat_database
-[ RECORD 3 ]--+------------------------------
datid | 12788
datname | postgres
numbackends | 1
xact_commit | 1498
xact_rollback | 1
blks_read | 347126
blks_hit | 261131
tup_returned | 40780490
tup_fetched | 94728
tup_inserted | 5223
tup_updated | 1512
tup_deleted | 4664
conflicts | 0
temp_files | 0
temp_bytes | 0
deadlocks | 0
blk_read_time | 0
blk_write_time | 0
stats_reset | 2013-06-04 08:07:33.154495+08
使用阈值百分比
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=commitratio -w 90% -c 80%
POSTGRES_COMMITRATIO OK: DB "postgres" (host:127.0.0.1) (port=1919) postgres: 99.93 | time=0.02s postgres=99.93;90;80
6.4.8 数据库连接状态监控
执行SELECT version(), 如果正常则返回0. 异常则为psql的返回.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=connection
POSTGRES_CONNECTION OK: DB "postgres" (host:127.0.0.1) (port=1919) version 9.2beta1 | time=0.01s
不能连接的情况 :
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1918 -db postgres -u postgres --action=connection
ERROR: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 1918?
pg92@db-172-16-3-39-> echo $?
3
6.4.9 定制化SQL监控
SQL语句必须至少返回一个名为result的列值, 或者额外再返回一个任意列值. 也就是最多返回2个列.
result列作为返回值的阈值比较列.
返回值类型可以选择如下, 共计4种返回类型, 使用参数--valtype指定返回类型 :
integer: Does a simple integer comparison. The first column should be a simple integer, and the warning and critical values should be the same.
string: The warning and critical are strings, and are triggered only if the value in the first column matches it exactly. This is case-sensitive.
time: The warning and the critical are times, and can have units of seconds, minutes, hours, or days. Each may be written singular or abbreviated to just the first letter. If no units are given, seconds are assumed. The first column should be an integer representing the number of seconds to check.
size: The warning and the critical are sizes, and can have units of bytes, kilobytes, megabytes, gigabytes, terabytes, or exabytes. Each may be abbreviated to the first letter. If no units are given, bytes are assumed. The first column should be an integer representing the number of bytes to check.
阈值比较可以是大于或者小于, 默认是大于. 使用 --reverse 参数则是小于.
举例 :
1. 如果test这个对象的占用空间大于10个页面则严重警告.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=custom_query --valtype=string -c "test" --query="select relname AS result, relpages AS pages from pg_class where relpages>10"
POSTGRES_CUSTOM_QUERY CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) test | time=0.01s pages=4425;;test
其他 :
Example 2: Give a critical if the "foobar" function returns a number over 5MB:
check_postgres_custom_query --critical='5MB'--valtype=size --query="SELECT foobar() AS result"
Example 2: Warn if the function "snazzo" returns less than 42:
check_postgres_custom_query --critical=42 --query="SELECT snazzo() AS result" --reverse
6.4.10 数据库容量监控
可以使用include或者exclude. 单个数据库的容量超过阈值则告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=database_size -w 10M -c 20M
POSTGRES_DATABASE_SIZE CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) postgres: 770831480 (735 MB) digoal: 582435332 (555 MB) digoal_01: 169705988 (162 MB) digoal_02: 8094212 (7905 kB) template1: 6496772 (6345 kB) template0: 6488580 (6337 kB) | time=0.02s postgres=770831480;10485760;20971520 digoal=582435332;10485760;20971520 digoal_01=169705988;10485760;20971520 digoal_02=8094212;10485760;20971520 template1=6496772;10485760;20971520 template0=6488580;10485760;20971520
使用exclude排除不监控的库
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=database_size -w 10M -c 20M --exclude=template0,template1
POSTGRES_DATABASE_SIZE CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) postgres: 770831480 (735 MB) digoal: 582435332 (555 MB) digoal_01: 169705988 (162 MB) digoal_02: 8094212 (7905 kB) | time=0.02s postgres=770831480;10485760;20971520 digoal=582435332;10485760;20971520 digoal_01=169705988;10485760;20971520 digoal_02=8094212;10485760;20971520
6.4.11 数据库运行统计信息, 这个一般给cacti用来画图. 不返回异常结果
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=dbstats
backends:2 commits:1587 rollbacks:1 read:347126 hit:264246 idxscan:0 idxtupread:0 idxtupfetch:0 idxblksread:2744 idxblkshit:1 seqscan:12 seqtupread:40043328 ret:40799382 fetch:95747 ins:5223 upd:1512 del:4664 dbname:postgres
6.4.12 触发器监控, 监控是否有disabled的触发器.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=disabled_triggers -w 1 -c 2
POSTGRES_DISABLED_TRIGGERS OK: DB "postgres" (host:127.0.0.1) (port=1919) Disabled triggers: 0 | time=0.01s
多个数据库则使用--db postgres,other_dbname,...指定.
6.4.13 磁盘空间监控
阈值为百分比或者容量.
使用超过指定的百分比或容量则告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=disk_space -w "10% or 5GB"
POSTGRES_DISK_SPACE WARNING: DB "postgres" (host:127.0.0.1) (port=1919) FS /dev/sda1 mounted on / is using 9.34 GB of 28.38 GB (35%) | time=0.01s size=10027884544B
使用规则表达式限定设备名.
--include=~^/dev/
符号后面是规则表达式^/dev/表示以/dev/开头的设备.
6.4.14 PostgreSQL buffer命中率监控
监控用到pg_stat_database视图,
blks_read bigint Number of disk blocks read in this database
blks_hit bigint Number of times disk blocks were found already in the buffer cache, so that a read was not necessary (this only includes hits in the PostgreSQL buffer cache, not the operating system's file system cache)
算法 :
blks_hit/(blks_hit+blks_read)
但是请注意, blks_hit包含的是PostgreSQL buffer中的读, 而不包含操作系统的cache命中, 所以可能命中率很低, 但是实际上读物理IO并不高, 因为大部分可能在文件系统cache中了.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=hitratio -w 90% -c 80%
POSTGRES_HITRATIO CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) postgres: 43.60 | time=0.02s postgres=43.60;90;80
使用pg_stat_database查询结果一致
postgres=# select datname,blks_read,blks_hit from pg_stat_database;
datname | blks_read | blks_hit
-----------+-----------+----------
template1 | 0 | 0
template0 | 0 | 0
postgres | 347127 | 268297
digoal_01 | 0 | 0
digoal_02 | 0 | 0
digoal | 0 | 0
(6 rows)
postgres=# select 268297/(268297+347127.0);
?column?
------------------------
0.43595472389767054908
(1 row)
多个数据库使用--db dbname1,dbname2,...指定.
6.4.15 hot_standby延迟监控, 用来计算slave xlog的接收延迟和apply延迟, slave必须是hot_standby模式, 因为需要连接到master和slave数据库中执行如下SQL :
slave :
SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay
master :
SELECT pg_current_xlog_location() AS location
获得结果后, 比较偏移量. 根据偏移量和提供的阈值进行比较, 超出则告警.
告警阈值为偏移量, 这个不太好评估, 因为check_postgres是先查询slave, 再查询master的, 所以本质上来讲已经有一定的偏移了. 对于DML 繁忙的数据库, 需要微调告警阈值(字节数)才能达到不误报的效果.
例子 :
postgres=# create table test (id int primary key, info text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
postgres=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
pg92@db-172-16-3-39-> vi upd.sh
\setrandom id 1 1000000
update test set info=md5(random()::text) where id=:id;
pg92@db-172-16-3-39-> pgbench -M prepared -f ./upd.sh -n -r -h 127.0.0.1 -p 1919 -U postgres -c 8 -j 2 -T 60 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 8
number of threads: 2
duration: 60 s
number of transactions actually processed: 2705800
tps = 45095.767006 (including connections establishing)
tps = 45104.988913 (excluding connections establishing)
statement latencies in milliseconds:
0.002196 \setrandom id 1 1000000
0.172358 update test set info=md5(random()::text) where id=:id;
由于先查询slave, 再查询master带来的偏移量的问题,
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=hot_standby_delay -H 127.0.0.1 -p 11919 --db postgres -u postgres -w 1
POSTGRES_HOT_STANDBY_DELAY WARNING: DB "postgres" (host:127.0.0.1) (port=11919) 47889704 | time=0.03s replay_delay=47889704;1 receive-delay=47889704;1
pgbench结束后恢复 :
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=hot_standby_delay -H 127.0.0.1 -p 11919 --db postgres -u postgres -w 1 --verbose
POSTGRES_HOT_STANDBY_DELAY OK: DB "postgres" (host:127.0.0.1) (port=11919) 0 | time=0.01s replay_delay=0;1 receive-delay=0;1
前面的连接参数为master,
-H 127.0.0.1 -p 1919 -db postgres -u postgres
后面的连接参数为slave.
-H 127.0.0.1 -p 11919 --db postgres -u postgres
6.4.16 监控表/索引对象的大小. 有超过阈值的对象则告警.
action :
index_size, 只统计索引的大小. pg_class.relkind='i'
table_size, 只统计表的大小. pg_class.relkind='t'
relation_size, 统计所有对象的大小. pg_class.relkind 不区分.
可以结合include, exclude, includeuser, excludeuser使用
需要指定被监控的数据库, 库多的话-db中用逗号把库隔开就好了.
举例 :
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=table_size -w 200MB --exclude=batch --perflimit=3
POSTGRES_TABLE_SIZE OK: DB "postgres" (host:127.0.0.1) (port=1919) largest table is "public.test": 114 MB | time=0.01s public.test=119578624B;209715200 pg_catalog.pg_proc=507904B;209715200
多个数据库使用--db dbname1,dbname2,...指定
6.4.17 监控上一次vacuum, analyze, autovacuum, autoanalyze到现在的时间间隔. 超过阈值告警.
用到如下函数 :
pg_catalog | pg_stat_get_last_analyze_time | timestamp with time zone | oid | normal
pg_catalog | pg_stat_get_last_autoanalyze_time | timestamp with time zone | oid | normal
pg_catalog | pg_stat_get_last_autovacuum_time | timestamp with time zone | oid | normal
pg_catalog | pg_stat_get_last_vacuum_time | timestamp with time zone | oid | normal
阈值单位如下 :
seconds, minutes, hours, and days
需指定数据库监控. 监控范围受限于指定的数据库.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=last_analyze --perflimit=3 -w '1 d'
POSTGRES_LAST_ANALYZE OK: DB "postgres" (host:127.0.0.1) (port=1919) DB: postgres TABLE: public.batch: 15:45 June 04, 2013 (20 hours 4 minutes) | time=0.02s postgres.public.batch=72269s;86400 postgres.pg_catalog.pg_authid=72269s;86400; postgres.pg_catalog.pg_constraint=72269s;86400
多个数据库需要使用--db dbname1,dbname2,...指定.
6.4.18 锁监控, 只监控数量, 不监控锁等待的时间.
用到如下SQL :
SELECT granted, mode, datname FROM pg_locks l RIGHT JOIN pg_database d ON (d.oid=l.database) WHERE d.datallowconn
阈值内容包括锁总数, 等待数, 锁类型数量. 超过则告警.
锁类型截取lock字符, 大小写不敏感.
postgres=# select mode from pg_locks group by 1;
mode
-----------------
ExclusiveLock
AccessShareLock
(2 rows)
例如以上所类型为exclusive和accessshare
举例 :
返回结果包含所有数据库.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=locks -w 100 -c total=250:waiting=5:exclusive=20
POSTGRES_LOCKS OK: DB "postgres" (host:127.0.0.1) (port=1919) total=1 | time=0.02s digoal.total=0;100;250 digoal_01.total=0;100;250 digoal_02.total=0;100;250 postgres.total=1;100;250 template1.total=0;100;250
6.4.19 pg_agent jobs执行失败监控
监控阈值, 时间范围, 例如1d, 表示监控1天内无执行失败的jobs.
需要读取pgagent.pga_job表的内容, 仅适用于安装了pg_agent插件的数据库.
pg93@db-172-16-3-33-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1999 -u postgres --db postgres --action=pgagent_jobs -w 1d
Password for user postgres:
POSTGRES_PGAGENT_JOBS OK: DB "postgres" (host:127.0.0.1) (port=1999) No failed jobs | time=2.11s
6.4.20 prepared transaction监控.
二阶段事务监控, 当二阶事务在限定时间范围内未结束, 则告警. 因为事务长时间不提交会影响vacuum回收空间, 只要没有提交, 那么自那起DML产生的垃圾是肯定没有办法回收的. 将会导致数据库膨胀.
当然, 如果你的数据库没有开启二阶事务, 那就不需要监控这个了.
#max_prepared_transactions = 0 # zero disables the feature
# (change requires restart)
举例 :
阈值单位, 秒.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=prepared_txns -w 10
POSTGRES_PREPARED_TXNS OK: DB "postgres" (host:127.0.0.1) (port=1919) No prepared transactions found | time=0.11s
6.4.21 检查SQL的执行时长, 超过阈值的告警. 通过pg_stat_activity查询. 排除IDLE状态.
sub check_query_time {
## Check the length of running queries
check_txn_idle('qtime',
msg('queries'),
msg('query-time'),
'query_start',
q{query_start IS NOT NULL AND current_query NOT LIKE '<IDLE>%'});
return;
} ## end of check_query_time
可用单位 :
Valid units are 'seconds', 'minutes', 'hours', or 'days'
举例 :
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=query_time -w '1s'
POSTGRES_QUERY_TIME OK: DB "postgres" (host:127.0.0.1) (port=1919) longest query: 0s | time=0.01s query_time=0s;1
监控包括所有数据库, 无需多个指定.
6.4.22 利用UPDATE SQL监控主从复制延迟.
前面介绍的是流复制的延迟监控, 通过主从的3个函数来实现. 这里要介绍的是通过UPDATE QUERY来实现监控延迟.
适用范围更广. 例如bucardo, londiste3, slony等复制插件都适合.
这种方法监控需要找一个表, 指定一个pk字段以及值, 再指定该表的另一个字段作为更新字段, 再指定这个更新字段的2个值, 每次监控时在这两个值之间更新. 通过判断主从的这个更新字段的值是否匹配来判断是否成功复制.
阈值单位 :
Valid units are 'seconds', 'minutes', 'hours', or 'days'.
阈值为时间,
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=replicate_row -w 5 -H 127.0.0.1 -p 11919 -u postgres --db postgres --repinfo=test,id,1,info,12f5cc73549166aef1034909ce465969,b
POSTGRES_REPLICATE_ROW OK: DB "postgres" (host:127.0.0.1) (port=1919) Row was replicated | time=2s
前面的连接参数是主库
-H 127.0.0.1 -p 1919 -u postgres --db postgres
后面的连接参数是从库
-H 127.0.0.1 -p 11919 -u postgres --db postgres
--repinfo 指定了6个字段, 分别代表表名, pk字段名, pk字段值, 更新字段名, 更新字段的两个值.
pg92@db-172-16-3-39-> psql -U postgres postgres -p 1919
psql (9.2beta1)
Type "help" for help.
postgres=# select * from test where id=1;
id | info
----+----------------------------------
1 | 12f5cc73549166aef1034909ce465969
(1 row)
postgres=# \q
pg92@db-172-16-3-39-> psql -U postgres postgres -p 11919
psql (9.2beta1)
Type "help" for help.
postgres=# select * from test where id=1;
id | info
----+----------------------------------
1 | 12f5cc73549166aef1034909ce465969
(1 row)
执行1次监控后, info值被更新
pg92@db-172-16-3-39-> psql -U postgres postgres -p 1919
psql (9.2beta1)
Type "help" for help.
postgres=# select * from test where id=1;
id | info
----+------
1 | b
(1 row)
postgres=# \q
pg92@db-172-16-3-39-> psql -U postgres postgres -p 11919
psql (9.2beta1)
Type "help" for help.
postgres=# select * from test where id=1;
id | info
----+------
1 | b
(1 row)
6.4.23 比较2个数据库的schema是否一致, 不比较数据, 只比较结构.
例1 :
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=same_schema -H 127.0.0.1 -p 11919 -u postgres --db postgres
POSTGRES_SAME_SCHEMA OK: DB "postgres" (host:127.0.0.1) (ports:1919,11919) All databases have identical items | time=0.76s
例2 :
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=same_schema -H 127.0.0.1 -p 11919 -u postgres --db digoal
POSTGRES_SAME_SCHEMA CRITICAL: (databases:postgres,digoal) (host:127.0.0.1) (ports:1919,11919) Databases were different. Items not matched: 5 | time=0.71s
DB 1: dbservice= port=1919 host=127.0.0.1 dbname=postgres user=postgres
DB 1: PG version: 9.2beta1
DB 1: Total objects: 33
DB 2: dbservice= port=11919 host=127.0.0.1 dbname=digoal user=postgres
DB 2: PG version: 9.2beta1
DB 2: Total objects: 26
Function "public.pg_freespace(regclass)" does not exist on all databases:
Exists on: 1
Missing on: 2
Function "public.pg_freespace(regclass,bigint)" does not exist on all databases:
Exists on: 1
Missing on: 2
Table "public.test" does not exist on all databases:
Exists on: 1
Missing on: 2
Index "public.test_pkey" does not exist on all databases:
Exists on: 1
Missing on: 2
Constraint "public.test_pkey" does not exist on all databases:
Exists on: 1
Missing on: 2
这个除了用于监控, 还可以用于数据迁移工作后的结构比对. 用法详见check_postgres手册.
6.4.24 序列使用率监控, 用于监控序列值使用的百分比. 注意不管是否使用cycle, 只计较minvalue和increment by
SELECT last_value, slots, used, ROUND(used/slots*100) AS percent,
CASE WHEN slots < used THEN 0 ELSE slots - used END AS numleft
FROM (
SELECT last_value,
CEIL((LEAST(max_value, $maxValue)-min_value::numeric+1)/increment_by::NUMERIC) AS slots,
CEIL((last_value-min_value::numeric+1)/increment_by::NUMERIC) AS used
FROM $seqname
举例 :
postgres=# create sequence seq increment by 10 minvalue 100 maxvalue 200 cycle;
CREATE SEQUENCE
使用率超过90%则告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=sequence -w 90%
POSTGRES_SEQUENCE OK: DB "postgres" (host:127.0.0.1) (port=1919) public.seq=9% (calls left=10) | time=0.01s public.seq=9%;90%;%
使用率超过1%则告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=sequence -w 1%
POSTGRES_SEQUENCE WARNING: DB "postgres" (host:127.0.0.1) (port=1919) public.seq=9% (calls left=10) | time=0.01s public.seq=9%;1%;%
多个数据库不能使用--db dbname1,dbname2指定.
需要写成多条check_postgres监控命令.
6.4.25 监控数据库的参数是否被改变.
首先要生成当前的md5
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=settings_checksum -c 0
POSTGRES_SETTINGS_CHECKSUM UNKNOWN: DB "postgres" (host:127.0.0.1) (port=1919) checksum: e70f4e6dc87fd4072c86fc2a3b2c25a8 | time=0.01s
然后根据这个md5进行监控
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=settings_checksum -c e70f4e6dc87fd4072c86fc2a3b2c25a8
POSTGRES_SETTINGS_CHECKSUM OK: DB "postgres" (host:127.0.0.1) (port=1919) checksum: e70f4e6dc87fd4072c86fc2a3b2c25a8 | time=0.01s
注意这个监控需要指定数据库和用户. 因为可以为用户和数据库单独设置参数值.
不同的用户和数据库得到的md5值可能不一样.
6.4.26 监控远程数据库与本地服务器的时间差
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres,digoal --action=timesync -w 1 -c 3
POSTGRES_TIMESYNC OK: DB "digoal" (host:127.0.0.1) (port=1919) timediff=0 DB=2013-06-05 13:20:24 Local=2013-06-05 13:20:24 DB "postgres" (host:127.0.0.1) (port=1919) timediff=0 DB=2013-06-05 13:20:24 Local=2013-06-05 13:20:24 | time=0.01s diff=0s;1;3 time=0.01s diff=0s;1;3
把IP 替换成远程数据库的IP.
6.4.27 监控空闲事务, 根据数量以及可选时长进行告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres,digoal --action=txn_idle -w "1 for 10s" -c "2 for 10s"
POSTGRES_TXN_IDLE OK: DB "postgres" (host:127.0.0.1) (port=1919) no idle in transaction | time=0.01s transaction_time=0;1;10
超过1个idle in transaction状态的事务告警.
超过2个idle in transaction, 并且最长的idle已经持续10秒以上则告警.
-w 和-c最好使用一种风格, 否则会有一个无效. 例如 -w 1 -c "2 for 10s" 不要这么使用.
6.4.28 监控事务的执行时间, 超过则告警.
postgres=# begin;
BEGIN
postgres=# alter database postgres set work_mem='10MB';
ALTER DATABASE
不关闭事务. 然后执行监控如下
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres,digoal --action=txn_time -w 1s -c 2s
POSTGRES_TXN_TIME CRITICAL: DB "postgres" (host:127.0.0.1) (port=1919) longest txn: 44s PID:4752 database:postgres username:postgres | time=0.01s transaction_time=44s;1;2
6.4.29 监控数据库年龄, 或者理解为数据库中最老的数据库记录的版本号的年龄.
一般是防止wrapping自动触发的.
类似check_postgres freeze的监控, 只是这里用的不是百分比, 而是事务数.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres,digoal --action=txn_wraparound -w 1
POSTGRES_TXN_WRAPAROUND WARNING: DB "digoal" (host:127.0.0.1) (port=1919) digoal: 43758651 DB "postgres" (host:127.0.0.1) (port=1919) digoal: 43758651 | time=0.01s digoal=43758651;1;;0;2000000000 digoal_01=43758651;1;;0;2000000000 digoal_02=43758651;1;;0;2000000000 postgres=43758651;1;;0;2000000000 template1=43758651;1;;0;2000000000 time=0.01s digoal=43758651;1;;0;2000000000 digoal_01=43758651;1;;0;2000000000 digoal_02=43758651;1;;0;2000000000 postgres=43758651;1;;0;2000000000 template1=43758651;1;;0;2000000000
6.4.30 监控xlog文件数量, 超出阈值则告警.
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=wal_files -w 100
POSTGRES_WAL_FILES WARNING: DB "digoal" (host:127.0.0.1) (port=1919) WAL files found: 146 DB "postgres" (host:127.0.0.1) (port=1919) WAL files found: 146 | time=0.08s files=146;100 time=0.02s files=146;100
超出可能是由于归档失败不断累积造成的.
下一篇将介绍check_postgres在nagios监控平台中的配置.