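Building a PostgreSQL Monitoring System with Grafana + Prometheus
There are many ways to monitor PostgreSQL. This article presents a visually appealing stack: Grafana + Prometheus + Alertmanager + pgSCV. Introductions to the individual components are easy to find online and are not repeated here; the focus is on how to install and integrate them.
The figure below shows a monitoring architecture with Prometheus at its core (image from the Prometheus website).
Server list: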
Node | IP | OS | Installed software | Notes |
---|---|---|---|---|
pg_node1 | 192.168.210.15 | CentOS 7.6 | prometheus 2.28.1, grafana 8.1.0, pgscv 0.7.1, node_exporter 1.2.1, alertmanager 0.22.2 | Hosts the monitoring services; also a monitored node |
pg_node2 | 192.168.210.81 | CentOS 7.6 | pgscv 0.7.1, node_exporter 1.2.1 | Monitored node |
pg_node3 | 192.168.210.33 | CentOS 7.6 | pgscv 0.7.1, node_exporter 1.2.1 | Monitored node |
Installation directories
Software | Host | Installation directory |
---|---|---|
grafana | 192.168.210.15 | /data/monitor/grafana |
prometheus | 192.168.210.15 | /data/monitor/prometheus |
alertmanager | 192.168.210.15 | /data/monitor/prometheus/alertmanager |
node_exporter | 192.168.210.15/192.168.210.81/192.168.210.33 | /data/monitor/prometheus/plugin/node_exporter |
pgscv | 192.168.210.15/192.168.210.81/192.168.210.33 | /data/monitor/prometheus/plugin/postgres_exporter |
Host configuration
Disable SELinux (all nodes)
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
setenforce 0
Configure the firewall
On each node, open the ports that match the software deployed there.
grafana:3000
prometheus:9090
node_exporter:9100
pgscv:9890
alertmanager:465/9093
firewall-cmd --add-port=9090/tcp --permanent
firewall-cmd --add-port=3000/tcp --permanent
firewall-cmd --add-port=9100/tcp --permanent
firewall-cmd --add-port=9093/tcp --permanent
firewall-cmd --add-port=465/tcp --permanent
firewall-cmd --add-port=9890/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all
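Since each node runs a different subset of components, the port list above can be turned into a small lookup so that a node only opens what it needs. The following is a minimal sketch, not part of the original setup: the helper names `ports_for_role` and `firewall_commands` are hypothetical, and port 465 is left out on the assumption that it is only used for outgoing SMTP connections from Alertmanager. The functions only print the `firewall-cmd` lines so they can be reviewed first.

```shell
# Hypothetical helper: map a component name to the TCP port listed above.
ports_for_role() {
    case "$1" in
        grafana)       echo 3000 ;;
        prometheus)    echo 9090 ;;
        node_exporter) echo 9100 ;;
        pgscv)         echo 9890 ;;
        alertmanager)  echo 9093 ;;
        *)             echo "unknown component: $1" >&2; return 1 ;;
    esac
}

# Print (do not execute) the firewall-cmd lines for the given components.
firewall_commands() {
    for role in "$@"; do
        port=$(ports_for_role "$role") || return 1
        echo "firewall-cmd --add-port=${port}/tcp --permanent"
    done
    echo "firewall-cmd --reload"
}
```

On a node that is only monitored, such as pg_node2, `firewall_commands node_exporter pgscv` prints the two `--add-port` lines for 9100 and 9890 followed by the reload; piping the output to `sh` as root would apply them.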
Install Grafana
mkdir -p /data/monitor/
cd /data/monitor
wget https://dl.grafana.com/oss/release/grafana-8.1.0.linux-amd64.tar.gz
tar xf grafana-8.1.0.linux-amd64.tar.gz
mv grafana-8.1.0 grafana
rm -f grafana-8.1.0.linux-amd64.tar.gz
mkdir -p /data/monitor/grafana/data/log
# Edit the relevant parameters
vi grafana/conf/defaults.ini
data = /data/monitor/grafana/data
logs = /data/monitor/grafana/data/log
plugins = /var/lib/grafana/plugins
provisioning = /data/monitor/grafana/conf/provisioning
# Create the systemd unit file
cat > /usr/lib/systemd/system/grafana-server.service <<EOF
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
#After=postgresql.service
[Service]
User=root
Group=root
Type=simple
Restart=on-failure
WorkingDirectory=/data/monitor/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/data/monitor/grafana/bin/grafana-server --config=/data/monitor/grafana/conf/defaults.ini
LimitNOFILE=10000
TimeoutStopSec=20
[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl enable grafana-server
systemctl start grafana-server
# Install plugins (used by the dashboards); restart grafana-server afterwards so they are loaded
./grafana/bin/grafana-cli plugins install digiapulssi-breadcrumb-panel
./grafana/bin/grafana-cli plugins install grafana-polystat-panel
./grafana/bin/grafana-cli plugins install yesoreyeram-boomtable-panel
# Log in to the web UI (default credentials admin/admin; the first login prompts for a password change)
http://192.168.210.15:3000
Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.28.1/prometheus-2.28.1.linux-amd64.tar.gz
tar xf prometheus-2.28.1.linux-amd64.tar.gz
mv prometheus-2.28.1.linux-amd64 prometheus
rm -f prometheus-2.28.1.linux-amd64.tar.gz
# Edit the relevant parameters
cd prometheus
vi prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.210.15:9093 # Alertmanager address
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml" # alerting rules
scrape_configs:
  # Note: when targets are loaded from a file, make sure the file already exists
  # in the Prometheus working directory before Prometheus starts; otherwise
  # errors may appear later during configuration.
  - file_sd_configs:
      - files:
          - host.yml # created below
    job_name: Host
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)
        target_label: instance
        replacement: $1
      - source_labels: [__address__]
        regex: (.*)
        target_label: __address__
        replacement: $1:9100
  # Same note as above: pgscv.yml must exist before Prometheus starts.
  - file_sd_configs:
      - files:
          - pgscv.yml # created below
    job_name: pgscv
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)
        target_label: instance
        replacement: $1
      - source_labels: [__address__]
        regex: (.*)
        target_label: __address__
        replacement: $1:9890
  # Prometheus scraping itself
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
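The two relabel rules in each file-based job do the following: the first copies the bare IP from `__address__` into the `instance` label, and the second rewrites `__address__` to the IP plus the exporter port. As a rough illustration only, the same capture-and-replace can be emulated with `sed`; the function name `relabel_target` is hypothetical, and this mimics the regex substitution rather than Prometheus's actual relabeling engine.

```shell
# Emulate the two relabel rules for one target taken from a file_sd entry.
# $1 = target as written in host.yml / pgscv.yml, $2 = exporter port.
relabel_target() {
    target=$1
    port=$2
    # Rule 1: regex (.*) on __address__, replacement $1 -> instance label
    instance=$(printf '%s\n' "$target" | sed -E 's/^(.*)$/\1/')
    # Rule 2: regex (.*) on __address__, replacement $1:<port> -> __address__
    address=$(printf '%s\n' "$target" | sed -E "s/^(.*)\$/\\1:${port}/")
    echo "instance=${instance} __address__=${address}"
}
```

Calling `relabel_target 192.168.210.15 9100` shows why the target files can list plain IPs: the scrape address gains the port while the `instance` label stays port-free, so the host and pgSCV series for one machine share the same `instance` value.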
# Configure the host monitoring targets
vi host.yml
- labels:
    node_name: pg_node1
    service_name: pg_node1
  targets:
    - 192.168.210.15
- labels:
    node_name: pg_node2
    service_name: pg_node2
  targets:
    - 192.168.210.81
- labels:
    node_name: pg_node3
    service_name: pg_node3
  targets:
    - 192.168.210.33
# Configure the PostgreSQL monitoring targets
vi pgscv.yml
- labels:
    node_name: pg1
    service_name: pg1
  targets:
    - 192.168.210.15
- labels:
    node_name: pg2
    service_name: pg2
  targets:
    - 192.168.210.81
- labels:
    node_name: pg3
    service_name: pg3
  targets:
    - 192.168.210.33
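host.yml and pgscv.yml share the same shape, so with more than a few nodes it may be easier to generate them than to edit them by hand. A minimal sketch, assuming a hypothetical input of "name IP" pairs, one per line; the function name `gen_file_sd` is illustrative and not part of Prometheus.

```shell
# Hypothetical generator: read "node_name ip" pairs on stdin and print
# file_sd entries in the same shape as host.yml / pgscv.yml.
gen_file_sd() {
    while read -r name ip; do
        # Skip blank lines and comment lines
        case "$name" in ''|'#'*) continue ;; esac
        printf '%s\n' '- labels:'
        printf '    node_name: %s\n' "$name"
        printf '    service_name: %s\n' "$name"
        printf '%s\n' '  targets:'
        printf '    - %s\n' "$ip"
    done
}
```

For example, `printf 'pg_node1 192.168.210.15\npg_node2 192.168.210.81\n' | gen_file_sd > host.yml` would write two entries. Prometheus watches file_sd files for changes, so a regenerated file should be picked up without a restart.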
# Create the systemd unit file
cat > /usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus instance
Wants=network-online.target
After=network-online.target
#After=postgresql.service
[Service]
User=root
Group=root
Type=simple
Restart=on-failure
WorkingDirectory=/data/monitor/prometheus
RuntimeDirectory=prometheus
RuntimeDirectoryMode=0750
ExecStart=/data/monitor/prometheus/prometheus --storage.tsdb.retention.time=30d --web.enable-lifecycle --web.enable-admin-api --config.file=/data/monitor/prometheus/prometheus.yml
LimitNOFILE=10000
TimeoutStopSec=20
[Install]
WantedBy=multi-user.target
EOF
# Do not start it yet; start it after the exporters and Alertmanager are deployed
systemctl daemon-reload
systemctl enable prometheus
#systemctl start prometheus
Install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
tar xf alertmanager-0.22.2.linux-amd64.tar.gz
mv alertmanager-0.22.2.linux-amd64 alertmanager
rm -f alertmanager-0.22.2.linux-amd64.tar.gz
# Configure Alertmanager
vi alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m # resolve timeout (default 5m)
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: '[email protected]' # sender address
  smtp_auth_username: '[email protected]' # user name of the sender account
  smtp_auth_password: 'HTBPT***********' # authorization code of the sender account
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
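# Create the alerting rules (loaded through rule_files in prometheus.yml above); this is one sample rule.
# Run from /data/monitor/prometheus so that the path matches "rules/*.yml".
mkdir -p rules
vi rules/memory_over.yml
groups:
  - name: example
    rules:
      - alert: HostMemoryOverLimit
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: host memory usage exceeds the alert threshold"
          description: "{{$labels.instance}}: memory usage is above 80% (current value: {{ $value }})"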
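The rule's expression is the usual "used memory" percentage, (1 - MemAvailable / MemTotal) * 100. node_exporter takes both values from /proc/meminfo, so the same figure can be computed by hand on a node to sanity-check what the alert will see. A minimal sketch; the function name `mem_used_percent` is hypothetical.

```shell
# Compute (1 - MemAvailable / MemTotal) * 100 from /proc/meminfo-style input.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail if MemTotal was not found, to avoid dividing by zero
            if (total > 0) printf "%.1f\n", (1 - avail / total) * 100
            else exit 1
        }
    '
}
```

Run `mem_used_percent < /proc/meminfo` on a monitored host. With MemTotal at 8000000 kB and MemAvailable at 1000000 kB the result is 87.5, which is above the 80 threshold and would fire the alert once it has held for the 1m `for` period.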
# Start Alertmanager (from its own directory)
cd /data/monitor/prometheus/alertmanager
nohup ./alertmanager --config.file=alertmanager.yml &
# Open the web UI
http://192.168.210.15:9093
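Install the exporters (on every monitored node)
# Install node_exporter, the host metrics exporter
mkdir -p /data/monitor/prometheus/plugin
cd /data/monitor/prometheus/plugin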
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.1/node_exporter-1.2.1.linux-amd64.tar.gz
tar xf node_exporter-1.2.1.linux-amd64.tar.gz
mv node_exporter-1.2.1.linux-amd64 node_exporter
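# Start node_exporter
nohup node_exporter/node_exporter &
# Install pgSCV, the PostgreSQL metrics exporter
wget https://github.com/weaponry/pgscv/releases/download/0.7.1/pgscv_0.7.1_linux_amd64.tar.gz
mkdir -p postgres_exporter
tar xf pgscv_0.7.1_linux_amd64.tar.gz -C postgres_exporter/
# Set the environment variables. Alternatively, pass a configuration file with --config-file= at startup, which also allows custom metrics to be defined in that file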
cat >> /etc/profile <<EOF
export PGSCV_LISTEN_ADDRESS="0.0.0.0:9890"
export POSTGRES_DSN="postgresql://db_user:[email protected]:5432/postgres"
EOF
source /etc/profile
# Start pgSCV
nohup ./postgres_exporter/pgscv &
# View the metrics collected by node_exporter
http://192.168.210.15:9100/metrics
# View the metrics collected by pgSCV
http://192.168.210.15:9890/metrics
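Both endpoints return plain text in the Prometheus exposition format, so a quick command-line check is possible without a browser. The sketch below extracts the value of one metric from such text; the function name `metric_value` is hypothetical, and it assumes samples carry no trailing timestamp, so the last field on the line is the value.

```shell
# Print the value of the first sample of a metric from Prometheus
# text-format exposition read on stdin. $1 = metric name without labels.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                        # skip HELP/TYPE comment lines
        {
            metric = $1
            i = index(metric, "{")           # strip the label set, if any
            if (i > 0) metric = substr(metric, 1, i - 1)
            if (metric == name) { print $NF; found = 1; exit }
        }
        END { if (!found) exit 1 }           # non-zero status if absent
    '
}
```

For example, `curl -s http://192.168.210.15:9100/metrics | metric_value node_memory_MemTotal_bytes` prints the total memory reported by node_exporter, and a non-zero exit status indicates that the metric was not exposed.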
Start Prometheus
systemctl start prometheus
# Open the web UI; Status > Targets should list all scrape jobs
http://192.168.210.15:9090
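Configure Grafana
Configure the data source
Note: if your dashboards are imported from an offline JSON file, the data source Name must match the datasource used in that JSON template.
Import dashboards online
Import the host monitoring dashboard: https://grafana.com/grafana/dashboards/8919
Follow the same steps to import the PostgreSQL monitoring dashboard (https://grafana.com/grafana/dashboards/14540)
https://github.com/percona/grafana-dashboards/releases offers many more dashboards; import the ones you need and install the matching exporters. You can also build your own panels, which takes more expertise; copying an existing template and modifying it is a practical starting point.
View the host monitoring dashboard
View the PostgreSQL monitoring dashboard; it covers most of the metrics of interest
View the alert email (lower the threshold to trigger an alert)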
# References
https://grafana.com/grafana/dashboards
https://github.com/weaponry/pgscv/wiki
https://github.com/prometheus/prometheus
https://www.cnblogs.com/ilifeilong/p/10543876.html