Recently, while reworking our internal server operations and the way software is configured and deployed in our production environment, we have been using Docker more and more. To keep track of how the Docker containers are actually running, monitoring had to be put in place.
The setup uses the latest Grafana + Prometheus + cAdvisor to monitor Docker:

- cAdvisor is deployed on every machine that runs Docker containers (the Agents) and collects container metrics.
- Prometheus pulls the monitoring data from each Agent's cAdvisor, stores it locally, and, according to the alerting rules, pushes alerts to Alertmanager.
- Alertmanager receives the alerts pushed by Prometheus and handles alert management and notification.
- Grafana reads the monitoring data from Prometheus and builds dashboards, providing a web UI for viewing it.
2、Deployment notes
1) System and software versions
- CentOS 7
- Docker version 20.10.2
- docker-compose version 1.27.4, build 40524192
- Grafana 8.5.1
- Prometheus 2.35.0
- cAdvisor v0.39.3
- Alertmanager 0.24.0
| Node | Deployed software | Role |
| --- | --- | --- |
| 10.2.0.102 | Grafana, Prometheus, cAdvisor, Alertmanager | Monitor Server |
| 10.2.0.103 | cAdvisor | Monitor Agent |
| 10.2.0.104 | cAdvisor | Monitor Agent |
| 10.2.0.105 | cAdvisor | Monitor Agent |
On every node to be monitored (all four hosts in the table above, since the Monitor Server is scraped as well), create a docker-compose.yml for cAdvisor:

version: '3.4'
services:
  cadvisor:
    # google/cadvisor is no longer updated; this image matches the v0.39.3 listed above
    image: gcr.io/cadvisor/cadvisor:v0.39.3
    hostname: cadvisor
    container_name: cadvisor
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "18080:8080"
Start cAdvisor on the node:

sudo docker compose up -d
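As a quick sanity check (my own addition, not part of the original steps), you can confirm that cAdvisor is exporting metrics on the host port mapped above:

# cAdvisor publishes Prometheus-format metrics on the mapped host port (18080 -> 8080)
curl -s http://localhost:18080/metrics | head -n 20

If this prints lines in the Prometheus exposition format (for example container_cpu_usage_seconds_total), the Agent is ready to be scraped.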
Directory layout on the Monitor Server (10.2.0.102):

/data/docker-monitor/
  docker-compose.yml
  grafana/
    data/
  prometheus/
    data/
    config/
      alertmanager.conf
      prometheus.yml
      rules/
        docker_rules.yml
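The layout can be created in one go; this is just a convenience sketch using the paths from this post:

# Create the data and config directories on the Monitor Server
sudo mkdir -p /data/docker-monitor/grafana/data \
              /data/docker-monitor/prometheus/data \
              /data/docker-monitor/prometheus/config/rules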
docker-compose.yml on the Monitor Server:

version: '3.4'
services:
  grafana:
    image: grafana/grafana-oss
    container_name: grafana
    hostname: grafana
    restart: always
    # user: "0"
    ports:
      - "13000:3000"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/grafana/data:/var/lib/grafana

  prometheus:
    image: prom/prometheus:latest
    hostname: prometheus
    container_name: prometheus
    restart: always
    # run as root so Prometheus can write to the bind-mounted data directory
    user: "0"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/docker-monitor/prometheus/config/rules:/etc/prometheus/rules
      - /data/docker-monitor/prometheus/data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      # allows reloading the configuration via HTTP POST to /-/reload
      - '--web.enable-lifecycle'
    ports:
      - "19090:9090"

  alertmanager:
    container_name: alertmanager
    hostname: alertmanager
    image: prom/alertmanager
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/prometheus/config/alertmanager.conf:/etc/alertmanager/alertmanager.conf
    command:
      - '--config.file=/etc/alertmanager/alertmanager.conf'
    ports:
      - "19093:9093"
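One note on the commented-out user: "0" line for Grafana: the official grafana-oss image runs as UID 472, so the bind-mounted data directory must be writable by that user. Instead of running the container as root, an alternative (my assumption, not from the original setup) is to change ownership on the host:

# Grafana's official image runs as UID 472; give it ownership of the bind-mounted data dir
sudo chown -R 472:472 /data/docker-monitor/grafana/data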
prometheus/config/prometheus.yml:

# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # alertmanager url
            - 10.2.0.102:19093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"

# Scrape configurations:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["10.2.0.102:19090"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ['10.2.0.102:18080', '10.2.0.103:18080', '10.2.0.104:18080', '10.2.0.105:18080']
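Once the stack is started (the docker compose up -d step below), you can verify that all four cAdvisor targets are being scraped, either on the Targets page at http://10.2.0.102:19090/targets or through the HTTP API; a small sketch:

# Query Prometheus for the scrape status of the cadvisor job; each target should report value "1"
curl -sG 'http://10.2.0.102:19090/api/v1/query' \
  --data-urlencode 'query=up{job="cadvisor"}'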
Alerting rules: according to rule_files, Prometheus evaluates the rules and sends any firing alerts to Alertmanager.

prometheus/config/rules/docker_rules.yml:

groups:
  - name: agent-down
    rules:
      - alert: agent-down
        expr: up{job="cadvisor"} == 0
        for: 15s
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Agent {{ $labels.job }} down!"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 15 seconds."
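Because the Prometheus container is started with --web.enable-lifecycle, later edits to prometheus.yml or to the rule files can be applied without restarting the container:

# Ask the running Prometheus to re-read its configuration and rule files
curl -X POST http://10.2.0.102:19090/-/reload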
prometheus/config/alertmanager.conf:

global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxxxxxx'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'ops'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

receivers:
  - name: 'ops'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
    # webhook_configs:
    #   - url: 'http://10.2.0.222:18060/dingtalk/webhook/send'
    #     send_resolved: true
Start the monitoring stack on the Monitor Server:

sudo docker compose up -d
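To check the notification path end to end, a throwaway alert can be pushed straight to Alertmanager's v2 API (a sketch of my own; the label values match the rule above):

# Send a synthetic alert to Alertmanager; it should reach the 'ops' email receiver
curl -X POST http://10.2.0.102:19093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "agent-down-test", "severity": "critical", "team": "ops"},
        "annotations": {"summary": "Test alert pushed via curl"}}]'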
Import a dashboard template:

Log in to Grafana at http://10.2.0.102:13000 and add Prometheus (http://10.2.0.102:19090) as a data source, then search https://grafana.com/grafana/dashboards/ for a suitable template and import it by ID; the ID I used is 15894.

After the import, wait a moment and the monitoring data will appear.
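Adding the Prometheus data source can also be scripted through Grafana's HTTP API instead of the UI; a minimal sketch, assuming the default admin credentials and the ports used in this post:

# Register Prometheus as the default data source in Grafana
curl -s -u admin:admin -X POST http://10.2.0.102:13000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name": "Prometheus", "type": "prometheus", "access": "proxy",
       "url": "http://10.2.0.102:19090", "isDefault": true}'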
A few things that could still be improved:

- Prometheus storage and high-availability configuration
- Alertmanager high-availability configuration
- Firewall rules to restrict access to the exposed ports
- Beyond monitoring Docker, node-exporter can be added to monitor the hosts themselves (a sketch follows below)
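For the node-exporter idea above, the upstream image can be run alongside cAdvisor on each host and added as another scrape job; a minimal sketch, not part of this deployment:

# Run node-exporter with read-only access to the host's /proc, /sys and filesystems
sudo docker run -d --name node-exporter --restart=always \
  --net=host --pid=host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter:latest --path.rootfs=/host

A matching job would then be appended to scrape_configs in prometheus.yml, pointing at port 9100 on each host.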
5、References
https://grafana.com/
https://github.com/google/cadvisor
https://prometheus.io/docs/introduction/overview/
https://grafana.com/docs/grafana-cloud/quickstart/docker-compose-linux/
https://dzlab.github.io/monitoring/2021/12/30/monitoring-stack-docker/
https://www.prometheus.io/webtools/alerting/routing-tree-editor/
https://grafana.com/grafana/dashboards/15894