Recently, while reworking our internal server operations and the way software is configured and deployed in our production environment, we have been using Docker more and more. To keep track of how the Docker containers are actually running, monitoring had to be put in place.
The setup uses the latest Grafana + Prometheus + cAdvisor to monitor Docker:

- cAdvisor is deployed on every machine that runs Docker containers (the Agents) and collects container metrics.
- Prometheus pulls the monitoring data from each Agent's cAdvisor, stores it locally, and, according to the alerting rules, pushes alerts to Alertmanager.
- Alertmanager receives the alerts pushed by Prometheus and handles alert management and notification.
- Grafana reads the monitoring data from Prometheus and builds dashboards, providing a web UI for viewing it.
2、Deployment notes
1) System and software versions
- CentOS 7
- Docker version 20.10.2
- docker-compose version 1.27.4, build 40524192
- Grafana 8.5.1
- Prometheus 2.35.0
- cAdvisor v0.39.3
- Alertmanager 0.24.0
| Node | Deployed software | Role |
| --- | --- | --- |
| 10.2.0.102 | Grafana, Prometheus, cAdvisor, Alertmanager | Monitor Server |
| 10.2.0.103 | cAdvisor | Monitor Agent |
| 10.2.0.104 | cAdvisor | Monitor Agent |
| 10.2.0.105 | cAdvisor | Monitor Agent |
On every node to be monitored (all four hosts in the table above, since the Monitor Server is scraped as well), create a docker-compose.yml for cAdvisor:

version: '3.4'
services:
  cadvisor:
    # google/cadvisor is no longer updated; this image matches the v0.39.3 listed above
    image: gcr.io/cadvisor/cadvisor:v0.39.3
    hostname: cadvisor
    container_name: cadvisor
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "18080:8080"
Start cAdvisor on the node:

sudo docker compose up -d
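As a quick sanity check (my own addition, not part of the original steps), you can confirm that cAdvisor is exporting metrics on the host port mapped above:

# cAdvisor publishes Prometheus-format metrics on the mapped host port (18080 -> 8080)
curl -s http://localhost:18080/metrics | head -n 20

If this prints lines in the Prometheus exposition format (for example container_cpu_usage_seconds_total), the Agent is ready to be scraped.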
Directory layout on the Monitor Server (10.2.0.102):

/data/docker-monitor/
  docker-compose.yml
  grafana/
    data/
  prometheus/
    data/
    config/
      alertmanager.conf
      prometheus.yml
      rules/
        docker_rules.yml
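The layout can be created in one go; this is just a convenience sketch using the paths from this post:

# Create the data and config directories on the Monitor Server
sudo mkdir -p /data/docker-monitor/grafana/data \
              /data/docker-monitor/prometheus/data \
              /data/docker-monitor/prometheus/config/rules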
docker-compose.yml on the Monitor Server:

version: '3.4'
services:
  grafana:
    image: grafana/grafana-oss
    container_name: grafana
    hostname: grafana
    restart: always
    # user: "0"
    ports:
      - "13000:3000"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/grafana/data:/var/lib/grafana

  prometheus:
    image: prom/prometheus:latest
    hostname: prometheus
    container_name: prometheus
    restart: always
    # run as root so Prometheus can write to the bind-mounted data directory
    user: "0"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/docker-monitor/prometheus/config/rules:/etc/prometheus/rules
      - /data/docker-monitor/prometheus/data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      # allows reloading the configuration via HTTP POST to /-/reload
      - '--web.enable-lifecycle'
    ports:
      - "19090:9090"

  alertmanager:
    container_name: alertmanager
    hostname: alertmanager
    image: prom/alertmanager
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/prometheus/config/alertmanager.conf:/etc/alertmanager/alertmanager.conf
    command:
      - '--config.file=/etc/alertmanager/alertmanager.conf'
    ports:
      - "19093:9093"
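One note on the commented-out user: "0" line for Grafana: the official grafana-oss image runs as UID 472, so the bind-mounted data directory must be writable by that user. Instead of running the container as root, an alternative (my assumption, not from the original setup) is to change ownership on the host:

# Grafana's official image runs as UID 472; give it ownership of the bind-mounted data dir
sudo chown -R 472:472 /data/docker-monitor/grafana/data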
prometheus/config/prometheus.yml:

# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # alertmanager url
            - 10.2.0.102:19093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"

# Scrape configurations:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["10.2.0.102:19090"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ['10.2.0.102:18080', '10.2.0.103:18080', '10.2.0.104:18080', '10.2.0.105:18080']
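Once the stack is started (the docker compose up -d step below), you can verify that all four cAdvisor targets are being scraped, either on the Targets page at http://10.2.0.102:19090/targets or through the HTTP API; a small sketch:

# Query Prometheus for the scrape status of the cadvisor job; each target should report value "1"
curl -sG 'http://10.2.0.102:19090/api/v1/query' \
  --data-urlencode 'query=up{job="cadvisor"}'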
Alerting rules: according to rule_files, Prometheus evaluates the rules and sends any firing alerts to Alertmanager.

prometheus/config/rules/docker_rules.yml:

groups:
  - name: agent-down
    rules:
      - alert: agent-down
        expr: up{job="cadvisor"} == 0
        for: 15s
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Agent {{ $labels.job }} down!"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 15 seconds."
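Because the Prometheus container is started with --web.enable-lifecycle, later edits to prometheus.yml or to the rule files can be applied without restarting the container:

# Ask the running Prometheus to re-read its configuration and rule files
curl -X POST http://10.2.0.102:19090/-/reload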
prometheus/config/alertmanager.conf:

global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxxxxxx'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'ops'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

receivers:
  - name: 'ops'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
    # webhook_configs:
    #   - url: 'http://10.2.0.222:18060/dingtalk/webhook/send'
    #     send_resolved: true
Start the monitoring stack on the Monitor Server:

sudo docker compose up -d
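To check the notification path end to end, a throwaway alert can be pushed straight to Alertmanager's v2 API (a sketch of my own; the label values match the rule above):

# Send a synthetic alert to Alertmanager; it should reach the 'ops' email receiver
curl -X POST http://10.2.0.102:19093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "agent-down-test", "severity": "critical", "team": "ops"},
        "annotations": {"summary": "Test alert pushed via curl"}}]'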
Import a dashboard template:

Log in to Grafana at http://10.2.0.102:13000 and add Prometheus (http://10.2.0.102:19090) as a data source, then search https://grafana.com/grafana/dashboards/ for a suitable template and import it by ID; the ID I used is 15894.

After the import, wait a moment and the monitoring data will appear.
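Adding the Prometheus data source can also be scripted through Grafana's HTTP API instead of the UI; a minimal sketch, assuming the default admin credentials and the ports used in this post:

# Register Prometheus as the default data source in Grafana
curl -s -u admin:admin -X POST http://10.2.0.102:13000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name": "Prometheus", "type": "prometheus", "access": "proxy",
       "url": "http://10.2.0.102:19090", "isDefault": true}'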
A few things that could still be improved:

- Prometheus storage and high-availability configuration
- Alertmanager high-availability configuration
- Firewall rules to restrict access to the exposed ports
- Beyond monitoring Docker, node-exporter can be added to monitor the hosts themselves (a sketch follows below)
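For the node-exporter idea above, the upstream image can be run alongside cAdvisor on each host and added as another scrape job; a minimal sketch, not part of this deployment:

# Run node-exporter with read-only access to the host's /proc, /sys and filesystems
sudo docker run -d --name node-exporter --restart=always \
  --net=host --pid=host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter:latest --path.rootfs=/host

A matching job would then be appended to scrape_configs in prometheus.yml, pointing at port 9100 on each host.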
5、References
https://grafana.com/
https://github.com/google/cadvisor
https://prometheus.io/docs/introduction/overview/
https://grafana.com/docs/grafana-cloud/quickstart/docker-compose-linux/
https://dzlab.github.io/monitoring/2021/12/30/monitoring-stack-docker/
https://www.prometheus.io/webtools/alerting/routing-tree-editor/
https://grafana.com/grafana/dashboards/15894