vlambda Blog

Monitoring Docker with the Latest Grafana + Prometheus + cAdvisor

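I have recently been reworking our company's internal server operations and the software deployment of our production environment, and Docker is being used more and more. To keep track of how the containers are running, Docker monitoring became a must.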

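This article uses open-source tools to build a Docker monitoring system that displays Docker runtime metrics and raises alerts on anomalies.
Most guides online use the open-source Grafana + InfluxDB + cAdvisor stack, usually with an old InfluxDB version. I tried the latest InfluxDB and got it working, but could not find a suitable Grafana dashboard template for it, so I switched to Grafana + Prometheus + cAdvisor.
[Figure: screenshot of the resulting dashboard]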
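1. Docker Monitoring Architecture
The architecture of the whole Docker monitoring system is as follows (clustered deployment of the Server is not considered for now):
[Figure: architecture diagram of the Docker monitoring system]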

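cAdvisor is deployed on every machine that runs Docker containers (the Agents) and collects container monitoring data.

Prometheus pulls the monitoring data from cAdvisor on each Agent, stores it locally, and, based on the alert rules, pushes alerts to Alertmanager.

Alertmanager receives the alerts pushed by Prometheus and handles alert management and notification.

Grafana reads the monitoring data from Prometheus, builds dashboards from it, and provides a web interface for viewing them.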

2. Deployment Overview

1) System and software versions

  • CentOS 7

  • Docker version 20.10.2

  • docker-compose version 1.27.4, build 40524192

  • Grafana 8.5.1

  • Prometheus 2.35.0

  • cAdvisor v0.39.3

  • Alertmanager 0.24.0

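In this deployment everything is installed and configured with Docker. Grafana, Prometheus, cAdvisor, and Alertmanager all use the latest image available at pull time.
2) Deployment nodes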

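| Node       | Software deployed                            | Role           |
|------------|----------------------------------------------|----------------|
| 10.2.0.102 | Grafana, Prometheus, cAdvisor, Alertmanager  | Monitor Server |
| 10.2.0.103 | cAdvisor                                     | Monitor Agent  |
| 10.2.0.104 | cAdvisor                                     | Monitor Agent  |
| 10.2.0.105 | cAdvisor                                     | Monitor Agent  |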

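3. Deployment Procedure
3.1. Installing and Configuring the Monitor Agent
The Monitor Agent runs cAdvisor to collect container data and expose it to Prometheus.
1) Configure docker-compose.yml
Create the following file on each of the three Agent servers listed above: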
version: '3.4'
services:
  cadvisor:
    # Note: the Docker Hub image google/cadvisor is no longer updated;
    # recent releases such as v0.39.3 are published as gcr.io/cadvisor/cadvisor.
    image: google/cadvisor:latest
    hostname: cadvisor
    container_name: cadvisor
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "18080:8080"
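2) Start
Run the following command on each Agent to start cAdvisor (with the standalone docker-compose 1.27.4 binary listed above, the equivalent command is `docker-compose up -d`):
$ sudo docker compose up -d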
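Once an Agent is up, cAdvisor serves its metrics in the Prometheus text format at `/metrics` on the mapped port (18080 here). The following is a minimal sketch, not part of the original setup, for checking what an Agent exports. The helper names `fetch_metrics` and `metric_names` are hypothetical, and the parsing only extracts metric names; it is not a full parser for the exposition format.

```python
import urllib.request


def fetch_metrics(host, port=18080, timeout=5.0):
    """Fetch the raw text exposition from a cAdvisor instance (needs network access)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition):
    """Return the distinct metric names found in a Prometheus text exposition."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or at whitespace (value).
        names.add(line.split("{", 1)[0].split()[0])
    return names
```

Calling `metric_names(fetch_metrics("10.2.0.103"))` on a healthy Agent should return a set that includes container metrics such as `container_memory_usage_bytes`.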
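3.2. Installing and Configuring the Monitor Server
The Monitor Server consists of three main parts: Grafana + Prometheus + Alertmanager.
The Server also runs cAdvisor to monitor its own containers (the Server's docker-compose.yml below does not define it, so the cadvisor service from section 3.1 presumably has to be started on the Server as well).
The directory layout is as follows:
/data/docker-monitor/
├── docker-compose.yml
├── grafana/
│   └── data/
└── prometheus/
    ├── data/
    └── config/
        ├── alertmanager.conf
        ├── prometheus.yml
        └── rules/
            └── docker_rules.yml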
1) Configure docker-compose.yml
Create the following file on the Server:
version: '3.4'
services:
  grafana:
    image: grafana/grafana-oss
    container_name: grafana
    hostname: grafana
    restart: always
    # user: "0"
    ports:
      - "13000:3000"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/grafana/data:/var/lib/grafana

  prometheus:
    image: prom/prometheus:latest
    hostname: prometheus
    container_name: prometheus
    restart: always
    user: "0"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/docker-monitor/prometheus/config/rules:/etc/prometheus/rules
      - /data/docker-monitor/prometheus/data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    ports:
      - 19090:9090

  alertmanager:
    container_name: alertmanager
    hostname: alertmanager
    image: prom/alertmanager
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/docker-monitor/prometheus/config/alertmanager.conf:/etc/alertmanager/alertmanager.conf
    command:
      - '--config.file=/etc/alertmanager/alertmanager.conf'
    ports:
      - 19093:9093
2) Configure Prometheus
The configuration file lives on the host at `docker-monitor/prometheus/config/prometheus.yml` and is mounted into the prometheus container.
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # alertmanager url
            - 10.2.0.102:19093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"

# Scrape configurations: one job for Prometheus itself and one for the cAdvisor instances.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["10.2.0.102:19090"]
  - job_name: "cadvisor"
    static_configs:
      - targets: ['10.2.0.102:18080', '10.2.0.103:18080', '10.2.0.104:18080', '10.2.0.105:18080']
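After Prometheus is running, its HTTP API endpoint `/api/v1/targets` reports whether each configured scrape target is reachable. The sketch below is an illustration rather than part of the original walkthrough. It assumes Prometheus is reachable at `http://10.2.0.102:19090` (the host port mapped in docker-compose.yml), and the helper names `fetch_targets` and `target_health` are made up for this example.

```python
import json
import urllib.request

# Assumption: host and port taken from the docker-compose.yml port mapping.
PROMETHEUS_URL = "http://10.2.0.102:19090"


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5.0):
    """Call the Prometheus targets API and return the decoded JSON (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def target_health(payload):
    """Map (job, instance) to the health string reported for each active target."""
    health = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        health[(labels["job"], labels["instance"])] = target["health"]
    return health
```

With the configuration above, `target_health(fetch_targets())` should contain one entry for the `prometheus` job and four for the `cadvisor` job, each with health `up` once the first scrape has succeeded.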
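3) Prometheus alert rule configuration
The Prometheus alerting mechanism consists of two parts:
  • Alert rules: Prometheus evaluates the rules loaded via rule_files and sends the resulting alerts to Alertmanager
  • Alert management: Alertmanager receives those alerts and handles alert management and notification

Create the Prometheus alert rule file `docker_rules.yml`: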
groups:
  - name: agent-down
    rules:
      - alert: agent-down
        expr: up{job="cadvisor"} == 0
        for: 15s
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Agent {{ $labels.job }} down!"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 15 seconds."
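The rule's expression can be tried out before relying on the alert, by sending it to the Prometheus instant-query API (`/api/v1/query`). The following sketch assumes the Server address used in this article. `query_url` and `down_instances` are hypothetical helper names, and the code only covers the vector result shape that this expression returns.

```python
import json
import urllib.parse
import urllib.request

# The same expression as in docker_rules.yml.
AGENT_DOWN_EXPR = 'up{job="cadvisor"} == 0'


def query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(base_url, expr, timeout=5.0):
    """Execute the query against Prometheus (needs network access)."""
    with urllib.request.urlopen(query_url(base_url, expr), timeout=timeout) as resp:
        return json.load(resp)


def down_instances(payload):
    """Return the sorted instance labels of every series in a query result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload!r}")
    return sorted(series["metric"]["instance"] for series in payload["data"]["result"])
```

For example, `down_instances(run_query("http://10.2.0.102:19090", AGENT_DOWN_EXPR))` returns an empty list while all Agents are reachable, and lists the affected `host:port` pairs otherwise.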
4) Alertmanager configuration
The configuration file lives on the host at `docker-monitor/prometheus/config/alertmanager.conf` and is mounted into the alertmanager container.
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxxxxxx'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'ops'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

receivers:
  - name: 'ops'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
    # webhook_configs:
    #   - url: 'http://10.2.0.222:18060/dingtalk/webhook/send'
    #     send_resolved: true
5) Start the Server
Run the following command on the Server to start Grafana, Prometheus, and Alertmanager:
$ sudo docker compose up -d
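To confirm that the three services came up, each one offers a lightweight health endpoint: `/api/health` for Grafana and `/-/healthy` for Prometheus and Alertmanager. The following is a rough sketch using the host ports from the docker-compose.yml above. The names `HEALTH_ENDPOINTS`, `is_healthy`, and `check_all` are assumptions made for this example.

```python
import urllib.error
import urllib.request

# Assumption: Server address and host ports as used in this article.
SERVER = "10.2.0.102"

HEALTH_ENDPOINTS = {
    "grafana": f"http://{SERVER}:13000/api/health",
    "prometheus": f"http://{SERVER}:19090/-/healthy",
    "alertmanager": f"http://{SERVER}:19093/-/healthy",
}


def is_healthy(url, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200 (needs network access)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_all(endpoints=HEALTH_ENDPOINTS, probe=is_healthy):
    """Probe every endpoint and return a name -> bool mapping."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Running `check_all()` from a machine that can reach the Server returns a dictionary with one boolean per service. The `probe` parameter exists only so the function can be exercised without a network.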
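6) Configure Grafana
Open http://10.2.0.102:13000 and log in with the default credentials (username and password are both admin).
Under Data Sources, add a Prometheus data source.
Then set the Prometheus connection parameters; with the port mapping above, the server URL would be http://10.2.0.102:19090.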
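The data source can also be created without the web UI, through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request and is not what the original setup did. It assumes the default admin/admin credentials are still valid (Grafana asks you to change them on first login) and that Prometheus is reachable at the URL passed in. `datasource_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, prometheus_url, user="admin", password="admin"):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",   # Grafana's backend queries Prometheus on the browser's behalf
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the result of `datasource_request("http://10.2.0.102:13000", "http://10.2.0.102:19090")` to `urllib.request.urlopen` sends the request. Grafana is expected to reject it with an HTTP error if a data source with the same name already exists.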

Import a dashboard template:

Search https://grafana.com/grafana/dashboards/ for a suitable template and import it by entering its ID; I used ID 15894.

A short while after the import, the monitoring data appears on the dashboard.

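3.3. Alert Test
Stop cAdvisor on one of the Agents, and an alert email should arrive.
Alertmanager also supports custom notification templates, which can make alert messages easier to read.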
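The mail path can also be exercised without stopping an Agent, by posting a synthetic alert directly to Alertmanager's v2 API (`POST /api/v2/alerts`). This is a minimal sketch under the assumption that Alertmanager listens on `http://10.2.0.102:19093`. The helper names and the annotation text are made up, while the labels mirror those of the `agent-down` rule so that the alert follows the same route.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance, minutes=5, now=None):
    """Build a one-element alert list that resolves itself after `minutes`."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "agent-down",
            "severity": "critical",
            "team": "ops",
            "job": "cadvisor",
            "instance": instance,
        },
        "annotations": {"summary": "Synthetic test alert, please ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def alert_request(alertmanager_url, alerts):
    """Build the POST request that submits the alerts to Alertmanager."""
    return urllib.request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending `alert_request("http://10.2.0.102:19093", build_test_alert("10.2.0.103:18080"))` with `urllib.request.urlopen` should produce a notification to the `ops` receiver after the configured `group_wait`.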
At this point a basic Docker monitoring system is up and running. Many details can still be tuned.
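4. Things to Watch for in Production
Before deploying this in production, consider the following:
  • Storage and high availability for Prometheus

  • High availability for Alertmanager

  • Firewall rules that restrict access to each exposed port

  • Besides monitoring Docker, node-exporter can be used to monitor the state of the machines themselves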

5. References

  • https://grafana.com/

  • https://github.com/google/cadvisor

  • https://prometheus.io/docs/introduction/overview/

  • https://grafana.com/docs/grafana-cloud/quickstart/docker-compose-linux/

  • https://dzlab.github.io/monitoring/2021/12/30/monitoring-stack-docker/

  • https://www.prometheus.io/webtools/alerting/routing-tree-editor/

  • https://grafana.com/grafana/dashboards/15894