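A Brief Look at Prometheus, the Open-Source Monitoring and Alerting System
Preface
Monitoring Architecture
The overall flow breaks down into four stages: collecting data, storing data, displaying monitoring data, and alerting. The core components are Exporters, Prometheus Server, AlertManager, and PushGateway:
Exporters: monitoring data collectors that expose their data to Prometheus Server over HTTP;
Prometheus Server: responsible for fetching, storing, and querying monitoring data. Scraped data must be in the specified metrics format for Prometheus to process it. For querying, Prometheus provides PromQL to query and aggregate the data, and it also ships with its own Web UI;
AlertManager: Prometheus supports defining alerting rules with PromQL; when a rule's condition is met, an alert is created and the rest of the alerting flow is handed to AlertManager, which offers several notification channels, including email and webhook;
PushGateway: normally Prometheus Server communicates with each Exporter directly and pulls its data; when the network setup does not allow this, PushGateway can act as a relay: clients push their metrics to it, and Prometheus scrapes the gateway instead.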
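To make the pull model concrete, here is a minimal sketch of an exporter written with only the Python standard library. The metric name `demo_queue_messages`, the queue names, and port 8000 are assumptions chosen for illustration; a real exporter would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory data standing in for a real monitored system.
QUEUE_DEPTH = {"orders": 3, "payments": 0}


def render_metrics(depths):
    """Render queue depths in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_queue_messages Messages currently waiting in each queue",
        "# TYPE demo_queue_messages gauge",
    ]
    for queue, depth in sorted(depths.items()):
        lines.append(f'demo_queue_messages{{queue="{queue}"}} {float(depth)}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current metric values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(QUEUE_DEPTH).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint, after which Prometheus Server could be pointed at `http://<host>:8000/metrics` as a scrape target.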
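Collecting Data
Monitoring data comes from many sources: system-level data such as a node's CPU and I/O, middleware such as MySQL and message queues, process-level data such as the JVM, and business metrics. Apart from business metrics, which can differ from system to system, the rest is much the same everywhere. Exporters therefore fall into two groups: community-provided and user-defined.
Exporter Sources
Community-Provided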
Scope | Common Exporters
User-Defined
How Exporters Run
Standalone
Integrated into the Application
Data Format
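<metric name>{<label name>=<label value>, ...}
metric name: the name of the metric, which mainly reflects what the monitored sample means; it must match the pattern
[a-zA-Z_:][a-zA-Z0-9_:]*
label name: a label reflects one characteristic dimension of the sample; it must match the pattern
[a-zA-Z_][a-zA-Z0-9_]*
label value: the value of each label; its format is not restricted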
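The naming rules can be checked mechanically. The sketch below assumes the classic Prometheus naming patterns, `[a-zA-Z_:][a-zA-Z0-9_:]*` for metric names and `[a-zA-Z_][a-zA-Z0-9_]*` for label names; the function names are made up for this example, and newer Prometheus versions may accept a wider character set.

```python
import re

# Classic naming patterns from the Prometheus data model (assumed here).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """Return True if the whole string is a valid classic metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    """Return True if the whole string is a valid classic label name."""
    return LABEL_NAME_RE.fullmatch(name) is not None
```

For example, `jvm_memory_max_bytes` passes as a metric name, while a name that starts with a digit or contains a hyphen does not.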
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Eden Space",} 1.033895936E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Survivor Space",} 2621440.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Old Gen",} 2.09190912E9
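To show how a sample line decomposes into name, labels, and value, here is a small parsing sketch. It is a simplification under stated assumptions: it handles lines shaped like the ones above, and it does not support escaped quotes inside label values or an optional trailing timestamp. `parse_sample` is a hypothetical helper, not part of any Prometheus library.

```python
import re

# name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>.*)\})?"
    r"\s+(?P<value>\S+)$"
)
# One label pair: label_name="label value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one exposition-format sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))
```

Lines that start with `#` (the HELP and TYPE comments) are not samples and make this helper raise `ValueError`, so a caller would skip them first.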
Data Types
Counter
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 6.3123664E9
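A counter only ever goes up, except when the process restarts and it starts again from zero. The sketch below shows one way to compute the total increase across a series of scraped counter values while treating any drop as a reset. It is a simplified illustration; PromQL's own `increase()` additionally extrapolates to the edges of the time window, which this code does not attempt.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples.

    Any drop between two samples is treated as a counter reset.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from zero, so the new value is all growth.
            total += current
    return total
```

With the scraped values `[10, 15, 3, 8]`, the increase is 5 before the reset, then 3 and 5 after it.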
Gauge
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="springboot-actuator-prometheus-test",} 20.0
Histogram and Summary
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.008
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 38.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 0.134
jvm_gc_pause_seconds_count{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.073
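The `_sum` and `_count` series of a summary are typically combined to get an average. As a worked example using the "Allocation Failure" minor GC lines above (38 pauses, 0.134 seconds in total), the helper below divides the two; its name is an assumption made for this sketch.

```python
def average_seconds(total_seconds, count):
    """Average duration per event from a summary's _sum and _count values."""
    if count == 0:
        return 0.0
    return total_seconds / count


# 0.134 s spread over 38 pauses is roughly 3.5 ms per pause.
average_pause = average_seconds(0.134, 38)
```

In PromQL the same idea is usually written over a time window, for example `rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])`.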
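Displaying Data
Grafana is an open-source application written in Go that can pull data from a variety of data sources, such as Elasticsearch, Prometheus, Graphite, and InfluxDB, and visualize it in polished graphs;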
Prometheus UI
Grafana
Introduction to PromQL
Operators
rabbitmq_queue_messages > 0
Aggregation Functions
sum (sum), min (minimum), max (maximum), avg (average), and so on;
sum(rabbitmq_queue_messages) > 0
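The two queries behave differently: a comparison filters individual series, while `sum` first collapses all series into one value. The Python sketch below imitates both on a hypothetical set of samples. The queue names and values are invented for illustration, and this is only a model of the semantics, not how Prometheus evaluates queries internally.

```python
# Each sample is (labels, value); these values are invented for illustration.
SAMPLES = [
    ({"queue": "orders"}, 120.0),
    ({"queue": "payments"}, 0.0),
    ({"queue": "emails"}, 45.0),
]


def filter_greater_than(samples, threshold):
    """Imitate `metric > threshold`: keep only series whose value exceeds it."""
    return [(labels, value) for labels, value in samples if value > threshold]


def sum_values(samples):
    """Imitate `sum(metric)`: collapse all series into a single total."""
    return sum(value for _, value in samples)
```

Here `filter_greater_than(SAMPLES, 0)` keeps the two non-empty queues, whereas `sum_values(SAMPLES)` yields the single total 165.0, which can then be compared against a threshold.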
Alerting
Custom Alerting Rules
groups:
  - name: queue-messages-warning
    rules:
      - alert: queue-messages-warning
        expr: sum(rabbitmq_queue_messages{job='rabbit-state-metrics'}) > 500
        labels:
          team: webhook-warning
        annotations:
          summary: High queue-messages usage detected
          threshold: 500
          current: '{{ $value }}'
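alert: the name of the alerting rule;
expr: the trigger condition for the alert, written as a PromQL expression;
labels: custom labels, which AlertManager uses to route the alert to a specific receiver;
annotations: a set of additional information, such as text describing the alert in detail;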
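To illustrate how these four fields fit together, the following toy evaluator turns a rule plus a current value into an alert. It is only a sketch under simplifying assumptions: the rule is represented as a plain dictionary with a numeric `threshold` instead of a PromQL expression, and the `{{ $value }}` template is handled by simple string replacement. Real rule evaluation happens inside Prometheus Server.

```python
# A simplified stand-in for the rule above; "threshold" replaces the PromQL expr.
RULE = {
    "alert": "queue-messages-warning",
    "threshold": 500,
    "labels": {"team": "webhook-warning"},
    "annotations": {
        "summary": "High queue-messages usage detected",
        "current": "{{ $value }}",
    },
}


def evaluate_rule(rule, current_value):
    """Return an alert dict if the value exceeds the threshold, else None."""
    if current_value <= rule["threshold"]:
        return None
    annotations = {
        key: text.replace("{{ $value }}", str(current_value))
        for key, text in rule["annotations"].items()
    }
    # The rule name becomes the alertname label, alongside the custom labels.
    labels = {"alertname": rule["alert"], **rule["labels"]}
    return {"labels": labels, "annotations": annotations}
```

With a current value of 730.0 the function returns an alert whose `current` annotation reads "730.0"; with 120.0 it returns `None`.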
AlertManager
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 5m
  group_by:
    - alertname
  routes:
    - receiver: webhook
      group_wait: 10s
      match:
        team: webhook-warning
receivers:
  - name: webhook
    webhook_configs:
      - url: 'http://ip:port/api/v1/monitor/alert-receiver'
        send_resolved: true
More: https://prometheus.io/docs/alerting/latest/overview/
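On the receiving side, the webhook endpoint gets a JSON document whose `alerts` array carries the status, labels, and annotations of each alert. The sketch below shows how a receiver might turn that body into readable lines. `summarize_alerts` is a hypothetical helper, and it reads only a few of the fields in the payload.

```python
import json


def summarize_alerts(payload_bytes):
    """Turn an AlertManager webhook body into one readable line per alert."""
    payload = json.loads(payload_bytes)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")
        name = labels.get("alertname", "?")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    return lines
```

Because `send_resolved: true` is set in the configuration, the same endpoint would also receive alerts whose status is `resolved`, which this helper reports in the same format.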
Installation and Configuration
Prometheus and AlertManager
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '18'
  generation: 18
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - image: 'prom/prometheus:latest'
          imagePullPolicy: Always
          name: prometheus-0
          ports:
            - containerPort: 9090
              name: p-port
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/prometheus
              name: config-volume
        - image: 'prom/alertmanager:latest'
          imagePullPolicy: Always
          name: prometheus-1
          ports:
            - containerPort: 9093
              name: a-port
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/alertmanager
              name: alertcfg
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: monitoring-nfs-pvc
        - configMap:
            defaultMode: 420
            name: prometheus-config
          name: config-volume
        - configMap:
            defaultMode: 420
            name: alert-config
          name: alertcfg
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - 'rabbitmq_warn.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']
scrape_configs:
  - job_name: 'rabbit-state-metrics'
    static_configs:
      - targets: ['ip:port']
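Viewing Exporters
All current exporters are listed on the Status > Targets page. If every target's state is UP, Prometheus is able to scrape their monitoring data, such as the RabbitMQ metrics configured here;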
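The same target health information is also available from the Prometheus HTTP API at `/api/v1/targets`, which is convenient for scripted checks. The sketch below assumes the response body has already been fetched as text and lists the scrape URLs of targets that are not up; `unhealthy_targets` is a name invented for this example.

```python
import json


def unhealthy_targets(api_response_text):
    """Return the scrape URLs of active targets whose health is not "up"."""
    response = json.loads(api_response_text)
    targets = response["data"]["activeTargets"]
    return [target["scrapeUrl"] for target in targets if target.get("health") != "up"]
```

An empty result corresponds to every target showing UP on the Targets page.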
Viewing Alerts
Grafana
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '1'
  generation: 1
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - image: grafana/grafana
          imagePullPolicy: Always
          name: grafana
          ports:
            - containerPort: 3000
              protocol: TCP
          resources: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
Exporter
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  labels:
    k8s-app: rabbitmq-exporter
  name: rabbitmq-exporter
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      k8s-app: rabbitmq-exporter
  template:
    metadata:
      labels:
        k8s-app: rabbitmq-exporter
    spec:
      containers:
        - env:
            - name: PUBLISH_PORT
              value: '9098'
            - name: RABBIT_CAPABILITIES
              value: 'bert,no_sort'
            - name: RABBIT_USER
              value: xxxx
            - name: RABBIT_PASSWORD
              value: xxxx
            - name: RABBIT_URL
              value: 'http://ip:15672'
          image: kbudde/rabbitmq-exporter
          imagePullPolicy: IfNotPresent
          name: rabbitmq-exporter
          ports:
            - containerPort: 9098
              protocol: TCP
Micrometer
Adding the Dependency
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
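PrometheusMeterRegistry and CollectorRegistry are used to collect and export metrics data in a format that Prometheus can scrape;
The metrics are exposed through the /prometheus endpoint, which Prometheus can scrape periodically to fetch the metrics data.
The prometheus Endpoint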
# HELP tomcat_sessions_created_sessions_total
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total{application="springboot-actuator-prometheus-test",} 1782.0
# HELP tomcat_sessions_active_current_sessions
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions{application="springboot-actuator-prometheus-test",} 365.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads{application="springboot-actuator-prometheus-test",} 16.0
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage{application="springboot-actuator-prometheus-test",} 0.0102880658436214
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 9.13812704E8
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="mapped",} 0.0
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="direct",} 10.0
...
Configuring the Target in Prometheus
- job_name: 'springboot-actuator-prometheus-test'
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s
  basic_auth:
    username: 'actuator'
    password: 'actuator'
  static_configs:
    - targets: ['ip:8080']
curl -X POST http://ip:9090/-/reload
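The same reload can be triggered from a script. The sketch below builds and sends the POST request with the Python standard library; the function names and the timeout are assumptions for this example. Note that Prometheus only serves the `/-/reload` endpoint when it is started with the `--web.enable-lifecycle` flag.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the Prometheus /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url, timeout=5.0):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `reload_prometheus("http://ip:9090")`, with the real address substituted for the placeholder, would ask the server to re-read its configuration, including newly added scrape targets.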
Grafana
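Business Instrumentation
Counter: can be incremented by a fixed amount, which must be positive;
Gauge: a handle for getting the current value. Typical examples are the size of a collection or map, or the number of running threads;
Timer: measures short-duration latencies and the frequency of such events. Every Timer implementation reports at least the total time and the event count as separate time series;
LongTaskTimer: tracks the total duration of all in-flight long-running tasks and the number of such tasks;
DistributionSummary: tracks the distribution of events.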
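The semantics of the first and third meter types can be modeled in a few lines. The classes below are illustrative Python stand-ins, not the Micrometer API: `SimpleCounter` and `SimpleTimer` are invented names, and rejecting non-positive increments with an exception is a choice made for this sketch.

```python
import time


class SimpleCounter:
    """A value that can only grow, by a positive amount each time."""

    def __init__(self):
        self.count = 0.0

    def increment(self, amount=1.0):
        if amount <= 0:
            raise ValueError("a counter can only be incremented by a positive amount")
        self.count += amount


class SimpleTimer:
    """Tracks how many calls were timed and their total duration."""

    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def record(self, func, *args, **kwargs):
        """Run func, record its duration even if it raises, and return its result."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.count += 1
            self.total_seconds += time.perf_counter() - start
```

The timer keeps exactly the two values a Timer is described as reporting, the event count and the total time, from which an average latency can be derived.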
Conclusion