vlambda博客
学习文章列表

一文说透MySQL的监控



感谢 MySQL Prometheus Exporter,让监控 MySQL 变得非常容易。MySQL 本身已经无需介绍了,它是世界上最被广泛使用的关系型数据库之一,而且开源!因为 MySQL 非常流行,所以 MySQL 背后的社区也非常庞大,你的需求,基本上社区都有解决方案。本文就来介绍一下如何使用 Prometheus 和 Nightingale 来监控 MySQL。


开始配置


监控 MySQL 的核心原理,就是连上 MySQL,然后执行一些命令(比如 `show global status`)获取指标数据,那要连上 MySQL 是需要用户名和密码的,所以我们这里要先创建一个账号专门给采集器来用,这个账号权限控制的小一些,提高安全性:



CREATE USER 'exporter'@'%' IDENTIFIED BY 'YOUR-PASSWORD' WITH MAX_USER_CONNECTIONS 3;GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';


配置采集器


这里采用 grafana-agent 作为采集器,也可以直接部署 mysqld_exporter 然后在 prometheus.yaml 中配置 scrape 规则,grafana-agent 内置了一些常用的 exporter,省的去找其他的 exporter 了。grafana-agent 采集 mysql 的配置样例如下:


integrations: mysqld_exporter: enabled: true data_source_name: exporter:YOUR-PASSWORD@tcp(127.0.0.1:3306)/ disable_collectors: ["info_schema.innodb_cmpmem", "info_schema.innodb_cmp"] relabel_configs: - source_labels: [__address__] target_label: instance replacement: cloud-s3-mysql01.bj metric_relabel_configs: - action: drop source_labels: [command] regex: '^(xa|alter|create|show|drop|start|stop|re).+' - action: drop source_labels: [__name__] regex: '^(mysql_global_status_acl|mysql_global_status_feature|mysql_global_status_innodb_encryption|mysql_global_status_performance|mysql_global_status_rpl|mysql_global_variables_aria|mysql_global_variables_performance_schema|mysql_global_variables_rpl|mysql_global_variables_wsrep).+'



grafana-agent 的配置中,有个 integrations 的部分,放置各类内置的 exporter 的配置,mysqld_exporter 的相关配置就是在这个 section 内,mysqld_exporter 内置启用了一些 collector,相关代码如下:


&collector.ScrapeAutoIncrementColumns{}: false,&collector.ScrapeBinlogSize{}: false,&collector.ScrapeClientStat{}: false,&collector.ScrapeEngineInnodbStatus{}: false,&collector.ScrapeEngineTokudbStatus{}: false,&collector.ScrapeGlobalStatus{}: true,&collector.ScrapeGlobalVariables{}: true,&collector.ScrapeInfoSchemaInnodbTablespaces{}: false,&collector.ScrapeInnodbCmpMem{}: true,&collector.ScrapeInnodbCmp{}: true,&collector.ScrapeInnodbMetrics{}: false,&collector.ScrapePerfEventsStatementsSum{}: false,&collector.ScrapePerfEventsWaits{}: false,&collector.ScrapePerfFileEvents{}: false,&collector.ScrapePerfIndexIOWaits{}: false,&collector.ScrapePerfReplicationApplierStatsByWorker{}: false,&collector.ScrapePerfReplicationGroupMemberStats{}: false,&collector.ScrapePerfReplicationGroupMembers{}: false,&collector.ScrapePerfTableIOWaits{}: false,&collector.ScrapePerfTableLockWaits{}: false,&collector.ScrapeQueryResponseTime{}: true,&collector.ScrapeReplicaHost{}: false,&collector.ScrapeSchemaStat{}: false,&collector.ScrapeSlaveHosts{}: false,&collector.ScrapeSlaveStatus{}: true,&collector.ScrapeTableStat{}: false,&collector.ScrapeUserStat{}: false,


如果大家觉得哪个默认启用的 collector 不需要,可以通过 disable_collectors 来 disable 掉,如果觉得哪个默认关闭的 collector 需要打开,可以通过 enable_collectors 来 enable,上例中是把 `info_schema.innodb_cmpmem`, `info_schema.innodb_cmp` 给 disable 了


另外 relabel_configs 中的配置是设置了一个较为易读很能代表数据库作用的 instance 标签值,metric_relabel_configs 则把一些不关键的指标给 drop 掉了,节省时序存储的成本。


本人对 MySQL 了解有限,不是专业的 DBA,如果有哪位仁兄觉得上面的配置可以优化改进,欢迎讨论哈。


核心指标 1:可用性


有两个关键指标来标识 MySQL 实例的可用性,一个是 mysql_up,如果这个值为 0,表示 exporter 无法访问数据库实例了,可以配置一条告警规则:


mysql_up == 0


那如果 MySQL 实例重启速度比较快,exporter 去探测的时候,恰好已经重启完了,那 mysql_up 指标就无法发现了,如果您觉得这种情况也需要了解到,mysql_global_status_uptime 这个指标可以帮到你,这个指标表示 MySQL 实例的启动时长,可以配置一条 info 级别的告警规则:


mysql_global_status_uptime < 1800


核心指标 2:连接相关


数据库中错误的主要来源之一是连接错误。mysql_global_status_connection_errors_total 允许您检测数据库何时生成这些错误:


rate(mysql_global_status_connection_errors_total[5m]) > 0



连接错误的一个常见原因是缺少可用的连接。您可以检查可用连接的百分比,如果当前连接数超过了最大允许连接的 80% 就告警:


100 * mysql_global_status_threads_connected / mysql_global_variables_max_connections > 80


关于数据库最大连接的问题,上一篇文章《》做了详细解释,有兴趣的小伙伴可以查阅


核心指标 3:慢查询


和许多数据库一样,MySQL 也会记录慢查询的日志。此日志中的条目数可以通过 mysql_global_status_slow_queries 来查询。您可以使用以下 PromQL 创建告警,以便在慢查询日志中有新条目时发出通知,这可能意味着存在性能问题:


rate(mysql_global_status_slow_queries[5m]) > 3


核心指标 4:Cache 命中率


MySQL使用内存缓存优化磁盘读写操作。缓存低命中率将影响数据库的性能。此 PromQL 将为您提供 open tables 缓存命中率的值,分子是 table_open_cache 命中的数量,分母是命中的数量+未命中的数量,即总数:


rate(mysql_global_status_table_open_cache_hits[5m]) /(rate(mysql_global_status_table_open_cache_hits[5m]) + rate(mysql_global_status_table_open_cache_misses[5m]))


同时,您也可以通过如下 PromQL 监控 buffer pool cache,分子是逻辑读请求数减去InnoDB 无法从 buffer pool 满足,需要直接读取硬盘的数量,分母是逻辑读请求数:


(rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) -rate(mysql_global_status_innodb_buffer_pool_reads[5m]))/rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))


核心指标 5:查询类型


对于各种类型的 SELECT,可以通过如下一些监控指标获取其统计值:


- mysql_global_status_select_full_join: The number of joins that perform table scans because they do not use indexes. If this value is not 0, you should carefully check the indexes of your tables.(Counter)

- mysql_global_status_select_full_range_join: The number of joins that used a range search on a reference table.(Counter)

- mysql_global_status_select_range: The number of joins that used ranges on the first table. This is normally not a critical issue even if the value is quite large.(Counter)

- mysql_global_status_select_range_check: The number of joins without keys that check for key usage after each row. If this is not 0, you should carefully check the indexes of your tables.(Counter)

- mysql_global_status_select_scan: The number of joins that did a full scan of the first table.(Counter)


这些值都是 Counter 类型,所以要使用 rate 函数,比如:


rate(mysql_global_status_select_full_join[5m]) > 10


类似的,我们还要关注 sort 相关的指标,sort 也是影响查询性能的一个关键方面,相关指标如下:


- mysql_global_status_sort_rows: The number of sorted rows.(Counter)

- mysql_global_status_sort_range: The number of sorts that were done using ranges.(Counter)

- mysql_global_status_sort_merge_passes: The number of merge passes that the sort algorithm has had to do. If this value is large, you should consider increasing the value of the sort_buffer_size system variable.(Counter)

- mysql_global_status_sort_scan: The number of sorts that were done by scanning the table.(Counter)


告警规则


这里我把常用的的告警规则整理到夜莺里并导出为 JSON 了,大家如果使用 mysqld_exporter 或者 grafana-agent,则可以直接导入下面的 JSON 作为告警规则使用:


[ { "name": "MysqlInnodbLogWaits", "note": "MySQL innodb log writes stalling", "severity": 2, "disabled": 0, "prom_for_duration": 0, "prom_ql": "rate(mysql_global_status_innodb_log_waits[15m]) > 10", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlInnodbLogWaits" ] }, { "name": "MysqlSlaveIoThreadNotRunning", "note": "MySQL Slave IO thread not running", "severity": 1, "disabled": 0, "prom_for_duration": 0, "prom_ql": "mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlSlaveIoThreadNotRunning" ] }, { "name": "MysqlSlaveReplicationLag", "note": "", "severity": 1, "disabled": 0, "prom_for_duration": 60, "prom_ql": "mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 30", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlSlaveReplicationLag" ] }, { "name": "MysqlSlaveSqlThreadNotRunning", "note": "MySQL Slave SQL thread not running", "severity": 1, "disabled": 0, "prom_for_duration": 0, "prom_ql": "mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlSlaveSqlThreadNotRunning" ] }, { "name": "Mysql刚刚有重启,请注意", "note": "MySQL has just been restarted, less than one minute ago", "severity": 3, "disabled": 0, "prom_for_duration": 0, "prom_ql": "mysql_global_status_uptime < 60", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlRestarted" ] }, { "name": "Mysql实例挂了", "note": "", "severity": 1, "disabled": 0, "prom_for_duration": 0, "prom_ql": "mysql_up == 0", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlDown" ] }, { "name": "Mysql打开了很多文件句柄,请注意", "note": "More than 80% of MySQL files open", "severity": 2, "disabled": 0, "prom_for_duration": 120, "prom_ql": "avg by (instance) (mysql_global_status_innodb_num_open_files) / avg by (instance)(mysql_global_variables_open_files_limit) * 100 > 80", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlHighOpenFiles" ] }, { "name": "Mysql最近一分钟有慢查询出现", "note": "MySQL server mysql has some new slow query", "severity": 2, "disabled": 0, "prom_for_duration": 120, "prom_ql": "increase(mysql_global_status_slow_queries[1m]) > 0", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlSlowQueries" ] }, { "name": "Mysql有超过60%的连接是running状态", "note": "", "severity": 2, "disabled": 0, "prom_for_duration": 120, "prom_ql": "avg by (instance) (mysql_global_status_threads_running) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlHighThreadsRunning" ] }, { "name": "Mysql连接数已超过80%", "note": "More than 80% of MySQL connections are in use", "severity": 2, "disabled": 0, "prom_for_duration": 120, "prom_ql": "avg by (instance) (mysql_global_status_threads_connected) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80", "prom_eval_interval": 15, "enable_stime": "00:00", "enable_etime": "23:59", "enable_days_of_week": [ "1", "2", "3", "4", "5", "6", "0" ], "enable_in_bg": 0, "notify_recovered": 1, "notify_channels": [], "notify_repeat_step": 60, "recover_duration": 0, "callbacks": [], "runbook_url": "", "append_tags": [ "alertname=MysqlTooManyConnections" ] }]


另外,一些重要的监控指标的释义,也做了整理,大家可以直接放到夜莺的 etc 配置下的 metrics.yaml 中,这样在页面上就能看到相关指标的解释了:



# [mysqld_exporter]mysql_global_status_uptime: The number of seconds that the server has been up.(Gauge)mysql_global_status_uptime_since_flush_status: The number of seconds since the most recent FLUSH STATUS statement.(Gauge)mysql_global_status_queries: The number of statements executed by the server. This variable includes statements executed within stored programs, unlike the Questions variable. It does not count COM_PING or COM_STATISTICS commands.(Counter)mysql_global_status_threads_connected: The number of currently open connections.(Counter)mysql_global_status_connections: The number of connection attempts (successful or not) to the MySQL server.(Gauge)mysql_global_status_max_used_connections: The maximum number of connections that have been in use simultaneously since the server started.(Gauge)mysql_global_status_threads_running: The number of threads that are not sleeping.(Gauge)mysql_global_status_questions: The number of statements executed by the server. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the Queries variable. This variable does not count COM_PING, COM_STATISTICS, COM_STMT_PREPARE, COM_STMT_CLOSE, or COM_STMT_RESET commands.(Counter)mysql_global_status_threads_cached: The number of threads in the thread cache.(Counter)mysql_global_status_threads_created: The number of threads created to handle connections. If Threads_created is big, you may want to increase the thread_cache_size value. The cache miss rate can be calculated as Threads_created/Connections.(Counter)mysql_global_status_created_tmp_tables: The number of internal temporary tables created by the server while executing statements.(Counter)mysql_global_status_created_tmp_disk_tables: The number of internal on-disk temporary tables created by the server while executing statements. You can compare the number of internal on-disk temporary tables created to the total number of internal temporary tables created by comparing Created_tmp_disk_tables and Created_tmp_tables values.(Counter)mysql_global_status_created_tmp_files: How many temporary files mysqld has created.(Counter)mysql_global_status_select_full_join: The number of joins that perform table scans because they do not use indexes. If this value is not 0, you should carefully check the indexes of your tables.(Counter)mysql_global_status_select_full_range_join: The number of joins that used a range search on a reference table.(Counter)mysql_global_status_select_range: The number of joins that used ranges on the first table. This is normally not a critical issue even if the value is quite large.(Counter)mysql_global_status_select_range_check: The number of joins without keys that check for key usage after each row. If this is not 0, you should carefully check the indexes of your tables.(Counter)mysql_global_status_select_scan: The number of joins that did a full scan of the first table.(Counter)mysql_global_status_sort_rows: The number of sorted rows.(Counter)mysql_global_status_sort_range: The number of sorts that were done using ranges.(Counter)mysql_global_status_sort_merge_passes: The number of merge passes that the sort algorithm has had to do. If this value is large, you should consider increasing the value of the sort_buffer_size system variable.(Counter)mysql_global_status_sort_scan: The number of sorts that were done by scanning the table.(Counter)mysql_global_status_slow_queries: The number of queries that have taken more than long_query_time seconds. This counter increments regardless of whether the slow query log is enabled.(Counter)mysql_global_status_aborted_connects: The number of failed attempts to connect to the MySQL server.(Counter)mysql_global_status_aborted_clients: The number of connections that were aborted because the client died without closing the connection properly.(Counter)mysql_global_status_table_locks_immediate: The number of times that a request for a table lock could be granted immediately. Locks Immediate rising and falling is normal activity.(Counter)mysql_global_status_table_locks_waited: The number of times that a request for a table lock could not be granted immediately and a wait was needed. If this is high and you have performance problems, you should first optimize your queries, and then either split your table or tables or use replication.(Counter)mysql_global_status_bytes_received: The number of bytes received from all clients.(Counter)mysql_global_status_bytes_sent: The number of bytes sent to all clients.(Counter)mysql_global_status_innodb_page_size: InnoDB page size (default 16KB). Many values are counted in pages; the page size enables them to be easily converted to bytes.(Gauge)mysql_global_status_buffer_pool_pages: The number of pages in the InnoDB buffer pool.(Gauge)mysql_global_status_commands_total: The number of times each xxx statement has been executed.(Counter)mysql_global_status_handlers_total: Handler statistics are internal statistics on how MySQL is selecting, updating, inserting, and modifying rows, tables, and indexes. This is in fact the layer between the Storage Engine and MySQL.(Counter)mysql_global_status_opened_files: The number of files that have been opened with my_open() (a mysys library function). Parts of the server that open files without using this function do not increment the count.(Counter)mysql_global_status_open_tables: The number of tables that are open.(Gauge)mysql_global_status_opened_tables: The number of tables that have been opened. If Opened_tables is big, your table_open_cache value is probably too small.(Counter)mysql_global_status_table_open_cache_hits: The number of hits for open tables cache lookups.(Counter)mysql_global_status_table_open_cache_misses: The number of misses for open tables cache lookups.(Counter)mysql_global_status_table_open_cache_overflows: The number of overflows for the open tables cache.(Counter)mysql_global_status_innodb_num_open_files: The number of files InnoDB currently holds open.(Gauge)mysql_global_status_connection_errors_total: These variables provide information about errors that occur during the client connection process.(Counter)mysql_global_status_innodb_buffer_pool_read_requests: The number of logical read requests.(Counter)mysql_global_status_innodb_buffer_pool_reads: The number of logical reads that InnoDB could not satisfy from the buffer pool, and had to read directly from disk.(Counter)

mysql_global_variables_thread_cache_size: How many threads the server should cache for reuse.(Gauge)mysql_global_variables_max_connections: The maximum permitted number of simultaneous client connections.(Gauge)mysql_global_variables_innodb_buffer_pool_size: The size in bytes of the buffer pool, the memory area where InnoDB caches table and index data. The default value is 134217728 bytes (128MB).(Gauge)mysql_global_variables_innodb_log_buffer_size: The size in bytes of the buffer that InnoDB uses to write to the log files on disk.(Gauge)mysql_global_variables_key_buffer_size: Index blocks for MyISAM tables are buffered and are shared by all threads.(Gauge)mysql_global_variables_query_cache_size: The amount of memory allocated for caching query results.(Gauge)mysql_global_variables_table_open_cache: The number of open tables for all threads.(Gauge)mysql_global_variables_open_files_limit: The number of file descriptors available to mysqld from the operating system.(Gauge)


另外就是监控大盘,大家可以先使用 grafana 看图,夜莺的图表类型当前只支持折线图,正在开发更多种类的图表,完事我们会整理各种常用的监控大盘给大家使用。


我们希望和您一起,把监控告警这个事,做到极致,把夜莺发展为国内最好用的监控系统!如果觉得有用,请点赞、收藏、分享一键三连哦~


参考资料:

  • https://dev.mysql.com/doc/refman/5.7/en/server-status-variables.html

  • https://sysdig.com/blog/mysql-monitoring/