
vlambda
2022-05-15

Data Warehouse: A First Look at CK Monitoring

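After a week of wrangling, our CK (ClickHouse) cluster is finally running in production. For the developers, the job ends once the code is committed and deployed; for us in operations, the next task is CK monitoring.

I started with the monitoring section of the official documentation, which is fairly brief. Monitoring falls into two areas. The first is resource usage on the servers that host CK, which you have to set up yourself. The second is monitoring of the ClickHouse server itself, for which CK already collects the data: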

Different metrics of how the server uses computational resources.
Common statistics on query processing.

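These are stored in tables in CK's own system database and can easily be exposed over HTTP for Prometheus to scrape. The official site does not recommend any other visualization tool either.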
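To see what these system tables contain before wiring up Prometheus, you can query them over ClickHouse's HTTP interface. The following is a minimal sketch, not part of the original setup: it assumes the HTTP interface listens on the default port 8123 without authentication, and the helper names (`build_query_url`, `parse_tsv_pairs`, `fetch_system_metrics`) are made up for illustration.

```python
from typing import Dict
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(host: str, sql: str, port: int = 8123) -> str:
    """Build a GET URL for the ClickHouse HTTP interface (port 8123 is an assumption)."""
    return f"http://{host}:{port}/?{urlencode({'query': sql})}"


def parse_tsv_pairs(body: str) -> Dict[str, float]:
    """Parse a two-column TabSeparated response (name, value) into a dict."""
    result: Dict[str, float] = {}
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split("\t", 1)
        result[name] = float(value)
    return result


def fetch_system_metrics(host: str, port: int = 8123) -> Dict[str, float]:
    """Read the current values from system.metrics on one server."""
    sql = "SELECT metric, value FROM system.metrics FORMAT TabSeparated"
    with urlopen(build_query_url(host, sql, port), timeout=5) as resp:
        return parse_tsv_pairs(resp.read().decode("utf-8"))
```

Calling `fetch_system_metrics("10.72.83.168")` against a reachable server should return a dictionary of metric names and values. The same pattern should work for system.events and system.asynchronous_metrics if the column names in the query are adjusted.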

CK + Prometheus

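Most of the articles I found online about CK monitoring follow the approach described in 《一文读懂clickhouse集群监控》 ("Understanding ClickHouse Cluster Monitoring in One Article"): clickhouse-exporter + prometheus + grafana.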

( https://zhuanlan.zhihu.com/p/353594919 )

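With newer CK versions, one thing to watch out for in this approach is the clickhouse-exporter component:

clickhouse-exporter is an open-source component for collecting ClickHouse metrics (https://github.com/ClickHouse/clickhouse_exporter). It periodically queries the system tables in clickhouse-server, converts the results into monitoring metrics, and exposes them to Prometheus through an HTTP interface.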

https://zhuanlan.zhihu.com/p/353594919

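This component is actually no longer needed with CK 22.4.5.9, the version we installed. The article's author also mentions this in a reply to the comments; presumably CK had not yet developed this capability when the article was written.

CK 22.4.5.9 can expose its monitoring metrics directly through an HTTP endpoint, so there is no need for clickhouse_exporter to query the system tables and re-expose the results. Just enable the "Serve endpoint for Prometheus monitoring" section in config.xml: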

    <!-- Serve endpoint for Prometheus monitoring. -->
    <!--
        endpoint - metrics path (relative to root, starting with "/")
        port - port to setup server. If not defined or 0 then http_port used
        metrics - send data from table system.metrics
        events - send data from table system.events
        asynchronous_metrics - send data from table system.asynchronous_metrics
        status_info - send data from different component from CH, ex: Dictionaries status
    -->
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>

        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
        <status_info>true</status_info>
    </prometheus>
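After restarting the server with this section enabled, it is worth checking that the endpoint actually responds before pointing Prometheus at it. Below is a rough sketch that fetches /metrics on port 9363 and parses the Prometheus text format into a dictionary. The function names are hypothetical, the parser handles only simple sample lines, and the metric names in the sample text are illustrative examples rather than output captured from our cluster.

```python
from typing import Dict
from urllib.request import urlopen


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP / # TYPE) are skipped. A series with labels keeps
    its label set as part of the key. An optional trailing timestamp is ignored.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled series: everything up to the last '}' is the key.
            end = line.rfind("}") + 1
            name, rest = line[:end], line[end:]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        samples[name] = float(fields[0])
    return samples


def fetch_metrics(host: str, port: int = 9363, path: str = "/metrics") -> Dict[str, float]:
    """Fetch and parse the endpoint configured in the <prometheus> section."""
    with urlopen(f"http://{host}:{port}{path}", timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Illustrative sample only; real output contains many more series.
SAMPLE = """\
# TYPE ClickHouseMetrics_Query gauge
ClickHouseMetrics_Query 2
# TYPE ClickHouseProfileEvents_Query counter
ClickHouseProfileEvents_Query 1024
ClickHouseAsyncMetrics_Uptime 3600
"""

if __name__ == "__main__":
    for series, value in parse_prometheus_text(SAMPLE).items():
        print(series, value)
```

Running `fetch_metrics("10.72.83.168")` from a host that can reach the server gives a quick sanity check that the port is open and that metrics, events, and asynchronous metrics are all being exported.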

Then add the following to scrape_configs in the Prometheus configuration. There is no need to build and install clickhouse-exporter separately.

scrape_configs:
  - job_name: 'clickhouse'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['10.72.83.168:9363', '10.72.83.169:9363']

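Likewise, some of the ready-made CK Grafana dashboards available online were built against the data from clickhouse-exporter. Once you switch to CK's own endpoint, some of the metrics no longer match up, which is another pitfall.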
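One way to find the mismatches up front is to compare the metric names a dashboard queries with the names the endpoint actually exposes. The sketch below assumes you have already collected the dashboard's metric names by hand and saved the text returned by /metrics. The function names and the two example metric names are hypothetical, chosen only to show the idea.

```python
import re
from typing import Iterable, List, Set

# A Prometheus metric name at the start of a sample line.
_METRIC_NAME = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)")


def exposed_metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names (without labels) from Prometheus text output."""
    names: Set[str] = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _METRIC_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_from_endpoint(dashboard_metrics: Iterable[str], exposition_text: str) -> List[str]:
    """Return dashboard metric names that the endpoint does not expose."""
    return sorted(set(dashboard_metrics) - exposed_metric_names(exposition_text))


if __name__ == "__main__":
    # Both names below are made-up examples, not taken from a real dashboard.
    dashboard = ["clickhouse_query", "ClickHouseMetrics_Query"]
    endpoint_text = "ClickHouseMetrics_Query 2\n"
    print(missing_from_endpoint(dashboard, endpoint_text))
```

Any name in the returned list points to a panel whose query needs to be rewritten against the names CK exposes natively.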

CKMAN

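The other option is CKman, the open-source CK management and monitoring tool developed in China that I mentioned in an earlier post. It works on a similar principle: it connects directly to Prometheus. If you watch its video on Bilibili, you will see that its Prometheus configuration scrapes CK's HTTP endpoint on port 9363 directly, without clickhouse-exporter.

CKman's monitoring is friendly to newcomers: it comes with basic cluster monitoring already integrated, as well as ZK (ZooKeeper) monitoring. Compared with Grafana, however, you cannot customize which metrics are shown, so it is less flexible.

CK is still a relatively new OLAP system, and I have not yet found a particularly mature monitoring and management solution online. Personally I would still recommend giving CKman a try. Bear in mind, though, that CK is updated quickly, so while using such tools you still need to understand the underlying mechanics; otherwise a CK upgrade can easily cause problems. Articles online also need to be read carefully and critically to avoid compatibility problems caused by version differences.

Setting up CK monitoring is not complicated: it is essentially built on the endpoint that CK exposes, with Prometheus pulling the metrics from it. From a quick look, a great many metrics are provided; which ones deserve close attention is something we still need to work out gradually in production.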