

Monitoring Fundamentals

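This chapter lays the foundation for several key concepts that will be used throughout this book. Starting with the definition of monitoring, we will explore the perspectives and factors that explain why systems analysis carries different degrees of importance and affects organizations in different ways. You will learn about the advantages and disadvantages of different monitoring mechanisms and take a closer look at the Prometheus approach to collecting metrics. Finally, we will discuss some controversial decisions that are central to the design and architecture of the Prometheus stack, and why you should take them into account when designing your own monitoring system.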


  • Definition of monitoring
  • Whitebox versus blackbox monitoring
  • Understanding metrics collection

Definition of monitoring



The value of monitoring





Organizational contexts

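Looking at the organizational context, roles such as system administrators, quality assurance engineers, site reliability engineers (SREs), and product owners have different expectations of monitoring. Understanding the requirements that each role brings to the surface makes it easier to see why context is so useful when discussing monitoring. Let's expand on these statements with some examples: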

  • System administrators are interested in high-resolution, low-latency, and high-diversity data. For a system administrator, the main objective of monitoring is to obtain visibility across the infrastructure and manage data from CPU usage to Hypertext Transfer Protocol (HTTP) request rate so that problems are quickly discovered and the root causes are identified as soon as possible. In this approach, exposing monitoring data in high resolution is critical to be able to drill down into the affected system. If a problem is occurring, you don't have the luxury of waiting several hours for your next data point, and so data has to be provided in near real time or, in other words, with low latency. Lastly, since there is no easy way to identify or predict which systems are prone to be affected, we need to collect as much data as possible from all systems; namely, a high diversity of data.
  • Quality assurance engineers are interested in high-resolution, high-latency, and high-diversity data. High-resolution monitoring data is important for quality assurance engineers because it enables a deeper drill down into effects, but latency is not as critical for them as it is for system administrators. In this case, historical data is much more critical for comparing software releases than the freshness of the data. Since we can't wholly predict the ramifications of a new release, the available data needs to be spread across as much of the infrastructure as possible, touching every system that the software release might use, invoke, or otherwise interact with (directly or indirectly), so that we have as much data as possible.
  • SREs focused on capacity planning are interested in low-resolution, high-latency, and high-diversity data. In this scenario, historical data carries much more importance for SREs than the resolution that this data is presented in. For example, to predict the increase in infrastructure, it is not critical for an SRE to know that some months ago at 4 A.M., one of the nodes had a spike of CPU usage reaching 100% in 10 seconds, but it is useful to understand the trend of the load across the fleet of nodes to infer the number of nodes required to handle new scale requirements. As such, it is also important for SREs to have a broad view of all the different parts of the infrastructure that are affected by those requirements to predict, for example, the amount of storage for logs, network bandwidth increase, and so on, making the high diversity of monitoring data mandatory.
  • Product owners are interested in low-resolution, high-latency, and low-diversity data. Where product owners are concerned, monitoring data usually steps away from infrastructure to the realm of business. Product owners strive to understand the trends of specific software products, where historical data is fundamental and resolution is not so critical. Keeping in mind the objective of evaluating the impact of software releases on the customers, latency is not as essential for them as it is for system administrators. The product owner manages a specific set of products, so a low diversity of monitoring data is expected, consisting mostly of business metrics.





















Monitoring components


  • Metrics: This exposes a certain system resource, application action, or business characteristic as a value at a specific point in time. This information is obtained in an aggregated form; for example, you can find out how many requests per second were served but not the exact time for a specific request, and without context, you won't know the ID of the requests.
  • Logging: Containing much more data than a metric, this manifests itself as an event from a system or application, containing all the information that's produced by such an event. This information is not aggregated and has the full context.
  • Tracing: This is a special case of logging where a request is given a unique identifier so that it can be tracked during its entire life cycle across every system. Due to the increase of the dataset with the number of requests, it is a good idea to use samples instead of tracking all requests.
  • Alerting: This is the continuous validation of metrics or logs against thresholds, firing an action or notification when a threshold is crossed.
  • Visualization: This is a graphical representation of metrics, logs, or traces.
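
To make the alerting component concrete, here is a minimal sketch of threshold validation over a series of metric samples. The function name `check_threshold`, the sample values, and the 90% threshold are all illustrative assumptions; a real alerting system would typically also deduplicate and route notifications.

```python
from typing import Callable, Iterable, List


def check_threshold(
    samples: Iterable[float],
    threshold: float,
    notify: Callable[[str], None],
) -> int:
    """Call notify() for every sample above the threshold; return how many fired."""
    fired = 0
    for index, value in enumerate(samples):
        if value > threshold:
            notify(f"sample {index}: {value} exceeds threshold {threshold}")
            fired += 1
    return fired


if __name__ == "__main__":
    # Made-up CPU usage samples, in percent.
    cpu_usage_percent = [42.0, 55.5, 91.2, 97.8, 60.1]
    messages: List[str] = []
    check_threshold(cpu_usage_percent, threshold=90.0, notify=messages.append)
    for message in messages:
        print(message)
```

Passing the notification action in as a callable keeps the validation logic separate from how the alert is delivered.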

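More recently, the term monitoring has been subsumed by a superset called observability, which is regarded either as the evolution of monitoring or as different wrapping meant to hype and revitalize the concept (in the same way as happened with DevOps). As it stands, observability does indeed include all the components that we've described here.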



Whitebox versus blackbox monitoring



  • Does the host respond to Internet Control Message Protocol (ICMP) echo requests (more commonly known as ping)?
  • Is a given TCP port open?
  • Does the application respond with the correct data and status code when it receives a specific HTTP request?
  • Is the process for a specific application running in its host?
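
As an illustration of a check made from outside the application, the following sketch answers the "is a given TCP port open?" question using only Python's standard library. The function name `tcp_port_open`, the example host and port, and the two-second timeout are assumptions made for this example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # The probe only completes a TCP handshake; it knows nothing about
        # what the application behind the port is doing internally.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical target; replace with a host and port you are allowed to probe.
    print(tcp_port_open("127.0.0.1", 8080))
```

A result of True here says only that something accepted the connection, which is the kind of limited, externally visible answer these checks provide.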


  • Exported through logging: This is by far the most common case and how applications exposed their inner workings before instrumentation libraries were widespread. For instance, an HTTP server's access log can be processed to monitor request rates, latencies, and error percentages.
  • Emitted as structured events: This approach is similar to logging but instead of being written to disk, the data is sent directly to processing systems for analysis and aggregation.
  • Maintained in memory as aggregates: Data in this format can be hosted in an endpoint or read directly from command-line tools. Examples of this approach are /metrics with Prometheus metrics, HAProxy's stats page, or the varnishstat command-line tool.
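
The last option can be sketched as follows: the application keeps only running totals in memory and renders them on demand in the Prometheus text exposition format, which is what a /metrics endpoint would return. The metric name `http_requests_total`, its labels, and the helper functions are hypothetical, and serving the text over HTTP is left out to keep the sketch short.

```python
from collections import Counter

# In-memory aggregate: number of requests seen per (method, status code).
# Individual requests are not stored, only the running totals.
request_counts: Counter = Counter()


def observe_request(method: str, status: int) -> None:
    """Record one handled request by incrementing its counter."""
    request_counts[(method, str(status))] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(request_counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    observe_request("GET", 200)
    observe_request("GET", 200)
    observe_request("POST", 500)
    print(render_metrics(), end="")
```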



Understanding metrics collection

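The process by which a monitoring system collects metrics can generally be divided into two approaches: push and pull. As we'll see in the following topics, both approaches are valid and have their own pros and cons, which we will discuss thoroughly. Nevertheless, a solid grasp of how they differ is essential to understanding and making full use of Prometheus. After understanding how metrics collection works, we will look into what should be collected. There are several proven methods for this, and we will give an overview of each one.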

An overview of the two collection approaches


Figure 1.1: Push-based monitoring system

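Systems that handle raw event data generally prefer push, since events are generated at a very high rate (hundreds, thousands, or even tens of thousands per second, per instance), which would make polling the data impractical and complicated. Some sort of buffering mechanism would be needed to hold the events generated between polls, and event freshness would still be a problem compared with simply pushing the data. Some examples that use this approach include Riemann, StatsD, and the Elasticsearch, Logstash, and Kibana (ELK) stack.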

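That is not to say that only these types of systems use push. Some monitoring systems, such as Graphite, OpenTSDB, and the Telegraf, InfluxDB, Chronograf, and Kapacitor (TICK) stack, have been designed around this approach. Even good old Nagios supports push through the Nagios Service Check Acceptor (NSCA), commonly known as passive checks.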

Figure 1.2: Pull-based monitoring system

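In contrast, pull-based monitoring systems collect metrics directly from applications or from proxy processes that make those metrics available to the system. Some well-known monitoring software that uses pull includes Nagios and Nagios-style systems (Icinga, Zabbix, Zenoss, and Sensu, to name a few). Prometheus is also one of those that embraces the pull approach, and it is very opinionated about this.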
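
The difference between the two approaches comes down to who initiates the transfer. The following in-memory sketch shows both directions without any networking; the `Collector` and `Application` classes and their method names are hypothetical stand-ins, not the API of any of the systems mentioned.

```python
from typing import Dict, List


class Collector:
    """Stand-in for a monitoring system that keeps the latest value per metric."""

    def __init__(self) -> None:
        self.latest: Dict[str, float] = {}

    def receive(self, name: str, value: float) -> None:
        # Push: the monitored application decides when data arrives.
        self.latest[name] = value

    def scrape(self, targets: List["Application"]) -> None:
        # Pull: the collector decides when to ask each target for its metrics.
        for target in targets:
            self.latest.update(target.metrics())


class Application:
    """Stand-in for a monitored process that counts the requests it handles."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.requests = 0

    def handle_request(self) -> None:
        self.requests += 1

    def metrics(self) -> Dict[str, float]:
        """Current metric values, as a pull-based collector would read them."""
        return {f"{self.name}_requests_total": float(self.requests)}

    def push_to(self, collector: Collector) -> None:
        """Send current values to the collector on the application's own schedule."""
        for metric, value in self.metrics().items():
            collector.receive(metric, value)


if __name__ == "__main__":
    collector = Collector()
    web, api = Application("web"), Application("api")
    web.handle_request()
    api.handle_request()
    web.push_to(collector)        # push: initiated by the application
    collector.scrape([web, api])  # pull: initiated by the collector
    print(collector.latest)
```

Note that in the pull case the collector must know its list of targets, whereas in the push case each application must know where the collector is.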

Push versus pull






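Even though Prometheus is a pull-based monitoring system, it also provides a way of ingesting pushed metrics by using a gateway that converts from push to pull. This is useful for monitoring a very narrow class of processes, which we will see later in this book.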

What to measure


Google's four golden signals


  • Latency: The time required to serve a request
  • Traffic: The number of requests being made
  • Errors: The rate of failing requests
  • Saturation: The amount of work not being processed, which is usually queued
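
As a rough illustration, the four signals can be derived from the requests observed during a time window plus the current queue depth. The `Request` record, the `golden_signals` function, and the choice of mean latency and an error ratio are assumptions for this sketch; production systems usually prefer latency percentiles over a mean.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_seconds: float  # time taken to serve the request
    failed: bool             # whether the request ended in an error


def golden_signals(
    requests: List[Request], window_seconds: float, queued: int
) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    total = len(requests)
    failures = sum(1 for request in requests if request.failed)
    total_duration = sum(request.duration_seconds for request in requests)
    return {
        "latency_mean_seconds": total_duration / total if total else 0.0,
        "traffic_per_second": total / window_seconds,
        "error_ratio": failures / total if total else 0.0,
        "saturation_queued": float(queued),
    }


if __name__ == "__main__":
    window = [Request(0.1, False), Request(0.3, True)]
    print(golden_signals(window, window_seconds=2.0, queued=5))
```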

Brendan Gregg's USE method

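Brendan's method is more machine-focused and states that, for each resource (CPU, disk, network interface, and so on), the following metrics should be monitored: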

  • Utilization: Measured as the percentage of the resource that was busy
  • Saturation: The amount of work the resource was not able to process, which is usually queued
  • Errors: The number of errors that occurred
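
A small sketch of how these three values might be summarized for a single resource over one observation interval follows. The `ResourceSample` fields and the `use_summary` function are hypothetical; how busy time, queue length, and error counts are actually obtained depends on the resource and the operating system.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResourceSample:
    busy_seconds: float      # time the resource spent doing work in the interval
    interval_seconds: float  # length of the observation interval
    queued_work: int         # work items waiting because the resource was busy
    error_count: int         # errors reported by the resource in the interval


def use_summary(sample: ResourceSample) -> Dict[str, float]:
    """Compute utilization, saturation, and errors for one resource."""
    if sample.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    utilization = 100.0 * sample.busy_seconds / sample.interval_seconds
    return {
        "utilization_percent": utilization,
        "saturation": float(sample.queued_work),
        "errors": float(sample.error_count),
    }


if __name__ == "__main__":
    # Made-up numbers: a disk that was busy for 45 of the last 60 seconds.
    disk = ResourceSample(
        busy_seconds=45.0, interval_seconds=60.0, queued_work=3, error_count=1
    )
    print(use_summary(disk))
```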

Tom Wilkie's RED method

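The RED method is focused more on a service-level approach than on the underlying system itself. This strategy is clearly useful for monitoring services, and it is also valuable for predicting the experience of external customers. If the error rate of a service increases, it is reasonable to assume that those errors will, directly or indirectly, affect the customer's experience. These are the metrics to be aware of: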

  • Rate: Translated as requests per second
  • Errors: The number of failing requests per second
  • Duration: The time taken by those requests
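
When a service exposes only ever-increasing counters, the three RED values can be estimated from the difference between two readings. The sketch below assumes a hypothetical `CounterSnapshot` holding cumulative totals; it reports mean duration, which is a simplification, since a distribution of durations is usually more informative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CounterSnapshot:
    timestamp: float               # when the counters were read, in seconds
    requests_total: int            # cumulative requests served
    errors_total: int              # cumulative failed requests
    duration_seconds_total: float  # cumulative time spent serving requests


def red_metrics(previous: CounterSnapshot, current: CounterSnapshot) -> Dict[str, float]:
    """Derive rate, errors, and duration from two cumulative counter readings."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("current snapshot must be newer than the previous one")
    requests = current.requests_total - previous.requests_total
    errors = current.errors_total - previous.errors_total
    duration = current.duration_seconds_total - previous.duration_seconds_total
    return {
        "rate_per_second": requests / elapsed,
        "errors_per_second": errors / elapsed,
        "mean_duration_seconds": duration / requests if requests else 0.0,
    }


if __name__ == "__main__":
    # Made-up readings taken 60 seconds apart.
    before = CounterSnapshot(0.0, 100, 5, 20.0)
    after = CounterSnapshot(60.0, 700, 35, 140.0)
    print(red_metrics(before, after))
```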



In the next chapter, we'll take a look at an overview of the Prometheus ecosystem and talk about several of its components.


  1. Why is monitoring so hard to define clearly?
  2. Does a high latency of metrics impact the work of a system administrator who's focused on fixing a live incident?
  3. What are the monitoring requirements to properly do capacity planning?
  4. Is logging considered monitoring?
  5. Regarding the available strategies for metrics collection, what are the downsides of using the push-based approach?
  6. If you had to choose three basic metrics from a generic web service to focus on, which would they be?
  7. When a check verifies whether a given process is running on a host by way of listing the running processes in said host, is that whitebox or blackbox monitoring?