
Reading notes on "Hands-On Infrastructure Monitoring with Prometheus": Monitoring Fundamentals

Monitoring Fundamentals

This chapter lays the groundwork for several key concepts that will be used throughout the book. Starting with the definition of monitoring, we will explore the various perspectives and factors that underline why system analysis has different degrees of importance and impact across organizations. You will learn the advantages and disadvantages of different monitoring mechanisms and take a detailed look at how Prometheus approaches metrics collection. Finally, we will discuss some controversial decisions that are fundamental to the design and architecture of the Prometheus stack, and why they are worth considering when designing your own monitoring system.

We will cover the following topics in this chapter:

  • Definition of monitoring
  • Whitebox versus blackbox monitoring
  • Understanding metrics collection

Definition of monitoring

It is hard to agree on a definition of monitoring, as it shifts quickly between industries and even between the contexts of specific jobs. The diversity of viewpoints, the components that make up a monitoring system, and even the way data is collected or used are all factors that make a clear-cut definition hard to reach.

Without common ground, it is hard to sustain a discussion, and more often than not expectations end up mismatched. Therefore, in the following topics we will outline a baseline definition of monitoring that will guide us throughout this book.

The value of monitoring

With infrastructure becoming increasingly complex and the adoption of microservice-oriented architectures growing exponentially, obtaining a global view of all the different components of the infrastructure becomes essential. Manually checking the health of every instance, caching service, database, or load balancer is unthinkable. There are simply too many moving parts to count, let alone keep a close eye on.

Nowadays, monitoring is expected to keep track of data from all of these components. However, that data can take many forms and serve different purposes.

Alerting is one of the standard uses for monitoring data, but such data can be applied well beyond that. You might need historical information to help with capacity planning or incident investigation, higher resolution to drill down into a problem, or even greater freshness to reduce the mean time to recovery during an outage.

You can think of monitoring as the source of information for keeping systems healthy, both in production terms and in business terms.

Organizational contexts

Looking across an organization, roles such as system administrators, quality assurance engineers, site reliability engineers (SREs), or product owners have different expectations of monitoring. Understanding the requirements each role surfaces makes it easier to see why context matters so much when monitoring is being discussed. Let's expand on the following statements while providing some examples:

  • System administrators are interested in high-resolution, low-latency, and high-diversity data. For a system administrator, the main objective of monitoring is to obtain visibility across the infrastructure and manage data from CPU usage to Hypertext Transfer Protocol (HTTP) request rate so that problems are quickly discovered and the root causes are identified as soon as possible. In this approach, exposing monitoring data in high resolution is critical to be able to drill down into the affected system. If a problem is occurring, you don't have the privilege to wait several hours for your next data point, and so data has to be provided in near real time or, in other words, with low latency. Lastly, since there is no easy way to identify or predict which systems are prone to be affected, we need to collect as much data as possible from all systems; namely, a high diversity of data.
  • Quality assurance engineers are interested in high-resolution, high-latency, and high-diversity data. While it is still important for quality assurance engineers to collect high-resolution monitoring data, which enables a deeper drill down into effects, latency is not as critical as it is for system administrators. In this case, historical data is much more critical for comparing software releases than the freshness of the data. Since we can't wholly predict the ramifications of a new release, the available data needs to be spread across as much of the infrastructure as possible, touching every system the software release might use, invoke, or generally interact with (directly or indirectly), so that we have as much data as possible.
  • SREs focused on capacity planning are interested in low-resolution, high-latency, and high-diversity data. In this scenario, historical data carries much more importance for SREs than the resolution that this data is presented in. For example, to predict infrastructure growth, it is not critical for an SRE to know that some months ago at 4 A.M., one of the nodes had a spike of CPU usage reaching 100% in 10 seconds, but it is useful to understand the trend of the load across the fleet of nodes to infer the number of nodes required to handle new scale requirements. As such, it is also important for SREs to have a broad visualization of all the different parts of the infrastructure that are affected by those requirements to predict, for example, the amount of storage for logs, network bandwidth increase, and so on, making the high diversity of monitoring data mandatory.
  • Product owners are interested in low-resolution, high-latency, and low-diversity data. Where product owners are concerned, monitoring data usually steps away from infrastructure to the realm of business. Product owners strive to understand the trends of specific software products, where historical data is fundamental and resolution is not so critical. Keeping in mind the objective of evaluating the impact of software releases on the customers, latency is not as essential for them as it is for system administrators. The product owner manages a specific set of products, so a low diversity of monitoring data is expected, comprised mostly of business metrics.

The following table sums up the previous examples in a more concise form:

|                         | Data resolution | Data latency | Data diversity |
|-------------------------|-----------------|--------------|----------------|
| Infrastructure alerting | High            | Low          | High           |
| Software release view   | High            | High         | High           |
| Capacity planning       | Low             | High         | High           |
| Product/business view   | Low             | High         | Low            |

Monitoring components

In the same way that the definition of monitoring changes across contexts, so do its components. Depending on the scope you are aiming for, you can find some or all of the components described in the following topics (a small sketch contrasting them follows the list):

  • Metrics: This exposes a certain system resource, application action, or business characteristic as a specific point in time value. This information is obtained in an aggregated form; for example, you can find out how many requests per second were served but not the exact time for a specific request, and without context, you won't know the ID of the requests.
  • Logging: Containing much more data than a metric, this manifests itself as an event from a system or application, containing all the information that's produced by such an event. This information is not aggregated and has the full context.
  • Tracing: This is a special case of logging where a request is given a unique identifier so that it can be tracked during its entire life cycle across every system. Due to the increase of the dataset with the number of requests, it is a good idea to use samples instead of tracking all requests.
  • Alerting: This is the continuous threshold validation of metrics or logs, and fires an action or notification in the case of a transgression of the said threshold.
  • Visualization: This is a graphical representation of metrics, logs, or traces.
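
To make the distinction between the first three components concrete, here is a minimal Python sketch (a hypothetical shop service; names are illustrative) that represents the same request three ways: as an aggregated metric, as a structured log event with full context, and with the kind of unique request ID that tracing relies on:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shop")

# A metric: one aggregated number, no per-request context survives.
http_requests_total = 0

def handle_request(path: str) -> None:
    global http_requests_total
    request_id = str(uuid.uuid4())   # tracing: unique ID that follows the request around
    start = time.time()

    http_requests_total += 1         # metric: only the aggregate count is kept

    # logging / structured event: full context for this single request
    log.info(json.dumps({
        "event": "request_served",
        "request_id": request_id,
        "path": path,
        "duration_s": round(time.time() - start, 6),
    }))

handle_request("/checkout")
print("metric http_requests_total =", http_requests_total)
```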

More recently, the term monitoring has been folded into a superset called observability, seen either as the evolution of monitoring or as a different packaging to hype and revitalize the concept (much like what happened with DevOps). As things stand, observability does encompass all the components we describe here.

Throughout this book, our definition of monitoring encompasses metrics, alerting, and visualization.

Monitoring is metrics with associated alerting and visualization.

Whitebox versus blackbox monitoring

There are several ways to approach monitoring, but they mainly fall into two big categories: blackbox and whitebox monitoring.

In blackbox monitoring, the application or host is observed from the outside and, consequently, this approach can be fairly limited. Checks are made to assess whether the system under observation responds to probes in a known way (a minimal probing sketch follows the list below):

  • Does the host respond to Internet Control Message Protocol (ICMP) echo requests (more commonly known as ping)?
  • Is a given TCP port open?
  • Does the application respond with the correct data and status code when it receives a specific HTTP request?
  • Is the process for a specific application running in its host?
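
As a rough illustration of such probes, the following Python sketch (standard library only; the target host and URL are assumptions) checks whether a TCP port accepts connections and whether an HTTP endpoint answers with the expected status code:

```python
import socket
import urllib.request

def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Blackbox check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_check(url: str, expected_status: int = 200, timeout: float = 2.0) -> bool:
    """Blackbox check: does the application answer with the expected status code?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expected_status
    except OSError:
        return False

if __name__ == "__main__":
    print("TCP 443 open:", tcp_port_open("example.com", 443))
    print("HTTP 200 OK :", http_check("https://example.com/"))
```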

In whitebox monitoring, on the other hand, the system under observation surfaces data about its internal state and the performance of critical sections. This kind of introspection can be very powerful, as it exposes operating telemetry, and therefore the health, of internal components that would otherwise be difficult or even impossible to ascertain. This telemetry data is usually handled in the following ways (an instrumentation sketch follows the list):

  • Exported through logging: This is by far the most common case and how applications exposed their inner workings before instrumentation libraries were widespread. For instance, an HTTP server's access log can be processed to monitor request rates, latencies, and error percentages.
  • Emitted as structured events: This approach is similar to logging but instead of being written to disk, the data is sent directly to processing systems for analysis and aggregation.
  • Maintained in memory as aggregates: Data in this format can be hosted in an endpoint or read directly from command-line tools. Examples of this approach are /metrics with Prometheus metrics, HAProxy's stats page, or the varnishstats command-line tool.
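
The third case, aggregates kept in memory and exposed at an endpoint, is exactly what Prometheus instrumentation libraries provide. The following is a minimal sketch using the Python prometheus_client library (metric names and port are illustrative, not from the book); it keeps two aggregates in memory and serves them at /metrics:

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["path"])
IN_FLIGHT = Gauge("app_in_flight_requests", "Requests currently being handled")

def handle_request(path: str) -> None:
    with IN_FLIGHT.track_inprogress():      # in-memory aggregate, updated on every call
        REQUESTS.labels(path=path).inc()
        time.sleep(random.uniform(0.01, 0.1))

if __name__ == "__main__":
    start_http_server(8000)                 # exposes the aggregates at http://localhost:8000/metrics
    while True:
        handle_request(random.choice(["/", "/login", "/search"]))
```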

Not all software is instrumented and ready to expose its internal state for metrics collection. It might, for example, be a third-party closed-source application that does not surface its inner workings. In those cases, external probing can be a viable option to collect the data deemed essential for proper service state validation.

In any case, third-party applications are not the only ones that benefit from blackbox monitoring. It can be useful to validate your application from the customer's point of view, for example through load balancers and firewalls. Probing can act as your last line of defense: if everything else fails, you can rely on blackbox monitoring to assess availability.

Understanding metrics collection

The way monitoring systems gather metrics can generally be split into two approaches: push and pull. As we'll see in the following topics, both approaches are valid and each has its pros and cons, which we will discuss in depth. Nevertheless, a firm grasp of how they differ is essential to understand and take full advantage of Prometheus. After learning how metrics collection works, we will dive into what should be collected. There are several battle-tested methodologies to accomplish this, and we will give an overview of each.

An overview of the two collection approaches

In a push-based monitoring system, the emitted metrics or events are sent either directly from the producing application or from a local agent to the collecting service, as follows:

Figure 1.1: Push-based monitoring system

Systems that deal with raw event data usually prefer push, since events are generated at a very high frequency (hundreds, thousands, or even tens of thousands of times per second per instance), which would make polling the data impractical and complex: some buffering mechanism would be needed to hold the events generated between polls, and event freshness would still be a problem compared with simply pushing the data. Some examples that use this approach are Riemann, StatsD, and the Elasticsearch, Logstash, and Kibana (ELK) stack.

That is not to say that only these kinds of systems use push. Some monitoring systems, such as Graphite, OpenTSDB, and the Telegraf, InfluxDB, Chronograf, and Kapacitor (TICK) stack, were designed around this approach. Even good old Nagios supports push via the Nagios Service Check Acceptor (NSCA), commonly known as passive checks.
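
To illustrate how lightweight the push side can be, the following Python sketch emits StatsD-style metrics: each sample is a small UDP datagram in the `name:value|type` line format, sent to a local agent (the agent address and metric names are assumptions):

```python
import socket
import time

STATSD_ADDR = ("127.0.0.1", 8125)   # assumed local StatsD/Telegraf agent
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def push_counter(name: str, value: int = 1) -> None:
    # StatsD line protocol: "<name>:<value>|c" for a counter increment
    sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

def push_timing(name: str, millis: float) -> None:
    # "|ms" marks a timing sample
    sock.sendto(f"{name}:{millis}|ms".encode(), STATSD_ADDR)

start = time.time()
# ... do some work ...
push_counter("checkout.requests")
push_timing("checkout.duration", (time.time() - start) * 1000)
```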

Figure 1.2: Pull-based monitoring system

By contrast, a pull-based monitoring system (shown in the preceding diagram) collects metrics directly from applications, or from agent processes that make those metrics available. Some well-known monitoring software that uses pull is Nagios and Nagios-style systems (Icinga, Zabbix, Zenoss, and Sensu, to name a few). Prometheus is also one of the systems that embraces the pull approach, and it is quite opinionated about it.
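
The pull side is just as easy to picture: the monitoring system periodically issues an HTTP request to each target and reads back the current aggregates. The following sketch (target address assumed; no label or timestamp handling) pulls and roughly parses a Prometheus-style /metrics page:

```python
import urllib.request

def scrape(target: str) -> dict:
    """Pull the current aggregates from a target's metrics endpoint."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=5) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):   # skip HELP/TYPE comment lines
            name, _, value = line.rpartition(" ")
            samples[name] = float(value)
    return samples

# e.g. the app from the earlier whitebox example, exposed on localhost:8000
print(scrape("localhost:8000"))
```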

Push versus pull

Within the monitoring community, there is a long-running debate about the merits of each of these design decisions. The main point of contention is usually target discovery, which we will discuss in the following paragraphs.

In a push-based system, the monitored hosts and services make themselves known by reporting to the monitoring system. The advantage here is that no prior knowledge of new systems is required for them to be picked up. However, this means that the location of the monitoring service needs to be propagated to all targets, usually with some form of configuration management. Staleness is a big drawback of this approach: if a system hasn't reported in for a while, does that mean it is having problems, or was it deliberately decommissioned?

Furthermore, when you are managing a distributed fleet of hosts and services that push data to a central point, the risk of thundering herds (overload caused by many connections coming in at the same time) or of unforeseen floods of data caused by misconfiguration becomes greater, and mitigating them is complex and time-consuming.

In pull-based monitoring, the system needs a definitive list of hosts and services to monitor so that it can collect their metrics. Having a central source of truth provides some assurance that everything is where it is supposed to be, with the downside that said source of truth has to be maintained and kept up to date with every change. With how rapidly today's infrastructures change, some form of automatic discovery is required to keep up with the full picture. Having a centralized point of configuration also enables a quicker response in the event of problems or misconfiguration.

In the end, most of the drawbacks of either approach can be reduced or effectively worked around with clever design and automation. There are other, more important factors when choosing a monitoring tool, such as flexibility, ease of automation, maintainability, or broad support for the technologies in use.

Although Prometheus is a pull-based monitoring system, it also provides a way to ingest pushed metrics by using a gateway that converts push into pull. This is useful for monitoring a very narrow class of processes, as we will see later in the book. A minimal sketch of that workflow follows.
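
As a rough sketch of that workflow under assumed settings (a Pushgateway listening on localhost:9091, illustrative job and metric names), a short-lived batch job can push its metrics with the Python prometheus_client library, and Prometheus then pulls them from the gateway on its regular schedule:

```python
# pip install prometheus_client
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "batch_job_last_success_timestamp_seconds",
    "Unix time of the last successful run",
    registry=registry,
)

def run_backup() -> None:
    pass  # the short-lived batch work goes here

run_backup()
last_success.set_to_current_time()
# The job pushes once and exits; Prometheus later pulls from the gateway.
push_to_gateway("localhost:9091", job="nightly_backup", registry=registry)
```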

What to measure

When planning metrics collection, the question of which metrics should be observed inevitably arises. To answer it, we should turn to the current best practices and methodologies. In the following topics, we'll outline the most influential and well-regarded methodologies for reducing noise and improving visibility into performance and general reliability problems.

Google's four golden signals

Google's rationale regarding monitoring is straightforward. It plainly states that the four most important metrics to keep track of are the following (a small instrumentation sketch follows the list):

  • Latency: The time required to serve a request
  • Traffic: The number of requests being made
  • Errors: The rate of failing requests
  • Saturation: The amount of work not being processed, which is usually queued
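
As a rough sketch of what tracking the four signals can look like in code (Python prometheus_client, with illustrative metric names and a simulated workload), the example below records latency with a histogram, traffic and errors with counters, and saturation as the depth of a pending-work queue:

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

LATENCY = Histogram("http_request_duration_seconds", "Latency: time required to serve a request")
TRAFFIC = Counter("http_requests_total", "Traffic: number of requests being made")
ERRORS = Counter("http_request_errors_total", "Errors: rate of failing requests")
SATURATION = Gauge("http_request_queue_length", "Saturation: queued, unprocessed work")

def serve(queue_depth: int) -> None:
    SATURATION.set(queue_depth)
    TRAFFIC.inc()
    with LATENCY.time():                 # observes the elapsed time when the block exits
        time.sleep(random.uniform(0.01, 0.2))
        if random.random() < 0.05:       # simulate an occasional failure
            ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8001)              # metrics available at http://localhost:8001/metrics
    while True:
        serve(queue_depth=random.randint(0, 10))
```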

Brendan Gregg's USE method

Brendan's method is more machine-oriented. It states that for each resource (CPU, disk, network interface, and so on), the following metrics should be monitored (a utilization sketch follows the list):

  • Utilization: Measured as the percentage of the resource that was busy
  • Saturation: The amount of work the resource was not able to process, which is usually queued
  • Errors: Amount of errors that occurred
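
Utilization is the easiest of the three to picture. The sketch below (a Linux-only assumption, reading /proc/stat directly with no external dependencies) samples the aggregate CPU counters twice and derives the busy percentage over the interval:

```python
import time

def cpu_times():
    """Read the aggregate CPU counters from the first line of /proc/stat (Linux only)."""
    with open("/proc/stat") as f:
        fields = [float(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]          # idle + iowait are "not busy"
    return idle, sum(fields)

def cpu_utilization(interval: float = 1.0) -> float:
    """USE 'utilization': percentage of time the CPU was busy over the interval."""
    idle1, total1 = cpu_times()
    time.sleep(interval)
    idle2, total2 = cpu_times()
    busy = (total2 - total1) - (idle2 - idle1)
    return 100.0 * busy / (total2 - total1)

print(f"CPU utilization: {cpu_utilization():.1f}%")
```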

Tom Wilkie's RED method

The RED method is more focused on a service-level approach rather than on the underlying systems themselves. Evidently, this strategy is useful for monitoring services and is also valuable for predicting the experience of external clients. If a service's error rate increases, it is reasonable to assume that those errors will impact the customer experience, directly or indirectly. These are the metrics to watch for (a small decorator-based sketch follows the list):

  • Rate: Translated as requests per second
  • Errors: The amount of failing requests per second
  • Duration: The time taken by those requests
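
One common way to apply RED is to wrap every service entry point with the same three instruments. The following Python sketch (prometheus_client, with a hypothetical checkout endpoint) does this with a decorator that counts requests, counts failures, and observes duration:

```python
# pip install prometheus_client
import functools
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter("service_requests_total", "Rate: requests (per-second rate is derived at query time)", ["endpoint"])
ERRORS = Counter("service_errors_total", "Errors: failing requests", ["endpoint"])
DURATION = Histogram("service_request_duration_seconds", "Duration: time taken by requests", ["endpoint"])

def red_instrumented(endpoint: str):
    """Decorator recording the three RED metrics for a service function."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            REQUESTS.labels(endpoint=endpoint).inc()
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                ERRORS.labels(endpoint=endpoint).inc()
                raise
            finally:
                DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)
        return inner
    return wrap

@red_instrumented("checkout")
def checkout(order_id: str) -> str:
    return f"order {order_id} accepted"
```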

Summary

In this chapter, we had the chance to understand the real value of monitoring and how the term can be used in a specific context, including the one used throughout this book. This will help you avoid misconceptions and make sure you clearly understand where this book stands on the topic. We also covered the different facets of monitoring, such as metrics, logging, tracing, alerting, and visualization, while touching on observability and the advantages it brings. Whitebox and blackbox monitoring were addressed, laying the groundwork for understanding the benefits of working with metrics. With that knowledge about metrics, we looked at the push and pull approaches and the arguments around each, and then closed with which metrics to track on the systems you manage.

In the next chapter, we will take a bird's-eye view of the Prometheus ecosystem and go over several of its components.

Questions

  1. Why is monitoring so hard to define clearly?
  2. Does a high latency of metrics impact the work of a system administrator who's focused on fixing a live incident?
  3. What are the monitoring requirements to properly do capacity planning?
  4. Is logging considered monitoring?
  5. Regarding the available strategies for metrics collection, what are the downsides of using the push-based approach?
  6. If you had to choose three basic metrics from a generic web service to focus on, which would they be?
  7. When a check verifies whether a given process is running on a host by way of listing the running processes in said host, is that whitebox or blackbox monitoring?