Monitoring Logs and Metrics

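In real-world operations, the ability to quickly detect and debug problems is critical. In this chapter, we will discuss the two most important tools we can use to discover what's happening in a production cluster that is processing a high number of requests. The first tool is logs, which help us understand what happens within a single request; the other is metrics, which summarize the aggregated performance of the system.

The following topics will be covered in this chapter: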

Observability of a live system
Setting up logs
Detecting problems through logs
Setting up metrics
Being proactive

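By the end of this chapter, you'll know how to add logs so that they are available to detect problems, how to add and plot metrics, and the differences between the two.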

Technical requirements

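We will use the example system for this chapter and tweak it to include centralized logging and metrics. The code for this chapter can be found in this book's GitHub repository: https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/tree/master/Chapter10.

To install the cluster, you need to build each individual microservice: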

$ cd Chapter10/microservices/
$ cd frontend
$ docker-compose build
...
$ cd ../thoughts_backend
$ docker-compose build
...
$ cd ../users_backend
$ docker-compose build
...

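The microservices in this chapter are the same ones that we introduced previously, but they add extra log and metrics configuration.

Now, we need to create the example namespace and start the Kubernetes cluster using the configuration found in the Chapter10/kubernetes subdirectory: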

$ cd Chapter10/kubernetes
$ kubectl create namespace example
$ kubectl apply --recursive -f .
...

To be able to access the different services, you need to update your /etc/hosts file so that it includes the following lines:

127.0.0.1 thoughts.example.local
127.0.0.1 users.example.local
127.0.0.1 frontend.example.local
127.0.0.1 syslog.example.local
127.0.0.1 prometheus.example.local
127.0.0.1 grafana.example.local

With that, you will be able to access the logs and metrics for this chapter.

Observability of a live system

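Observability is the capability of knowing what's going on in a live system. We can be dealing with low-observability systems, where we have no way of knowing what's going on, or high-observability systems, where we can infer events and internal state from the outside through tools.

Observability is a property of the system itself. Monitoring, in turn, is the action of obtaining information about the current or past state of the system. It's all a bit of a naming debate, but you monitor the system to collect the observable parts of it.

For the most part, monitoring is easy. There are plenty of great tools that help us capture and analyze information and present it in all kinds of ways. However, the system needs to expose the relevant information so that it can be collected.

Exposing the correct amount of information is difficult. Too much information produces a lot of noise that hides the relevant signal. Too little information is not enough to detect problems. In this chapter, we will look at different strategies to deal with this, but every system has to explore and discover this on its own. Expect to experiment and make changes in your own system!

Distributed systems, such as those that follow a microservices architecture, present an additional problem, since the complexity of the system can make its internal state difficult to understand. In some cases, behavior can also be unpredictable. A system at this scale is inherently never completely healthy; there will always be minor problems here and there. You need to develop a priority system to determine which problems require immediate action and which can be solved at a later stage.

The main tools for observability in microservices are logs and metrics. They are well understood and widely used by the community, and there are plenty of tools that greatly simplify their use, both as packages that can be installed locally and as cloud services that help with data retention and reduce maintenance costs.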

Using cloud services for monitoring will save you from maintenance costs. We will talk about this later in the Setting up logs and Setting up metrics sections.

Another alternative when it comes to observability is services such as Datadog ( https://www.datadoghq.com/ ) and New Relic ( https://newrelic.com/ ). They receive events, normally logs, and are able to derive metrics from them.

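As we saw in previous chapters, the most important details of the cluster's state can be checked through kubectl. This includes details such as the versions that are deployed, restarts, image pulls, and so on.

For production environments, it may be good to deploy a web-based tool to display this kind of information. Check out Weave Scope, an open source tool that shows data in a web page, similar to what can be obtained with kubectl, but in a nicer and more graphical way. You can find out more about this tool here: https://www.weave.works/oss/scope/.

Logs and metrics have different objectives, and both can be intricate. We will look at some common usages of them in this book.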

Understanding logs

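Logs track unique events that occur in the system. Each log stores a message that is produced when a specific part of the code is executed. Logs can be totally generic (function X is called) or include specific details (function X is called with parameter A).

The most common format for logs is to generate them as plain strings. This is very flexible, and log-related tools normally work with text searches.

Each log includes some metadata about who produced it, when it was created, and so on. This is also normally encoded as text, at the beginning of the log. A standard format helps with sorting and filtering.

Logs also include a severity level. This allows for categorization so that we can capture the importance of the messages. The severity level can be, in increasing order of importance, DEBUG, INFO, WARNING, or ERROR. The severity allows us to filter out unimportant logs and determine the actions that we should take. Logging facilities can be configured with a threshold; less severe logs are ignored.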

There are many severity levels, and you can define custom intermediate levels if you wish. However, this isn't very useful except in very specific situations. Later in this chapter, in the Detecting problems through logs section, we will describe how to set a strategy per level; too many levels can add confusion.
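To make the threshold idea concrete, here is a minimal sketch that uses only Python's standard logging module. The logger name and the in-memory ListHandler are assumptions made for illustration; they are not part of the chapter's code.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps messages in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f'{record.levelname}: {record.getMessage()}')


logger = logging.getLogger('threshold_example')  # illustrative name
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

# Threshold: anything less severe than WARNING is ignored
logger.setLevel(logging.WARNING)

logger.debug('Function X called')
logger.info('Function X called with parameter A')
logger.warning('Retrying connection')
logger.error('Could not connect')

print(handler.messages)
```

Only the WARNING and ERROR messages reach the handler; lowering the threshold to INFO or DEBUG lets the rest through without touching the calls that produce the logs.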

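In a web service environment, most logs will be generated as part of the response to a web request. This means that a request arrives at the system, is processed, and returns a value. Several logs are generated along the way. Keep in mind that, in a system under load, multiple requests happen simultaneously, so the logs from multiple requests are also generated simultaneously. For example, note how the second log comes from a different IP: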

Aug 15 00:15:15.100 10.1.0.90 INFO app: REQUEST GET /endpoint
Aug 15 00:15:15.153 10.1.0.92 INFO api: REQUEST GET /api/endpoint
Aug 15 00:15:15.175 10.1.0.90 INFO app: RESPONSE TIME 4 ms
Aug 15 00:15:15.210 10.1.0.90 INFO app: RESPONSE STATUS 200

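A common request ID can be added to group all the related logs that are produced for a single request. We will see how to do this later in this chapter.

Each individual log can be relatively big and, in aggregate, logs use significant disk space. In a system under load, logs can quickly grow out of proportion. The different log systems allow us to tweak their retention time, which means that we only keep them for a certain amount of time. Finding the balance between keeping logs to see what happened in the past and using a sane amount of space is important.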

Be sure to check the retention policies when enabling any new log service, whether it be local or cloud-based. You won't be able to analyze what happened before the time window. Double-check that logs accumulate at the rate you expect; you don't want to find out that you went unexpectedly over quota while you were tracking a bug.

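Some tools allow us to use raw logs to generate aggregated results. They can count the number of times a particular log appears and generate the average count per minute or other statistics. This is expensive, though, as each log takes up space. To observe this aggregated behavior, it is better to use a specific metrics system.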
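As a rough illustration of deriving aggregates from raw logs, the following sketch counts REQUEST logs per minute. The sample lines mimic the format shown earlier, and the requests_per_minute helper is a hypothetical name; real log tools do this at a much larger scale.

```python
from collections import Counter

# Sample lines in the same layout as the logs shown earlier (illustrative data)
LOG_LINES = [
    'Aug 15 00:15:15.100 10.1.0.90 INFO app: REQUEST GET /endpoint',
    'Aug 15 00:15:15.153 10.1.0.92 INFO api: REQUEST GET /api/endpoint',
    'Aug 15 00:15:15.175 10.1.0.90 INFO app: RESPONSE TIME 4 ms',
    'Aug 15 00:16:02.210 10.1.0.90 INFO app: REQUEST GET /endpoint',
]


def requests_per_minute(lines):
    """Count REQUEST logs, grouped by 'Mon DD HH:MM'."""
    counts = Counter()
    for line in lines:
        if ' REQUEST ' not in line:
            continue
        month, day, timestamp = line.split()[:3]
        minute = timestamp[:5]  # keep only 'HH:MM'
        counts[f'{month} {day} {minute}'] += 1
    return counts


per_minute = requests_per_minute(LOG_LINES)
print(dict(per_minute))
```

Every line still has to be stored and scanned to obtain two small numbers, which is why a dedicated metrics system is the cheaper option for this kind of question.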

Understanding metrics

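Metrics deal with aggregated information. They show information related not to a single event, but to a group of them. This allows us to check the general status of the cluster in a better way than using logs.

We will use typical examples related to web services, mainly dealing with request metrics, but don't feel restricted by them. You can generate your own metrics that are specific to your service!

Where a log keeps information about each individual event, metrics reduce the information to the number of times the event happens or reduce it to a value that can then be averaged or aggregated in a certain way.

This makes metrics much more lightweight than logs and allows us to plot them against time. Metrics present information such as the number of requests per minute, the average time for a request during a minute, the number of queued requests, the number of errors per minute, and so on.

The resolution of the metrics may depend on the tool that is used to aggregate them. Keep in mind that a higher resolution requires more resources. A typical resolution is 1 minute, which is small enough to present detailed information unless you have a very active system that receives 10 or more requests per second.

Capturing and analyzing information related to performance, such as the average request time, allows us to detect possible bottlenecks and act quickly in order to improve the performance of the system. This is much easier to deal with on average, since a single request may not capture enough information for us to see the big picture. It also helps us predict future bottlenecks.

There are many different kinds of metrics, depending on the tool that's used. The most commonly supported ones are as follows: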

Counter: A trigger is generated each time something happens. This will be counted and aggregated. An example of this is the number of requests and the number of errors.
Gauge: A single number that is unique. It can go up or down, but the last value overwrites the previous. An example of this is the number of requests in the queue and the number of available workers.
Measure: Events that have a number associated with them. These numbers can be averaged, summed, or aggregated in some way. Compared with gauges, the difference is that previous measures are still independent; examples are the request time in milliseconds and the request size in bytes. Measures can also work as counters since the number of them can be important; for example, tracking the request time also counts the number of requests.
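The three kinds of metrics can be sketched as tiny in-memory classes. This is only an illustration of their semantics under assumed names (SimpleCounter, SimpleGauge, SimpleMeasure); it is not how any particular metrics library implements them.

```python
class SimpleCounter:
    """Counts how many times something happened."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class SimpleGauge:
    """Holds a single number; the last value overwrites the previous one."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class SimpleMeasure:
    """Aggregates independent observations; also works as a counter."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.count += 1
        self.total += value

    def average(self):
        return self.total / self.count if self.count else 0.0


errors = SimpleCounter()
queue_size = SimpleGauge()
request_time_ms = SimpleMeasure()

errors.inc()
errors.inc()
queue_size.set(7)
queue_size.set(3)          # overwrites 7
for elapsed in (4, 6, 20):
    request_time_ms.observe(elapsed)

print(errors.value, queue_size.value, request_time_ms.count, request_time_ms.average())
```

Note how the measure keeps both the number of observations and their sum, which is what lets it double as a request counter.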

There are two main ways in which metrics work:

Each time something happens, an event gets pushed toward the metrics collector.
Each system maintains its own metrics, which are then pulled by the metrics system periodically.

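Each way has its own pros and cons. Pushing events produces higher traffic, as every event needs to be sent; this can cause bottlenecks and delays. Pulling only samples the information, so it misses exactly what happened between samples, but it's inherently more scalable.

While both approaches are in use, the trend is moving toward pulling systems for metrics. They reduce the maintenance that push systems require and are much easier to scale.

We will set up Prometheus, which uses the second approach. The most widely used exponent of the first approach is Graphite.

Metrics can also be combined to generate other metrics; for example, we can divide the number of requests that return errors by the total number of requests to obtain the error ratio. Such derived metrics can help us present information in a meaningful way.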
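A derived metric such as the error ratio is plain arithmetic over two counters. In this small sketch, the error_ratio helper and the sample numbers are made up for illustration.

```python
def error_ratio(error_requests, total_requests):
    """Fraction of requests that returned an error (0.0 when there is no traffic)."""
    if total_requests == 0:
        return 0.0
    return error_requests / total_requests


# Hypothetical counters sampled over the same time window
ratio = error_ratio(error_requests=5, total_requests=200)
print(f'{ratio:.1%}')
```

Guarding against a total of zero matters in practice, since quiet periods would otherwise produce a division error instead of a flat line.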

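Multiple metrics can be displayed in dashboards so that we can understand the status of a service or cluster. At a glance, these graphical tools allow us to detect the general state of the system. We will set up Grafana so that it displays this graphical information.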

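Compared to logs, metrics take up much less space and can capture a bigger window of time. It's even possible to keep metrics for the whole life of the system. This differs from logs, which can never be stored for that long.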

Setting up logs

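We will centralize all the logs that are generated by the system into a single pod. In local development, this pod will expose all the received logs through a web interface.

The logs will be sent over the syslog protocol, which is the most standard way of transmitting them. There's native support for syslog in Python, as well as in virtually any system that deals with logging and has Unix support.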

Using a single container makes it easy to aggregate logs. In production, this system should be replaced with a container that relays the received logs to a cloud service such as Loggly or Splunk.

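There are multiple syslog servers that are capable of receiving logs and aggregating them; syslog-ng (https://www.syslog-ng.com/) and rsyslog (https://www.rsyslog.com/) are the most common ones. The simplest approach is to receive the logs and store them in a file. Let's start a container with an rsyslog server that will store the received logs.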

Setting up an rsyslog container

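In this section, we will create our own rsyslog server. This is a very simple container, and you can check the docker-compose file and Dockerfile on GitHub for more information regarding logs (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/tree/master/Chapter10/kubernetes/logs).

We will set up logs using the UDP protocol. This is the standard protocol for syslog, but it's less common than the usual HTTP over TCP that's used for web development.

The main difference is that UDP is connectionless, so the log is sent and no confirmation of its delivery is received. This makes UDP lighter and faster, but also less reliable. If there's a problem in the network, some logs may disappear without warning.

This is normally an adequate trade-off, since the number of logs is high and the implications of losing a few of them aren't big. syslog can also work over TCP, which increases reliability but also reduces the performance of the system.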
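To see the fire-and-forget nature of UDP syslog, the following sketch sends one log from Python's SysLogHandler to a plain UDP socket that stands in for the rsyslog server. The local socket, the ephemeral port, and the logger name are assumptions made for this illustration; the chapter's setup uses the syslog service on port 5140 instead.

```python
import logging
import logging.handlers
import socket

# Stand-in for the syslog server: a UDP socket on an ephemeral local port
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(('127.0.0.1', 0))
server.settimeout(2)
host, port = server.getsockname()

logger = logging.getLogger('udp_example')  # illustrative name
logger.setLevel(logging.INFO)
logger.propagate = False

# SysLogHandler uses UDP datagrams by default when given a (host, port) address
handler = logging.handlers.SysLogHandler(address=(host, port))
logger.addHandler(handler)

# The sender gets no acknowledgement; it never learns whether this arrived
logger.info('REQUEST GET /endpoint')

datagram, _ = server.recvfrom(4096)
print(datagram)

handler.close()
server.close()
```

If nothing were listening on the port, the logger.info call would still return normally, which is exactly the silent loss described above.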

The Dockerfile installs rsyslog and copies its configuration file:

FROM alpine:3.9

RUN apk add --update rsyslog

COPY rsyslog.conf /etc/rsyslog.d/rsyslog.conf

The configuration file mainly starts the server on port 5140 and stores the received logs in /var/log/syslog:

# Start a UDP listen port at 5140
module(load="imudp")
input(type="imudp" port="5140")
...
# Store the received files in /var/log/syslog, and enable rotation
$outchannel log_rotation,/var/log/syslog, 5000000,/bin/rm /var/log/syslog

With log rotation, we set a limit on the size of the /var/log/syslog file so that it doesn't grow without limits.

We can build the container with the usual docker-compose command:

$ docker-compose build
Building rsyslog
...
Successfully built 560bf048c48a
Successfully tagged rsyslog:latest

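Around this container, we will define a combination of a pod, a service, and an Ingress, as we did with the other microservices, to collect logs and allow external access from a browser.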

Defining the syslog pod

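The syslog pod will contain the rsyslog container and another container to display the logs.

To display the logs, we will use front rail (the frontail project), an application that streams log files to a web server. We need to share the file between both containers in the same pod, and the simplest way to do this is through a volume.

We control the pod using a deployment. You can check the deployment configuration file at https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/blob/master/Chapter10/kubernetes/logs/deployment.yaml. Let's take a look at its most interesting parts in the following subsections.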

log-volume

log-volume creates an empty directory that is shared across both containers:

  volumes:
  - emptyDir: {}
    name: log-volume

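This allows the containers to communicate while storing information in a file. The syslog container will write to it, while the front rail one will read from it.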

syslog container

The syslog container starts an rsyslogd process:

spec:
  containers:
  - name: syslog
    command:
      - rsyslogd
      - -n
      - -f
      - /etc/rsyslog.d/rsyslog.conf
    image: rsyslog:latest
    imagePullPolicy: Never
    ports:
      - containerPort: 5140
        protocol: UDP
    volumeMounts:
      - mountPath: /var/log
        name: log-volume

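The rsyslogd -n -f /etc/rsyslog.d/rsyslog.conf command starts the server with the configuration file we described previously. The -n parameter keeps the process in the foreground, thereby keeping the container running.

The UDP port 5140, which is the defined port to receive logs, is specified, and log-volume is mounted to /var/log. Later in the file, log-volume is defined.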

The front rail container

The front rail container is started from the official container image:

  - name: frontrail
    args:
    - --ui-highlight
    - /var/log/syslog
    - -n
    - "1000"
    image: mthenw/frontail:4.6.0
    imagePullPolicy: Always
    ports:
    - containerPort: 9001
      protocol: TCP
    resources: {}
    volumeMounts:
    - mountPath: /var/log
      name: log-volume

We start it with the frontrail /var/log/syslog command, specify port 9001 (which is the one we use to access frontrail), and mount /var/log, just like we did with the syslog container, to share the log file.

Allowing external access

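As we did with the other microservices, we will create a service and an Ingress. The service will be used by the other microservices so that they can send their logs. The Ingress will be used to access the web interface so that we can see the logs as they arrive.

The YAML files are on GitHub (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/tree/master/Chapter10/kubernetes/logs), in the service.yaml and ingress.yaml files.

The service is very straightforward; the only peculiarity is that it has two ports, one TCP port and one UDP port, and each one connects to a different container: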

spec:
  ports:
  - name: fronttail
    port: 9001
    protocol: TCP
    targetPort: 9001
  - name: syslog
    port: 5140
    protocol: UDP
    targetPort: 5140

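The Ingress only exposes the front rail port, which means we can access it through the browser. Remember that the DNS name needs to be added to your /etc/hosts file, as described at the start of this chapter: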

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: syslog-ingress
  namespace: example
spec:
  rules:
  - host: syslog.example.local
    http:
      paths:
      - backend:
          serviceName: syslog
          servicePort: 9001
        path: /

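Going to http://syslog.example.local in your browser will allow you to access the front rail interface:

You can filter the logs using the box in the top-right corner.

Remember that, most of the time, logs reflect the readiness and liveness probes, as shown in the preceding screenshot. The more health checks you have in your system, the more noise you'll get.

You can filter them out at the syslog level by configuring the rsyslog.conf file, but be careful not to leave out any relevant information.

Now, we need to see how the other microservices configure and send their logs here.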

Sending logs

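We need to configure the microservices in uWSGI so that we can forward the logs to the logging service. We will use the Thoughts Backend as an example, even though the Frontend and Users Backend, which can be found under the Chapter10/microservices directory, also have this configuration enabled.

Open the uwsgi.ini configuration file (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/blob/master/Chapter10/microservices/thoughts_backend/docker/app/uwsgi.ini). You'll see the following lines: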

# Log to the logger container
logger = rsyslog:syslog:5140,thoughts_backend

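This sends the logs, in rsyslog format, to the syslog service on port 5140. We also add the facility, which is where the logs come from. This adds the string to all the logs coming from this service, which helps with sorting and filtering. Each uwsgi.ini file should have its own facility to help with filtering.

In old systems that support the syslog protocol, the facility needs to fit predetermined values such as KERN, LOCAL_7, and so on. But in most modern systems, this is an arbitrary string that can take any value.

Automatic logs by uWSGI are interesting, but we also need to set up our own logs for custom tracking. Let's see how.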

Generating application logs

Flask automatically configures a logger for the app. We need to add a log in the following way, as shown in the api_namespace.py file (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/blob/master/Chapter10/microservices/thoughts_backend/ThoughtsBackend/thoughts_backend/api_namespace.py#L102):

from flask import current_app as app

...
if search_param:
    param = f'%{search_param}%'
    app.logger.info(f'Searching with params {param}')
    query = (query.filter(ThoughtModel.text.ilike(param)))

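app.logger can call .debug, .info, .warning, or .error to generate a log. Note that app can be retrieved by importing current_app.

The logger follows the standard logging module in Python. It can be configured in different ways. Take a look at the app.py file (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/blob/master/Chapter10/microservices/thoughts_backend/ThoughtsBackend/thoughts_backend/app.py) to view the different configurations we'll be going through in the following subsections.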

Dictionary configuration

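The first level of logging is configured through dictConfig, a function from Python's standard logging.config module that lets us configure the logs in the way defined in the Python documentation (https://docs.python.org/3.7/library/logging.config.html). You can check the definition of the logging in the app.py file: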

from logging.config import dictConfig

dictConfig({
    'version': 1,
    'formatters': {
        'default': {
            'format': '[%(asctime)s] %(levelname)s in '
                      '%(module)s: %(message)s',
        }
    },
    'handlers': {
        'wsgi': {
            'class': 'logging.StreamHandler',
            'stream': 'ext://flask.logging.wsgi_errors_stream',
            'formatter': 'default'
        }
    },
    'root': {
        'level': 'INFO',
        'handlers': ['wsgi']
    }
})

The dictConfig dictionary has three main levels:

formatters: This checks how the log is formatted. To define the format, you can use the automatic values that are available in the Python documentation (https://docs.python.org/3/library/logging.html#logrecord-attributes). This gathers information for every log.
handlers: This checks where the log goes to. You can assign one or more to the loggers. We defined a handler called wsgi and configured it so that it goes up, toward uWSGI.
root: This is the top level for logs, so anything that wasn't previously logged will refer to this level. We configure the INFO logging level here.

This sets up the default configuration so that we don't miss any logs. However, we can create even more complex logging handlers.

Logging a request ID

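One of the problems when analyzing a large number of logs is correlating them. We need to see which ones are related to each other. One possibility is to filter logs by the pod that's generating them, which is stored at the start of the log (for example, 10-1-0-27.frontend-service.example.svc.cluster.local). This is analogous to the host generating the logs. This process, however, is cumbersome and, in some cases, a single container can process two requests simultaneously. We need a unique identifier per request that gets added to all the logs for a single request.

To do so, we will use the flask-request-id-header package (https://pypi.org/project/flask-request-id-header/). This adds an X-Request-ID header (if not present) that we can use to log each individual request.

Why do we set up a header instead of storing a randomly generated value in memory for the request? This is a common pattern that allows us to inject the request ID into the backend. The request ID allows us to carry over the same request identifier through the life cycle of a request for different microservices. For example, we can generate it on the Frontend and pass it over to the Thoughts Backend so that we can trace several internal requests that have the same origin.

Although we won't include this in our example for simplicity, as a microservices system grows, this becomes crucial for determining flows and origins. Generating a module so that we can automatically pass it over to internal calls is a good investment.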
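As a rough idea of what such a module could do, the following sketch builds the headers for an internal call, reusing the incoming X-Request-ID or creating one if it's missing. The outgoing_headers helper is hypothetical and isn't part of the example code; how the headers are attached depends on the HTTP client in use.

```python
import uuid

REQUEST_ID_HEADER = 'X-Request-ID'


def outgoing_headers(incoming_headers):
    """Return the headers to attach to an internal call (hypothetical helper).

    Reuses the request ID received from upstream, or generates a new one
    when this service is the first to handle the request.
    """
    request_id = incoming_headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
    return {REQUEST_ID_HEADER: request_id}


# The ID received from upstream is carried over unchanged
forwarded = outgoing_headers({'X-Request-ID': '63517c17-5a40-4856-9f3b-904b180688f6'})

# Without an incoming ID, a fresh one is generated
generated = outgoing_headers({})
print(forwarded, generated)
```

Keeping this logic in one place means that every internal call carries the same identifier, so the logs from all the services involved can be filtered with a single search.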

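The following diagram shows the flow between the Frontend and the two services. Note that the X-Request-ID header is not set when the request arrives at the Frontend service, and that it needs to be forwarded to any call:

We also need to send the logs straight to the syslog service, so we will create a handler that does this for us.

When executing code from a script, as opposed to running the code in a web server, we don't use this handler. When running a script directly, we want our logs to go to the default logger we defined previously. In create_app, we will set up a parameter to differentiate between them.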

The Python logging module has a lot of interesting features. Check out the Python documentation for more information ( https://docs.python.org/3/library/logging.html ).

Setting logs properly is trickier than it looks. Don't be discouraged and keep tweaking them until they work.

We will set up all the logging configuration in the app.py file. Let's break up each part of the configuration:

First, we will generate a formatter that appends the request_id so that it's available when generating logs:

import logging

from flask import has_request_context, request


class RequestFormatter(logging.Formatter):
    ''' Inject the HTTP_X_REQUEST_ID to format logs '''

    def format(self, record):
        record.request_id = 'NA'

        if has_request_context():
            record.request_id = request.environ.get("HTTP_X_REQUEST_ID")

        return super().format(record)

As you can see, the HTTP_X_REQUEST_ID header is available in the request.environ variable.
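The same idea can be tried outside Flask. The following sketch swaps the Flask request context for a contextvars variable, which is an assumption made so that the example runs with the standard library alone; the names ContextRequestFormatter and current_request_id are illustrative.

```python
import contextvars
import io
import logging

# Stand-in for the Flask request context: holds the ID of the current request
current_request_id = contextvars.ContextVar('current_request_id', default='NA')


class ContextRequestFormatter(logging.Formatter):
    """Inject the current request ID into every record before formatting."""

    def format(self, record):
        record.request_id = current_request_id.get()
        return super().format(record)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    ContextRequestFormatter('%(levelname)s [%(request_id)s] %(message)s'))

logger = logging.getLogger('request_id_example')  # illustrative name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info('outside a request')
current_request_id.set('63517c17')
logger.info('inside a request')

output = stream.getvalue().splitlines()
print(output)
```

As in RequestFormatter, logs produced with no request in progress fall back to NA, so the format string never fails on a missing request_id attribute.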

Later, in create_app, we will set up the handler that we append to the application logger:

# Enable RequestId
application.config['REQUEST_ID_UNIQUE_VALUE_PREFIX'] = ''
RequestID(application)

if not script:
    # For scripts, it should not connect to Syslog
    # (requires `import logging.handlers` at the top of the module)
    handler = logging.handlers.SysLogHandler(('syslog', 5140))
    req_format = ('[%(asctime)s] %(levelname)s [%(request_id)s] '
                  '%(module)s: %(message)s')
    handler.setFormatter(RequestFormatter(req_format))
    handler.setLevel(logging.INFO)
    application.logger.addHandler(handler)
    # Do not propagate to avoid log duplication
    application.logger.propagate = False

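We only set up the handler if the run happens outside a script. SysLogHandler is included in Python. After this, we set up the format, which includes request_id. The formatter uses the RequestFormatter that we defined previously.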

Here, we are hardcoding the values of the logger level to INFO and the syslog host to syslog , which corresponds to the service. Kubernetes will resolve this DNS correctly. Both values can be passed through environment variables, but we didn't do this here for the sake of simplicity.

Propagation is disabled on the logger so that logs aren't also sent to the root logger, which would duplicate them.

Logging each request

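There are some common elements in every request that we need to capture. Flask allows us to execute code before and after a request, so we can use that to log the common elements of each request. Let's learn how to do this.

In the app.py file, we will define the logging_before function: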

from time import time

from flask import current_app, g, request

def logging_before():
    msg = 'REQUEST {REQUEST_METHOD} {REQUEST_URI}'.format(**request.environ)
    current_app.logger.info(msg)

    # Store the start time for the request
    g.start_time = time()

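This creates a log with the word REQUEST and two essential parts of each request, the method and the URI, which come from request.environ. Then, it is added to an INFO log with the app logger.

We also use the g object to store the time when the request started. The g object allows us to store values for the duration of a request. We will use this to calculate the time the request takes.

There's the corresponding logging_after function as well. It gathers the time at the end of the request and calculates the difference in milliseconds: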

def logging_after(response):
    # Get total time in milliseconds
    total_time = time() - g.start_time
    time_in_ms = int(total_time * 1000)
    msg = f'RESPONSE TIME {time_in_ms} ms'
    current_app.logger.info(msg)

    msg = f'RESPONSE STATUS {response.status_code.value}'
    current_app.logger.info(msg)

    # Store metrics
    ...

    return response

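This will allow us to detect requests that are taking longer, and the times will be stored in metrics, as we will see in the following section.

Then, the functions are enabled in the create_app function: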

def create_app(script=False):
    ...
    application = Flask(__name__)
    application.before_request(logging_before)
    application.after_request(logging_after)

This creates a set of logs each time we generate a request.

With the logs generated, we can search for them in the frontrail interface.

Searching through all the logs

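All the different logs from the different applications are centralized and available to search for at http://syslog.example.local.

If you make a call to http://frontend.example.local/search?search=speak to search for thoughts, you will see the corresponding Thoughts Backend logs, as shown in the following screenshot:

We can filter by the request ID, that is, 63517c17-5a40-4856-9f3b-904b180688f6, to get the Thoughts Backend request logs. Just after this are the thoughts_backend_uwsgi and frontend_uwsgi request logs, which show the flow of the request.

Here, you can see all the elements we talked about previously: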

The REQUEST log before the request
The api_namespace request, which contains app data
The after RESPONSE logs, which contain the result and time

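In the code for the Thoughts Backend, we left an error on purpose. It will be triggered if a user tries to share a new thought. We will use this to learn how to debug issues through logs.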

Detecting problems through logs

For any problem in your running system, there are two kinds of errors that can occur: expected errors and unexpected errors.

Detecting expected errors

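Expected errors are errors that are raised by explicitly creating an ERROR log in the code. If an error log is being generated, this means that it reflects a situation that has been planned in advance; for example, you can't connect to the database, or there is some data stored in an old, deprecated format. We don't want this to happen, but we saw the possibility of it happening and prepared the code to deal with it. They normally describe the problem well, even if the solution isn't obvious.

They are relatively easy to deal with since they describe foreseen problems.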

Capturing unexpected errors

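Unexpected errors are the other kind of errors that can occur. Things break in unforeseen ways. Unexpected errors are normally produced by Python exceptions being raised at some point in the code and not being caught.

If logging has been properly configured, any exceptions or errors that haven't been caught will trigger an ERROR log, which will include the stack trace. These errors may not be immediately obvious and will require further investigation.

To help explain these errors, we introduced a special error in the code for Chapter10. You can check the code on GitHub (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/tree/master/Chapter10/microservices/thoughts_backend/ThoughtsBackend/thoughts_backend). This simulates an unexpected exception.

When trying to post a new thought for a logged-in user, we get a strange behavior and see the following error in the logs. As shown in the top-right corner of the following screenshot, we are filtering by ERROR to narrow down the problem:

As you can see, the stack trace is displayed in a single line. This may depend on how you capture and display the logs. Flask will automatically generate an HTTP response with a status code of 500. This may trigger more errors along the path if the caller isn't ready to receive a 500 response.

Then, the stack trace will let you know what broke. In this case, we can see that there's a raise Exception command in the api_namespace.py file at line 80. This allows us to locate the exception.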

Since this is a synthetic error that's been generated specifically as an example, it is actually easy to find out the root cause. In the example code, we are explicitly raising an exception, which produces an error. This may not be the case in a real use case, where the exception could be generated in a different place than the actual error. Exceptions can also originate in a different microservice within the same cluster.

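Once an error has been detected, the objective should be to replicate it with a unit test in the microservice in order to generate the exception. This will allow us to replicate the conditions in a controlled environment.

If we run the tests for the Thoughts Backend code that's available in Chapter10, we will see errors because of this. Note that the logs are being displayed in failing tests: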

$ docker-compose run test
...
___ ERROR at setup of test_get_non_existing_thought ___
-------- Captured log setup ---------
INFO flask.app:app.py:46 REQUEST POST /api/me/thoughts/
INFO flask.app:token_validation.py:66 Header successfully validated
ERROR flask.app:app.py:1761 Exception on /api/me/thoughts/ [POST]
Traceback (most recent call last):
  File "/opt/venv/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/venv/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/venv/lib/python3.6/site-packages/flask_restplus/api.py", line 325, in wrapper
    resp = resource(*args, **kwargs)
  File "/opt/venv/lib/python3.6/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/opt/venv/lib/python3.6/site-packages/flask_restplus/resource.py", line 44, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/opt/venv/lib/python3.6/site-packages/flask_restplus/marshalling.py", line 136, in wrapper
    resp = f(*args, **kwargs)
  File "/opt/code/thoughts_backend/api_namespace.py", line 80, in post
    raise Exception('Unexpected error!')
Exception: Unexpected error!
INFO flask.app:app.py:57 RESPONSE TIME 3 ms
INFO flask.app:app.py:60 RESPONSE STATUS 500

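Once the error has been reproduced in a unit test, fixing it is often trivial. Add a unit test to capture the set of conditions that trigger the error and then fix it. The new unit test will detect whether the error has been reintroduced on each automated build.

To fix the example code, remove the raise line of code. Then, things will work again.

Sometimes, the problem cannot be solved since it may be external. Maybe there's a problem in some of the rows in our database, or maybe another service is returning incorrectly formatted data. In those cases, we can't completely avoid the root cause of the error. However, it's possible to capture the problem, do some remediation, and move from an unexpected error to an expected error.

Note that not every detected unexpected error is worth spending time on. Sometimes, uncaught errors provide enough information on what the problem is, which is out of the realm of what the web service should handle; for example, there may be a network problem and the web service can't connect to the database. Use your judgment when deciding where to spend development time.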

Logging strategy

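There's a recurring problem when we're dealing with logs. What is the adequate level for a particular message? Is this a WARNING or an ERROR? Should this be an INFO statement?

Most log level descriptions use definitions such as "the program shows a potentially harmful situation" or "the program highlights the progress of the request". These are vague and not very useful in a real-life environment. Instead, try to define each log level by relating it to the expected follow-up action. This helps to provide clarity on what to do when a log of a particular level is found.

The following table shows some examples of the different levels and what action should be taken: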

Log level	Action to take	Comments
`DEBUG`	Nothing.	Not tracked.
`INFO`	Nothing.	The `INFO` logs show generic information about the flow of the request to help track problems.
`WARNING`	Track number. Alert on raising levels.	The `WARNING` logs track errors that have been automatically fixed, such as retries to connect (but finally connecting) or fixable formatting errors in the database's data. A sudden increase may require investigation.
`ERROR`	Track number. Alert on raising levels. Review all.	The `ERROR` logs track errors that can't be fixed. A sudden increase may require immediate action so that this can be remediated.
`CRITICAL`	Immediate response.	A `CRITICAL` log indicates a catastrophic failure in the system. Even one will indicate that the system is not working and can't recover.

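This is just a recommendation, but it sets clear expectations on how to respond. Depending on how your teams work and the expected level of service, you can adapt them to your use case.

Here, the hierarchy is very clear, and there's an acceptance that a certain number of ERROR logs will be generated. Not everything needs to be fixed immediately, but it should be noted and reviewed.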

In real life, ERROR logs will be typically categorized as "we're doomed" or "meh." Development teams should actively either fix or remove "mehs" to reduce them as much as possible. That may include lowering the level of logs if they aren't covering actual errors. You want as few ERROR logs as possible, but all of them need to be meaningful.

Be pragmatic, though. Sometimes, errors can't be fixed straight away and time is best utilized in other tasks. However, teams should reserve time to reduce the number of errors that occur. Failing to do so will compromise the reliability of the system in the medium term.

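WARNING logs are an indication that something may not be working as smoothly as we expected, but there's no need to panic unless the numbers grow. INFO is just there to give us context if there's a problem, but otherwise should be ignored.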

Avoid the temptation to produce an ERROR log when there's a request returning a 400 BAD REQUEST status code. Some developers will argue that if the customer sent a malformed request, it is actually an error. But this isn't something that you should care about if the request has been properly detected and returned. This is business as usual. If this behavior can lead to indicate something else, such as repeated attempts to send incorrect passwords, you can set a WARNING log. There's no point in generating ERROR logs when your system is behaving as expected.

As a rule of thumb, if a request is not returning some sort of 500 error (500, 502, 504, and so on), it should not generate an ERROR log. Remember the categorization of 400 errors as you (customer) have a problem versus 500 errors, which are categorized as I have a problem.

This is not absolute, though. For example, a spike in authentication errors, which are normally 4XX errors, may indicate that users cannot log in due to a real internal problem.
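The rule of thumb can be written down as a small helper that picks the log level from the response status. This is only a sketch: the level_for_status function and its suspicious flag are hypothetical, and deciding what counts as suspicious (for example, repeated incorrect passwords) is left to the application.

```python
import logging


def level_for_status(status_code, suspicious=False):
    """Pick a log level for a finished request (hypothetical helper).

    - 5XX: "I have a problem", so it's an ERROR.
    - 4XX flagged as suspicious by the caller: WARNING.
    - Anything else, including a plain 400: business as usual, INFO.
    """
    if status_code >= 500:
        return logging.ERROR
    if suspicious:
        return logging.WARNING
    return logging.INFO


print(logging.getLevelName(level_for_status(400)))
print(logging.getLevelName(level_for_status(401, suspicious=True)))
print(logging.getLevelName(level_for_status(502)))
```

The resulting value can be passed to logger.log(level, message), so the decision about the level lives in a single place.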

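With these definitions in mind, your development and operations teams will have a shared understanding that will help them take meaningful actions.

Expect to tweak the system and change some of the levels of the logs as your system matures.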

Adding logs while developing

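As we've already seen, properly configuring pytest will make any errors in tests display the captured logs.

This is an opportunity to check that the expected logs are being generated while a feature is in development. Any test that checks error conditions should also add its corresponding logs and check that they are being generated during the development of the feature.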

You can check the logs as part of testing with a tool such as pytest-catchlog ( https://pypi.org/project/pytest-catchlog/) to enforce that the proper logs are being produced.

Typically, though, just taking a bit of care and checking during development that logs are produced is enough for most cases. However, be sure that developers understand why it's useful to have logs while they're developing.
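As a sketch of enforcing logs in tests, the following example uses assertLogs from the standard unittest module, so it runs without extra packages; in pytest, the built-in caplog fixture covers the same ground. The parse_thought function and the logger name are hypothetical and only stand in for real application code.

```python
import logging
import unittest

logger = logging.getLogger('thoughts_example')  # illustrative name


def parse_thought(payload):
    """Hypothetical function: returns the text, logging when it's missing."""
    text = payload.get('text')
    if not text:
        logger.warning('Thought without text received')
        return None
    return text.strip()


class TestParseThought(unittest.TestCase):

    def test_missing_text_is_logged(self):
        # The test fails if no WARNING (or higher) log is produced
        with self.assertLogs('thoughts_example', level='WARNING') as captured:
            result = parse_thought({})

        self.assertIsNone(result)
        self.assertIn('Thought without text received', captured.output[0])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParseThought)
result = unittest.TextTestRunner().run(suite)
```

If a later refactor drops the warning, the test fails, which keeps the log from silently disappearing.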

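During development, DEBUG logs can be used to show extra information about the flow that would be too much for production. This can fill in the gaps between INFO logs and help us develop the habit of adding logs. A DEBUG log can be promoted to INFO if, during tests, it's discovered that it will be useful for tracking problems in production.

Potentially, DEBUG logs can be enabled in production in controlled cases to track some difficult problems, but be aware of the implications of having a large number of logs.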

Be sensible with the information that's presented in INFO logs. In terms of the information that's displayed, avoid sensitive data such as passwords, secret keys, credit card numbers, or personal information. Be equally sensible about the number of logs.
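One possible safety net is a logging filter that masks obvious secrets before they are written. The following is a minimal sketch under stated assumptions: the RedactFilter class and the key=value pattern are illustrative, and a pattern like this can't catch every form of sensitive data, so it complements careful log messages rather than replacing them.

```python
import io
import logging
import re

# Illustrative pattern: key=value pairs with well-known sensitive key names
SENSITIVE = re.compile(r'(password|secret|token)=\S+', re.IGNORECASE)


class RedactFilter(logging.Filter):
    """Mask sensitive values in the final message of each record."""

    def filter(self, record):
        record.msg = SENSITIVE.sub(r'\1=[REDACTED]', record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True       # keep the record, just with the masked text


stream = io.StringIO()
logger = logging.getLogger('redact_example')  # illustrative name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))
logger.addFilter(RedactFilter())

logger.info('Login attempt user=%s password=%s', 'ana', 'hunter2')

logged = stream.getvalue().strip()
print(logged)
```

The filter is attached to the logger, so the value is masked before any handler, including a syslog one, sees the record.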

Keep an eye on any size limitations and how quickly logs are being generated. Growing systems may have a log explosion while new features are being added, more requests are flowing through the system, and new workers are being added.

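Also, double-check that logs are being properly generated and captured, and that they work at all the different levels and environments. All this configuration may take a bit of time, but you need to be very sure that you can capture unexpected errors in production and that all the plumbing is set up properly.

Let's take a look at the other key element of observability: metrics.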

Setting up metrics

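To set up metrics with Prometheus, we need to understand how the process works. Its key component is that each service being measured has its own Prometheus client that keeps track of the metrics. The data in the Prometheus server will be available for a Grafana service that plots the metrics.

The following diagram shows the general architecture:

The Prometheus server pulls information at regular intervals. This method of operation is very lightweight, since registering metrics just updates the local memory of the service, and it scales well. On the other hand, it shows sampled data at certain times and doesn't register each individual event. This has certain implications for storing and representing data and imposes limitations on the resolution of the data, especially for very low rates.

There are lots of available metrics exporters that will expose standard metrics in different systems, such as databases, hardware, HTTP servers, or storage. Check out the Prometheus documentation for more information: https://prometheus.io/docs/instrumenting/exporters/.

This means that each of our services needs to install a Prometheus client and expose its collected metrics in some way. We will use standard clients for Flask and Django.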

Defining metrics for the Thoughts Backend

For Flask applications, we will use the prometheus-flask-exporter package (https://github.com/rycus86/prometheus_flask_exporter), which has been added to requirements.txt.

It gets activated in the app.py file (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/blob/master/Chapter10/microservices/thoughts_backend/ThoughtsBackend/thoughts_backend/app.py#L95) when the application is created.

The metrics object is set up with no app and is then initialized in the create_app function:

from prometheus_flask_exporter import PrometheusMetrics

metrics = PrometheusMetrics(app=None)

def create_app(script=False):
    ...
    # Initialise metrics
    metrics.init_app(application)

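This generates a /metrics endpoint in the service, that is, http://thoughts.example.local/metrics, which returns the data in Prometheus format. The Prometheus format is plain text, as shown in the following screenshot:

The default metrics that prometheus-flask-exporter captures are the request calls based on the endpoint and the method (flask_http_request_total), as well as the time they took (flask_http_request_duration_seconds).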

Adding custom metrics

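We may want to add more specific metrics when it comes to application details. We also added some extra code at the end of the request so that we can store similar information to the metrics that prometheus-flask-exporter allows us to.

In particular, we added this code to the logging_after function (https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/blob/master/Chapter10/microservices/thoughts_backend/ThoughtsBackend/thoughts_backend/app.py#L72) using the lower-level prometheus_client.

This code creates a Counter and a Histogram: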

from prometheus_client import Histogram, Counter

METRIC_REQUESTS = Counter('requests', 'Requests',
                          ['endpoint', 'method', 'status_code'])
METRIC_REQ_TIME = Histogram('req_time', 'Req time in ms',
                            ['endpoint', 'method', 'status_code']) 

def logging_after(response):
    ...
    # Store metrics
    endpoint = request.endpoint
    method = request.method.lower()
    status_code = response.status_code
    METRIC_REQUESTS.labels(endpoint, method, status_code).inc()
    METRIC_REQ_TIME.labels(endpoint, method, status_code).observe(time_in_ms)

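Here, we've created two metrics: a counter called requests and a histogram called req_time. A histogram is the Prometheus implementation of measures, that is, events that have a specific value, such as the request time (in our case).

Histograms store the values in buckets, thereby making it possible for us to calculate quantiles. Quantiles are very useful for determining values such as the 95th percentile for times, that is, the time below which 95% of the requests fall. This is much more useful than the average, since it isn't skewed by outliers.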
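To show how a quantile can be estimated from buckets, here is a simplified sketch that interpolates linearly inside the bucket where the requested rank falls. The bucket boundaries and counts are made-up sample data, and bucket_quantile is an illustrative function written in the spirit of how bucketed quantiles are estimated, not a copy of the Prometheus implementation.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    ending with a (math.inf, total) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # No upper limit to interpolate toward: use the last finite bound
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative request times in seconds: 100 observations in cumulative buckets
BUCKETS = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]

p50 = bucket_quantile(0.5, BUCKETS)
p95 = bucket_quantile(0.95, BUCKETS)
print(p50, p95)
```

The estimate is only as precise as the bucket boundaries, so buckets should be chosen around the values that matter for the service.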

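There's another similar metric called summary. The differences are subtle, but generally, the metric we should use is a histogram. Check out the Prometheus documentation for more details (https://prometheus.io/docs/practices/histograms/).

The metrics are defined in METRIC_REQUESTS and METRIC_REQ_TIME by their name, their description, and the labels they define. Each label is an extra dimension of the metric, so you will be able to filter and aggregate by them. Here, we define the endpoint, the HTTP method, and the resulting HTTP status code.

For each request, the metric is updated. We need to set up the labels and then call .inc() for the counter and .observe(time) for the histogram.

You can find the documentation for the Prometheus client at https://github.com/prometheus/client_python.

We can see the requests and req_time metrics on the metrics page.

Setting up metrics for the Users Backend follows a similar pattern. The Users Backend is a similar Flask application, so we install prometheus-flask-exporter as well, but no custom metrics. You can access these metrics at http://users.example.local/metrics.

The next stage is to set up a Prometheus server so that we can collect the metrics and aggregate them properly.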

Collecting the metrics

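For this, we need to deploy the metrics using Kubernetes. We prepared a YAML file with everything already set up in the Chapter10/kubernetes/prometheus.yaml file.

This YAML file contains a deployment, a ConfigMap that holds the configuration file, a service, and an Ingress. The service and Ingress are pretty standard, so we won't comment on them here.

The ConfigMap allows us to define a file: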

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: example
data:
  prometheus.yaml: |
    scrape_configs:
    - job_name: 'example'

      static_configs:
        - targets: ['thoughts-service', 'users-service', 
                    'frontend-service']

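Note how the prometheus.yaml file is generated after the | symbol. This is a minimal Prometheus configuration that scrapes the thoughts-service, users-service, and frontend-service servers. As we know from the previous chapters, these names access the services and will connect to the pods that are serving the applications. Prometheus will automatically look for the /metrics path on them.

There is a small caveat here. From the point of view of Prometheus, everything behind the service is the same server. If you have more than one pod being served, the requests that Prometheus makes will be load balanced across them, and the metrics won't be correct.

This can be fixed with a more complicated Prometheus setup in which we install the Prometheus operator, but this is out of the scope of this book. However, it is highly recommended for a production system. In essence, it allows us to annotate each of the different deployments so that the Prometheus configuration is changed dynamically. This means that, once this has been set up, we can automatically access all the metrics endpoints exposed by the pods. Prometheus Operator annotations make it very easy for us to add new elements to the metrics system.

Take a look at the following article if you want to learn how to do this: https://sysdig.com/blog/kubernetes-monitoring-prometheus-operator-part3.

The deployment creates a container from the public Prometheus image in prom/prometheus, as shown in the following code: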

spec:
  containers:
  - name: prometheus
    image: prom/prometheus
    volumeMounts:
    - mountPath: /etc/prometheus/prometheus.yml
      subPath: prometheus.yaml
      name: volume-config
    ports:
    - containerPort: 9090
  volumes:
  - name: volume-config
    configMap:
      name: prometheus-config

It also mounts the ConfigMap as a volume, and then as a file in /etc/prometheus/prometheus.yml. This starts the Prometheus server with that configuration. The container opens port 9090, which is the default for Prometheus.

At this point, note how we rely on the stock Prometheus container. This is one of the advantages of using Kubernetes: we can use standard available containers to add features to our cluster with minimal configuration. We don't even have to worry about the operating system or the packaging of the Prometheus container. This simplifies operations and allows us to standardize the tools we use.

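The deployed Prometheus server can be accessed at http://prometheus.example.local/, as described in the Ingress and service.

This displays a graphic interface that can be used to plot graphs, as shown in the following screenshot:

The Expression search box will also autocomplete metrics, helping with the discovery process.

The interface also displays other elements from Prometheus that can be interesting, such as the configuration or the statuses of the targets:

The graphs in this interface are usable, but we can set up more complicated and useful dashboards through Grafana. Let's see how this setup works.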

Plotting graphs and dashboards

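The required Kubernetes configuration, grafana.yaml, is available in this book's GitHub repository in the Chapter10/kubernetes/metrics directory. Just like we did with Prometheus, we used a single file to configure Grafana.

We won't show the Ingress and service for the same reason we explained previously. The deployment is simple, but we mount two volumes instead of one, as shown in the following code: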

spec:
  containers:
    - name: grafana
      image: grafana/grafana
      volumeMounts:
        - mountPath: /etc/grafana/provisioning/datasources/prometheus.yaml
          subPath: prometheus.yaml
          name: volume-config
        - mountPath: /etc/grafana/provisioning/dashboards
          name: volume-dashboard
      ports:
        - containerPort: 3000
  volumes:
    - name: volume-config
      configMap:
        name: grafana-config
    - name: volume-dashboard
      configMap:
        name: grafana-dashboard

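The volume-config volume shares a single file that configures Grafana. The volume-dashboard volume adds a dashboard. The latter mounts a directory that contains two files. Both mounts are in the default location that Grafana expects for configuration files.

The volume-config volume sets up the data source where Grafana will receive the data to plot: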

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: example
data:
  prometheus.yaml: |
      apiVersion: 1

      datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-service
        access: proxy
        isDefault: true

The data comes from http://prometheus-service and points to the Prometheus service we configured previously.

volume-dashboard defines two files, dashboard.yaml and dashboard.json:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard
  namespace: example
data:
  dashboard.yaml: |
    apiVersion: 1

    providers:
    - name: 'Example'
      orgId: 1
      folder: ''
      type: file
      editable: true
      options:
        path: /etc/grafana/provisioning/dashboards
  dashboard.json: |-
    <JSON FILE>

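dashboard.yaml is a simple file that points to the directory where Grafana can find the JSON files describing the available dashboards for the system. We point to the same directory so that everything is mounted with a single volume.

dashboard.json is redacted here to save space; check out this book's GitHub repository for the data.

dashboard.json describes a dashboard in JSON format. This file can be generated automatically through the Grafana UI. Adding more .json files will create new dashboards.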

Grafana UI

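By accessing http://grafana.example.local and using your login/password details, that is, admin/admin (the default values), you can access the Grafana UI:

From there, you can check the dashboard, which can be found in the left central column:

This captures the calls to Flask, both in terms of the number of calls and the 95th percentile time. Each individual graph can be edited so that we can see the recipe that produces it:

The icons on the left allow us to change the queries that are running in the system, change the visualization (units, colors, bars or lines, kind of scale to plot, and so on), add general information such as a name, and create alerts.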

The Grafana UI allows us to experiment and so is highly interactive. Take some time to try out the different options and learn how to present the data.

The Query section allows us to add and display metrics from Prometheus. Note the Prometheus logo near default, which is the data source.

Each query has a Metrics section that extracts data from Prometheus.

Querying Prometheus

Prometheus has its own query language called PromQL. The language is very powerful, but it has some peculiarities.

The Grafana UI helps by autocompleting the query, which makes it easy for us to search for metric names. You can experiment directly in the dashboard, but there's a page on Grafana called Explore that allows you to make queries out of any dashboard and has some nice tips, including basic elements. This is denoted by a compass icon in the left sidebar.

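The first thing to keep in mind is how Prometheus metrics behave. Given its sampling approach, most of them are monotonically increasing. This means that plotting a metric directly will show a line that keeps going up.

To get the rate at which the value changes over a period of time, you need to use rate: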

rate(flask_http_request_duration_seconds_count[5m])

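This generates the average number of requests per second over a moving window of 5 minutes. The rate can be further aggregated using sum and by: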

sum(rate(flask_http_request_duration_seconds_count[5m])) by (path)
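To make the idea behind rate more concrete, the following Python sketch computes an average per-second increase from the raw samples of a counter inside a window. It is a simplification under stated assumptions: the real Prometheus function also extrapolates to the window boundaries, and the per_second_rate name is hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over the sampled window.

    samples: list of (timestamp_in_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        raise ValueError("At least two samples are needed")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, a restart),
        # so the new value is taken as the increase since the reset.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

A counter that goes from 100 to 400 over 300 seconds yields one request per second, even though the plotted raw counter would only show a rising line.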

To calculate the times, you can use avg instead. You can also group by more than one label:

avg(rate(flask_http_request_duration_seconds_bucket[5m])) by (method, path)

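You can also plot quantiles, as we do in the graphs. We multiply by 1,000 to get the time in milliseconds instead of seconds, and group by method and path. Here, le is a special label that is created automatically and divides the data into multiple buckets. The histogram_quantile function uses it to calculate the quantiles: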

histogram_quantile(0.95, sum(rate(flask_http_request_duration_seconds_bucket[5m])) by (method, path, le)) * 1000
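The role of the le buckets can be illustrated with a small Python sketch that estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. This is an approximation of the idea, not the Prometheus implementation, and the bucket boundaries in the usage note are made-up values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count); the upper bound is
    the value of the le label and the last bucket is normally +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the (0, 1] range")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Nothing to interpolate against in the +Inf bucket.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
```

For example, histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]) falls inside the bucket between 0.5 and 1.0 seconds. The precision of the estimate depends on how the bucket boundaries are chosen.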

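Metrics can be filtered so that only specific labels are displayed. They can also be combined with different operations, such as division, multiplication, and so on.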

Prometheus queries can be a bit long and complicated when we're trying to display the result of several metrics, such as the percentage of successful requests over the total. Be sure to test that the result is what you expect it to be and allocate time to tweak the queries later.
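One way to test such a query is to compute the expected value independently for a known set of numbers. The following Python sketch calculates the percentage of successful requests from per-status request counts. It assumes that the counts are keyed by an HTTP status code and that only 2xx codes count as successful; both are illustrative choices, as is the success_percentage name.

```python
def success_percentage(requests_by_status):
    """Percentage of requests whose status code is 2xx.

    requests_by_status: mapping of status code (str or int) to a count.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        return None  # No traffic, so the ratio is undefined.
    successful = sum(
        count
        for status, count in requests_by_status.items()
        if 200 <= int(status) < 300  # Assumption: only 2xx is a success.
    )
    return 100.0 * successful / total
```

Comparing this figure with what the dashboard shows for the same period is a quick check that the filter and the division in the query do what was intended.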

If you want to find out more, be sure to check out the Prometheus documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/.

Updating dashboards

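Dashboards can be changed and saved interactively, but in our Kubernetes configuration, we set up the volumes that contain the files as non-persistent. Because of this, restarting Grafana will discard any changes and reapply the configuration defined in volume-dashboard in the Chapter10/kubernetes/metrics/grafana.yaml file.

This is actually a good thing, since we apply the same GitOps principles and store the full configuration in the repository under Git source control.

However, as you can see, the full JSON description of the dashboard contained in the grafana.yaml file is very long, given the number of parameters, and it is difficult to change manually.

The best approach is to change the dashboard interactively and then export it as a JSON file with the Share file button at the top of the menu. Then, the JSON file can be added to the configuration: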

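The Grafana pod can then be redeployed and will contain the saved changes in the dashboard. The Kubernetes configuration can then be updated in Git through the usual process.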
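The step of adding an exported dashboard to the configuration can be scripted. The following Python sketch builds a ConfigMap with the same shape as grafana-dashboard from an exported JSON document. It assumes the ConfigMap is kept as a separate JSON manifest, which differs from the single grafana.yaml file used in this chapter, and the build_dashboard_configmap name is hypothetical.

```python
import json


def build_dashboard_configmap(dashboard_json, provider_yaml,
                              name="grafana-dashboard", namespace="example"):
    """Return a ConfigMap manifest that embeds an exported dashboard.

    dashboard_json: text of the file exported from the Grafana UI.
    provider_yaml: text of the dashboard.yaml provider definition.
    """
    # Fail early if the export is not valid JSON.
    json.loads(dashboard_json)
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            "dashboard.yaml": provider_yaml,
            "dashboard.json": dashboard_json,
        },
    }
```

Writing the result with json.dump produces a manifest that kubectl apply -f accepts, since Kubernetes manifests can be written in JSON as well as YAML.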

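Be sure to explore all the possibilities for dashboards, including the option to set up variables so that you can use the same dashboard to monitor different applications or environments, and the different kinds of visualization tools. See the full Grafana documentation for more information: https://grafana.com/docs/reference/.

Having metrics available allows us to use them to proactively understand the system and anticipate any problems.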

Being proactive

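Metrics show an aggregated point of view of the status of the entire cluster. They allow us to detect trending problems, but it is difficult to find a single spurious error with them.

Don't underestimate them, though. They are critical for successful monitoring because they tell us whether the system is healthy. In some companies, the most critical metrics are prominently displayed on screens on the wall so that the operations team can see them at all times and react quickly.

Finding the proper balance of metrics in a system is not a straightforward task and will require time and trial and error. There are four metrics for online services that are always important, though. These are as follows: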

Latency: How many milliseconds the system takes to respond to a request.

Depending on the typical response times, a different time unit, such as seconds or microseconds, can be used. From my experience, milliseconds is adequate since most of the requests in a web application system should take between 50 ms and 1 second to respond. Here, a system that takes 50 ms is a very performant one and one that takes 1 second is too slow.

Traffic: The number of requests flowing through the system per unit of time, that is, requests per second or per minute.
Errors: The percentage of requests received that return an error.
Saturation: Whether the capacity of the cluster has enough headroom. This includes elements such as hard drive space, memory, and so on. For example, 20% of the RAM is still available.

To measure saturation, remember to install the available exporters that will collect most of the hardware information (memory, hard disk space, and so on) automatically. If you use a cloud provider, normally, they expose their own set of related metrics as well, for example, CloudWatch for AWS.

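These metrics can be found in the Google SRE Book as the four golden signals and are recognized as the most important high-level elements for successful monitoring.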
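As an illustration of how the four signals relate to raw data, the following Python sketch derives them from a list of request records plus a memory reading. The RequestRecord type, the nearest-rank 95th percentile, and the choice of 5xx responses as errors are assumptions made for the example; in the setup of this chapter, these values would come from Prometheus queries instead.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, used_memory, total_memory):
    """Summarize latency, traffic, errors, and saturation for a window."""
    if not requests:
        raise ValueError("No requests in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: ceil(0.95 * n) in integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    # Assumption: only 5xx responses are counted as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "errors_percent": 100.0 * errors / len(requests),
        "saturation_percent": 100.0 * used_memory / total_memory,
    }
```

Keeping the four values side by side, as a dashboard row would, makes it easier to tell whether a latency increase comes with more traffic, more errors, or less headroom.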

Alerting

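When problems arise in the metrics, an automatic alert should be generated. Prometheus includes an alerting system that triggers when a defined metric fulfills a defined condition.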

Check out the Prometheus documentation on alerting for more information: https://prometheus.io/docs/alerting/overview/.
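Prometheus alerting rules combine an expression with an optional for duration, so that an alert only fires when the condition holds continuously for that long. The following Python sketch mimics that behavior over a list of samples. The should_fire name and the threshold comparison are illustrative assumptions; actual rules are written in the Prometheus rule file format, not in Python.

```python
def should_fire(samples, threshold, for_seconds):
    """Return True if the value has stayed above the threshold long enough.

    samples: list of (timestamp_in_seconds, value), oldest first.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            # Remember when the current breach started.
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any recovery resets the pending alert.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds
```

Requiring the condition to hold for a period avoids notifications for short spikes, at the cost of a slower reaction to real problems.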

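Prometheus's Alertmanager can perform certain actions, such as sending emails, to notify us based on rules. This system can be connected to an integrated incident solution such as OpsGenie (https://www.opsgenie.com) in order to generate all kinds of alerts and notifications, such as emails, SMS messages, calls, and so on.

Logs can also be used to create alerts. There are tools, such as Sentry, that allow us to create an entry when an ERROR is raised. This allows us to detect problems and proactively remediate them, even if the health of the cluster hasn't been compromised.

Some commercial tools that handle logs, such as Loggly, allow us to derive metrics from the logs themselves, plotting graphs either based on the kind of log or by extracting values from the logs and using them as data points. While not as complete as a system such as Prometheus, they can monitor some values. They also allow us to send notifications if thresholds are reached.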

The monitoring space is full of products, both free and paid, that can help us to handle this. While it's possible to create a completely in-house monitoring system, being able to analyze whether commercial cloud tools will be of help is crucial. The level of features and their integration with useful tools such as external alerting systems will be difficult to replicate and maintain.

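Alerting is also an ongoing process. Some elements will only be discovered down the line, and new alerts will have to be created. Be sure to invest time so that everything works as expected. Logs and metrics will be used when the system is unhealthy, and in those moments, time is critical. You don't want to be guessing about logs because the host parameter hasn't been configured correctly.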

Being prepared

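In the same way that a backup is not useful unless the recovery process has been tested and is working, be proactive when checking that the monitoring system is producing useful information.

In particular, try to standardize the logs so that there's a good expectation about what information to include and how it's structured. Different systems may produce different logs, but it's good to make all the microservices log in the same format. Double-check that any parameters, such as client references or hosts, are being logged correctly.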

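The same applies to metrics. Having a set of metrics and dashboards that everyone understands will save a lot of time when you're tracking a problem.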
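As a sketch of what a standardized log format could look like, the following Python code uses the standard logging module to give every line the same structure, including the host and a request ID. The format string, the ContextFilter class, and the build_logger helper are assumptions made for illustration and are not taken from the chapter's code.

```python
import logging

# Assumption: one shared format for every microservice.
LOG_FORMAT = ("%(asctime)s %(host)s %(name)s %(levelname)s "
              "[%(request_id)s] %(message)s")


class ContextFilter(logging.Filter):
    """Attach the host and a default request ID to every record."""

    def __init__(self, host):
        super().__init__()
        self.host = host

    def filter(self, record):
        record.host = self.host
        if not hasattr(record, "request_id"):
            record.request_id = "-"  # Logged outside of a request.
        return True


def build_logger(name, host, handler=None):
    """Return a logger that writes lines in the shared format."""
    if handler is None:
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(ContextFilter(host))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

A call such as logger.info("Thought created", extra={"request_id": request_id}) then produces a line whose fields can be checked in a test, which is one way to verify that parameters such as the host are actually being logged.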

Summary

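In this chapter, we learned how to work with logs and metrics, and how to set up logs and send them to a centralized container using the syslog protocol. We described how to add logs to different applications, how to include a request ID, and how to raise custom logs from the different microservices. Then, we learned how to define a strategy to ensure that the logs are useful in production.

We also described how to set up standard and custom Prometheus metrics in all the microservices. We started a Prometheus server and configured it so that it collects metrics from our services. We started a Grafana service so that we can plot the metrics, and we created dashboards that display the status of the cluster and the different services that are running.

Then, we introduced you to the alerting system in Prometheus and how it can be used to notify us of problems. Remember that there are commercial services that can help you with logs, metrics, and alerts. Analyze your options, as they can save you a lot of time and money in terms of maintenance costs.

In the next chapter, we will learn how to manage changes and dependencies that affect several microservices, and how to handle configurations and secrets.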

Questions

What is the observability of a system?
What are the different severity levels that are available in logs?
What are metrics used for?
Why do you need to add a request ID to logs?
What are the available kinds of metrics in Prometheus?
What is the 75th percentile in a metric and how does it differ from the average?
What are the four golden signals?
