
Reading Notes on Hands-On Infrastructure Monitoring with Prometheus (Assessments)

Assessments

Chapter 1, Monitoring Fundamentals

  1. A consensual definition of monitoring is hard to come by because it quickly shifts from industry to industry, and even across job-specific contexts. The diversity of viewpoints, the components comprising the monitoring system, and even how the data is collected or used, are all factors that contribute to the struggle to reach a clear definition.
  2. System administrators are interested in high resolution, low latency, high diversity data. Within this scope, the primary objective of monitoring is to discover problems quickly and identify their root causes as soon as possible.
  3. Low resolution, high latency, and high diversity data.
  4. It depends on how broad you want to make the monitoring definition. Within the scope of this book, logging is not considered monitoring.
  5. The monitoring service's location needs to be propagated to all targets. Staleness is a big drawback of this approach: if a system hasn't reported in for some time, does that mean it's having problems, or was it purposely decommissioned? Furthermore, when you manage a distributed fleet of hosts and services that push data to a central point, the risk of a thundering herd (overload due to many incoming connections at the same time) or a misconfiguration causing an unforeseen flood of data becomes much more complex and time-consuming to mitigate.
  6. The RED method is a very good starting point, opting for rate, errors, and duration metrics.
  7. This is a blackbox approach to monitoring; instead, you should rely on instrumenting the process directly or via an exporter.

Chapter 2, An Overview of the Prometheus Ecosystem

  1. The main components are Prometheus, Alertmanager, Pushgateway, Native Instrumented Applications, Exporters, and Visualization solutions.
  2. Only Prometheus and scrape targets (whether they are natively instrumented or use exporters) are essential for a Prometheus deployment. However, to have alert routing and management, you also need Alertmanager; Pushgateway is only required in very specific use cases, such as batch jobs; while Prometheus does have basic dashboarding functionality built in, Grafana can be added to the stack as the visualization option.
  3. Not all applications are built with Prometheus-compatible instrumentation. Sometimes, no metrics at all are exposed. In these cases, we can rely on exporters.
  4. The information should be quickly gathered and exposed in a synchronous operation.
  5. Alerts will be sent from both sides of the partition, if possible.
  6. The quickest option would be to use the webhook integration.
  7. The Prometheus server comes with an expression browser and consoles.

Chapter 3, Setting Up a Test Environment

  1. While the Prometheus stack can be deployed on almost every mainstream operating system, and thus will most certainly run in your desktop environment, it is more reproducible to use a Vagrant-based test environment to simulate machine deployments, and minikube to do the same for Kubernetes-based production environments.
  2. The defaults.sh file located in the utils directory allows the software versions to be changed for the virtual machine-based examples.
  3. The default subnet is 192.168.42.0/24 in all virtual machine-based examples.
  4. The steps to get a Prometheus instance up and running are as follows (a shell sketch of these steps appears after this chapter's answers):
    1. Ensure that software versions match the ones recommended.
    2. Clone the code repository provided.
    3. Move into the chapter directory.
    4. Run vagrant up.
    5. When finished, run vagrant destroy -f.
  5. That information is available in the Prometheus web interface under /targets.
  6. Under ./cache/alerting.log.
  7. In any chapter, when you are done with the test environment, just run vagrant destroy -f under that chapter's directory.
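
As a quick reference, the steps above boil down to a short shell session like the following (the repository URL and the chapter directory name are assumptions; use the ones provided with the book):

git clone https://github.com/PacktPublishing/Hands-On-Infrastructure-Monitoring-with-Prometheus.git   # assumed repository URL
cd Hands-On-Infrastructure-Monitoring-with-Prometheus/chapter03                                        # assumed chapter directory name
vagrant up            # provisions the virtual machines and the Prometheus instance
vagrant destroy -f    # tears the environment down when you are done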

Chapter 4, Prometheus Metrics Fundamentals

  1. Time series data can be defined as a sequence of numerical data points collected chronologically from the same source – usually at a fixed interval. As such, this kind of data, when represented in a graphical form, will plot the evolution of the data through time, with the x-axis being time and the y-axis the data value.
  2. A timestamp, a value, and tags/labels.
  3. The write-ahead log (WAL).
  4. The default is 2h and should not be changed.
  5. A float64 value and a timestamp with millisecond precision.
  6. Histograms are especially useful for tracking bucketed latencies and sizes (for example, request durations or response sizes) as they can be freely aggregated across different dimensions. Another great use is to generate heatmaps (the evolution of histograms over time).
    Summaries without quantiles are quite cheap to generate, collect, and store. The main reason for using summary quantiles is when accurate quantile estimation is needed, irrespective of the distribution and range of the observed events.
  7. Cross-sectional aggregation combines multiple time series into one by aggregated dimension; longitudinal aggregation combines samples from a single time series over a time range into a single data point.
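
To make question 7 concrete (http_requests_total is only a placeholder metric), a cross-sectional aggregation collapses a label dimension across matching series at each point in time:

sum without (instance) (http_requests_total)

while a longitudinal aggregation collapses a range of samples from each series into a single data point:

avg_over_time(http_requests_total[5m])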

Chapter 5, Running a Prometheus Server

  1. In that case, scrape_timeout will be set to its default of 10 seconds.
  2. Besides restarting, the configuration file can be reloaded by either sending a SIGHUP signal to the Prometheus process or sending an HTTP POST request to the /-/reload endpoint if --web.enable-lifecycle is used at startup.
  3. Prometheus will look back up to five minutes by default, unless it finds a stale marker, in which case it will immediately consider the series stale.
  4. While relabel_configs is used to rewrite the target list before the scrape is performed, metric_relabel_configs is used to rewrite labels or drop samples after the scrape has occurred (a configuration sketch follows this list).
  5. As we're scraping through a Kubernetes service (which is similar in function to a load balancer), the scrapes will hit only a single instance of the Hey application at a time.
  6. Due to the ephemeral nature of Kubernetes pods, it would be almost impossible to accurately manage the scrape targets using static configurations without additional automation.
  7. The Prometheus Operator leverages Kubernetes Custom Resources and Custom Controllers to declare domain-specific definitions that can be used to automatically manage a Prometheus stack and its scrape jobs.
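
A minimal sketch of the relabel_configs versus metric_relabel_configs distinction (the job name, target, and regexes are purely illustrative):

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node01:9100']
    relabel_configs:              # rewrites the target list before the scrape
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: instance
        replacement: '${1}'
    metric_relabel_configs:       # rewrites or drops samples after the scrape
      - source_labels: [__name__]
        regex: 'node_network_.*'
        action: drop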

Chapter 6, Exporters and Integrations

  1. The textfile collector enables the exposition of custom metrics by watching a directory for files with the .prom extension that contain metrics in the Prometheus exposition format (a sketch appears after this chapter's answers).
  2. Data is collected from the container runtime daemon and from Linux cgroups.
  3. You can restrict which collectors to enable (--collectors), or use the metric whitelist (--metric-whitelist) or blacklist (--metric-blacklist) flags.
  4. When debugging probes, you can append &debug=true to the HTTP GET URL to enable debug information.
  5. We can use mtail or grok_exporter to extract metrics from the application logs.
  6. One possible problem is the lack of high availability, making it a single point of failure. This also impacts scalability, as the only way to scale is vertically or by sharding. Because Prometheus does not scrape each instance directly when Pushgateway is used, the up metric can no longer act as a proxy for health monitoring. Additionally, much like the textfile collector from node_exporter, metrics need to be manually deleted from Pushgateway via its API, or they will be exposed to Prometheus forever.
  7. In this particular case, the textfile collector from Node Exporter can be a valid solution, particularly when the life cycle of the produced metric matches the life cycle of the instance.
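
A minimal sketch of the textfile collector approach from questions 1 and 7, assuming node_exporter was started with --collector.textfile.directory=/var/lib/node_exporter/textfile (the directory and the metric name are illustrative). Writing to a temporary file and renaming it keeps partially written files from being collected:

cat <<EOF > /var/lib/node_exporter/textfile/backup.prom.tmp
# HELP backup_last_success_timestamp_seconds Unix time of the last successful backup.
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds $(date +%s)
EOF
mv /var/lib/node_exporter/textfile/backup.prom.tmp /var/lib/node_exporter/textfile/backup.prom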

Chapter 7, Prometheus Query Language - PromQL

  1. The comparison operators are < (less than), > (greater than), == (equal), != (not equal), >= (greater than or equal to), and <= (less than or equal to).
  2. When the time series you want to enrich are on the right-hand side of the PromQL expression.
  3. topk already sorts its results.
  4. While the rate() function provides the per-second average rate of change over the specified interval by using the first and last values in the range scaled to fit the range window, the irate() function uses the last two values in the range for the calculation, which produces the instant rate of change.
  5. Metrics of type info have their names ending in _info and are regular gauges with one possible value, 1. This special kind of metric was designed to be a place where labels whose values might change over time are stored, such as versions (for example, exporter version, language version, and kernel version), assigned roles, or VM metadata information.
  6. The rate function expects a counter, but a sum of counters is actually a gauge, as it can go down when one of the counters resets. This would translate into seemingly random spikes when graphed, because rate would treat any decrease as a counter reset, while the total contributed by the other counters would be seen as a huge delta between zero and the current value (the correct ordering is sketched at the end of this chapter's answers).
  7. When a CPU core is being used 100%, it uses 1 CPU second. Conversely, when it's idle, it will use 0 CPU seconds. This makes it easy to calculate the percentage of usage, as we can utilize the CPU seconds directly. A virtual machine might have more than one core, which means that it might use more than 1 CPU second per second. The following expression calculates how many CPU seconds per second each core was idling in the last 5 minutes:
rate(node_cpu_seconds_total{job="node",mode="idle"}[5m])

A simple way to calculate the average idle CPU seconds per second over the last five minutes is to average the values for each core:

avg without (cpu, mode) (rate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))

Since the CPU seconds used plus the CPU seconds idle should add up to 1 CPU second per second for each core, we can obtain the CPU usage as follows:

avg without (cpu, mode) (1 - rate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))

To get a percentage, we just multiply by 100:

avg without (cpu, mode) (1 - rate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100
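
Returning to question 6: applying rate before sum handles each counter's resets individually (http_requests_total and the instance label are placeholders here):

sum without (instance) (rate(http_requests_total[5m]))

Summing first and rating afterwards would instead turn every individual counter reset into a large artificial spike.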

Chapter 8, Troubleshooting and Validation

  1. Prometheus is distributed with promtool, which, among other functions, can check a configuration file for issues:
promtool check config /etc/prometheus/prometheus.yml
  2. The promtool utility can also read metrics in the Prometheus exposition format from stdin and validate them according to the current Prometheus standards:
curl -s http://prometheus:9090/metrics | promtool check metrics
  3. The promtool utility can be used to run instant queries against a Prometheus instance:
promtool query instant 'http://prometheus:9090' 'up == 1'
  4. You can use promtool to find every label value for a given label name. One example is the following:
promtool query labels 'http://prometheus:9090' 'mountpoint'
  5. By adding --log.level=debug to the start-up parameters.
  6. The /-/healthy endpoint will tell you (or the orchestration system) whether the instance has issues and needs to be redeployed, while the /-/ready endpoint will tell you (or your instance's load balancer) whether it is ready to receive traffic.
  7. While the Prometheus database is unlocked (for example, when no Prometheus is using that directory), you can run the tsdb utility to analyze a specific block of data for metric and label churn:
tsdb analyze /var/lib/prometheus/data 01D486GRJTNYJH1RM0F2F4Q9TR

Chapter 9, Defining Alerting and Recording Rules

  1. This type of rule can be used to take the load off heavy dashboards by pre-computing expensive queries, to aggregate raw data into time series that can then be exported to external systems, and to assist in the creation of compound range vector queries.
  2. For the same reasons as in scrape jobs, queries might produce erroneous results when using series with different sampling rates, and having to keep track of what series have what periodicity becomes unmanageable.
  3. instance_job:latency_seconds_bucket:rate30s needs to have at least the instance and job labels. It was calculated by applying rate to the latency_seconds_bucket_total metric using a 30-second range vector, so the originating expression was probably the following (a full rule definition is sketched after this list):
rate(latency_seconds_bucket_total[30s])
  4. As that label changes its value, so will the identity of the alert.
  5. An alert enters the pending state when it starts triggering (its expression starts returning results) but the for interval hasn't elapsed yet, so it is not yet considered firing.
  6. It would be immediate. When the for clause isn't specified, the alert will be considered firing as soon as its expression produces results.
  7. The promtool utility has a test sub-command that can run unit tests for recording and alerting rules.
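
A minimal rules file combining the recording rule from question 3 with an illustrative alerting rule (the alert name, the job:request_errors:rate5m series, and the threshold are placeholders) might look like this; it is also the kind of file that unit tests run with promtool test rules would reference:

groups:
  - name: example-rules
    rules:
      - record: instance_job:latency_seconds_bucket:rate30s
        expr: rate(latency_seconds_bucket_total[30s])
      - alert: HighErrorRate                      # illustrative alert
        expr: job:request_errors:rate5m > 0.05    # placeholder series and threshold
        for: 5m                                   # stays pending for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: Error rate above 5% for more than 5 minutes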

Chapter 10, Discovering and Creating Grafana Dashboards

  1. Grafana supports automatic provisioning of data sources by reading YAML definitions from a provisioning path at startup (a sketch follows this list).
  2. Steps to import a dashboard from the Grafana gallery are as follows:
    1. Choose a dashboard ID from the grafana.com gallery.
    2. In the target Grafana instance, click on the plus sign in the main menu on the left-hand side and select Import from the sub-menu.
    3. Paste the chosen ID in the appropriate text field.
  3. Variables allow a dashboard to define placeholders that can be used in expressions and title strings; those placeholders can be filled with values from either a static or a dynamic list, usually presented to the dashboard user as a drop-down menu. Whenever the selected value changes, Grafana automatically updates the queries in panels and the title strings that use that variable.
  4. In Grafana, the building block is the panel.
  5. No, it does not. The dashboard ID will remain the same, but the iteration will be incremented.
  6. Consoles are custom dashboards that are served directly from a Prometheus instance.
  7. They are generated from console templates, which are written in raw HTML/CSS/JavaScript and leverage the power of the Go templating language, making them endlessly customizable. Since the templating runs inside Prometheus, it can access the TSDB directly instead of going through the HTTP API, which makes console generation amazingly quick.
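
A minimal data source provisioning file for question 1, placed under Grafana's provisioning/datasources directory (the data source name and URL are illustrative):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy               # Grafana proxies queries to the data source
    url: http://prometheus:9090
    isDefault: true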

Chapter 11, Understanding and Extending Alertmanager

  1. In the case of a network partition, each side of the partition will send notifications for the alerts they are aware of: in a clustering failure scenario, it's better to receive duplicate notifications for an issue than to not get any at all.
  2. Setting continue to true on a route makes the matching process keep going through the routing tree until the next match, thereby allowing multiple receivers to be triggered (see the configuration sketch after this list).
  3. The group_interval configuration defines how long to wait for additional alerts in a given alert group (defined by group_by) before sending an updated notification when a new alert is received; repeat_interval defines how long to wait until resending notifications for a given alert group when there are no changes.
  4. The top-level route, also known as the catch-all or fallback route, will trigger a default receiver when incoming alerts aren't matched in other sub-routes.
  5. The webhook integration allows Alertmanager to issue an HTTP POST request with the JSON payload of the notification to a configurable endpoint. This allows you to run a bridge that can convert notifications from Alertmanager to your chosen notification provider's format, and then forward them to it.
  6. The CommonLabels field is populated with the labels that are common to all alerts in the notification. The CommonAnnotations field does exactly the same, but for annotations.
  7. A good approach is to use a deadman's switch alert: create an alert that is guaranteed to always be firing, and then configure Alertmanager to route that alert to a (hopefully) external system that will be responsible for letting you know whether it ever stops receiving notifications.
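
A condensed Alertmanager configuration tying together the routing tree (questions 2, 3, and 4), the webhook integration (question 5), and a deadman's switch route (question 7); receiver names, URLs, intervals, and the Watchdog alert name are illustrative:

route:                            # top-level catch-all route with a default receiver
  receiver: team-webhook
  group_by: ['alertname', 'job']
  group_interval: 5m              # wait before sending an updated notification for a group
  repeat_interval: 4h             # wait before resending an unchanged notification
  routes:
    - match:
        severity: critical
      receiver: pager
      continue: true              # keep traversing the tree so other matching routes also notify
    - match:
        alertname: Watchdog       # deadman's switch alert, designed to always fire
      receiver: deadmans-switch
      repeat_interval: 1m
receivers:
  - name: team-webhook
    webhook_configs:
      - url: http://alert-bridge.example.com/notify
  - name: pager
    webhook_configs:
      - url: http://pager-bridge.example.com/notify
  - name: deadmans-switch
    webhook_configs:
      - url: http://deadman.example.com/ping

On the Prometheus side, the deadman's switch alert itself can simply use an expression that always returns a result, such as vector(1).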

Chapter 12, Choosing the Right Service Discovery

  1. Managing scrape targets in a highly dynamic environment becomes an arduous task without automatic discovery.
  2. Having a set of access credentials with sufficient permissions to list all the required resources through its API.
  3. It supports A, AAAA, and SRV DNS records.
  4. Due to the large number of API objects available to query, the Kubernetes discovery configuration for Prometheus has the concept of role, which can be either node, service, pod, endpoints, or ingress. Each makes its corresponding set of objects available for target discovery.
  5. The best mechanism for implementing a custom service discovery is to use the file-based discovery integration to inject targets into Prometheus (a sketch follows this list).
  6. No. Prometheus will try to use filesystem watches to automatically detect when there are changes and then reload the target list, and will fall back to re-reading target files on a schedule if watches aren't available.
  7. It's recommended to use the adapter code available in the Prometheus code repository, as it abstracts much of the boilerplate needed to implement a discovery mechanism. Additionally, if you intend to contribute your custom service discovery to the project, the adapter makes it easy to incorporate the service discovery code into the main Prometheus binary, were it to gain traction and community support.
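
A minimal file-based discovery setup for questions 5 and 6: the scrape job points at a watched path, and the custom discoverer writes the target files (paths, targets, and labels are illustrative):

scrape_configs:
  - job_name: 'custom-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # watched for changes, re-read periodically as a fallback

A target file such as /etc/prometheus/targets/example.yml would then contain:

- targets: ['app01:8080', 'app02:8080']
  labels:
    environment: production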

Chapter 13, Scaling and Federating Prometheus

  1. You should consider sharding when you're sure a single instance isn't enough to handle the load, and you can't run it with more resources.
  2. Vertical sharding is used to split scrape workload according to responsibility (for example, by function or team), where each Prometheus shard scrapes different jobs. Horizontal sharding splits loads from a single scrape job into multiple Prometheus instances.
  3. To reduce the ingestion load on a Prometheus instance, you should consider dropping unnecessary metrics through the use of metric_relabel_configs rules, or by increasing the scrape interval so that fewer samples are ingested in total.
  4. Instance-level Prometheus servers should federate job-level aggregate metrics, and job-level Prometheus servers should federate datacenter-level aggregate metrics (a federation scrape job is sketched after this list).
  5. You might need to use metrics that are only available in other Prometheus instances in recording and alerting rules.
  6. The protocol used is gRPC.
  7. You will lose the ability to use the Thanos deduplication feature.
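
A federation scrape job on the higher-level Prometheus server, as mentioned in question 4, could be sketched as follows (the source server address and the match[] selector are illustrative):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true              # keep the job/instance labels from the source server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'    # only pull pre-aggregated job-level series
    static_configs:
      - targets: ['dc-prometheus:9090']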

Chapter 14, Integrating Long-Term Storage with Prometheus

  1. The main advantages of basing the remote write feature on the WAL are that it makes streaming of metrics possible, has a much smaller memory footprint, and is more resilient to crashes.
  2. You can request Prometheus to produce a snapshot of the TSDB by using the /api/v1/admin/tsdb/snapshot API endpoint (only available when the --web.enable-admin-api flag is enabled), and then back up the snapshot.
  3. You can delete time series from the TSDB by using the /api/v1/admin/tsdb/delete_series API endpoint and then using the /api/v1/admin/tsdb/clean_tombstones endpoint to make Prometheus clean up the deleted series (these endpoints are only available when the --web.enable-admin-api flag is enabled). Example invocations of both follow this list.
  4. Object storage usually provides 99.999999999% durability and 99.99% availability service-level agreements, and it’s quite cheap in comparison to block storage.
  5. Yes. For example, keeping the raw data is useful for zooming into short time ranges in the past.
  6. Thanos store provides an API gateway between Thanos Querier and object storage.
  7. Data in object storage can be inspected using the thanos bucket sub-command, which also allows verifying, repairing, listing, and inspecting storage buckets.
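
For reference, the snapshot and deletion endpoints from questions 2 and 3 can be exercised with curl as follows (the Prometheus address and the match[] selector are illustrative, and the server must be running with --web.enable-admin-api):

curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
curl -XPOST -g 'http://prometheus:9090/api/v1/admin/tsdb/delete_series?match[]={job="old-job"}'
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones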