Managing Clusters

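In the Elasticsearch ecosystem, it is important to monitor nodes and clusters in order to manage and improve their performance and state. Several issues can arise at the cluster level, such as the following: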

Node overheads: Some nodes can have too many shards allocated and become a bottleneck for the entire cluster.
Node shutdown: This can happen due to a number of reasons, for example, full disks, hardware failures, and power problems.
Shard relocation problems or corruptions: Some shards can't get an online status.
Shards that are too large: If a shard is too big, then the index performance decreases due to the merging of massive Lucene segments.
Empty indices and shards: These waste memory and resources; because each shard has a lot of active threads, a large number of unused indices and shards degrades the general cluster performance.

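Malfunctions or poor performance at the cluster level can be detected through an API or through frontends (as we will see in Chapter 11, User Interfaces). These frontends give users a working web dashboard on top of their Elasticsearch data; they work by monitoring the cluster health, backing up or restoring data, and allowing queries to be tested before they are implemented in code.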

In this chapter, we will explore the following topics:

Using the health API to check the health of the cluster
Using the task API that controls jobs at a cluster level
Using hot threads to check inside nodes for problems due to a high CPU usage
Learning how to monitor Lucene segments so as not to reduce the performance of a node due to there being too many of them

In this chapter, we will cover the following recipes:

Controlling the cluster health using an API
Controlling the cluster state using an API
Getting cluster node information using an API
Getting node statistics using an API
Using the task management API
Using the hot threads API
Managing the shard allocation
Monitoring segments with the segment API
Cleaning the cache

Controlling the cluster health using an API

Elasticsearch provides a convenient way to manage the cluster state, which is one of the first things to check if any problems occur.

Getting ready

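You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can use the Kibana console, as it provides code completion and better character escaping for Elasticsearch.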

How to do it...

In order to control the cluster health, we will perform the following steps:

In order to view the cluster health, the HTTP method that we use is GET:

GET /_cluster/health

The result will be as follows:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 17,
  "active_shards" : 17,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 53.125
}

How it works...

Every Elasticsearch node keeps the cluster's status. The status value of the cluster can be one of three types, as follows:

green: This means that everything is okay.
yellow: This means that some nodes or shards are missing, but they don't compromise the cluster's functionality. For instance, some replicas could be missing (either a node is down or there are insufficient nodes for replicas), but there is at least one copy of each active shard; additionally, read and write functions are working. The yellow state is very common during the development stage when users typically start a single Elasticsearch server.
red: This indicates that some primary shards are missing and these indices are in the red status. You cannot write to indices that are in the red status and, additionally, the results may not be complete, or only partial results may be returned. Usually, you'll need to restart the node that is down and possibly create some replicas.

The yellow or red states could be transient if some nodes are in recovery mode. In this case, just wait until the recovery completes.

The cluster health API contains an enormous amount of information, as follows:

cluster_name: This is the name of the cluster.
timed_out: This is a Boolean value indicating whether the REST API hit the timeout set in the call.
number_of_nodes: This indicates the number of nodes that are in the cluster.
number_of_data_nodes: This indicates the number of nodes that can store data (you can refer to Chapter 2, Managing Mapping, and the Downloading and Setup recipe in order to set up different node types for different types of nodes).
active_primary_shards: This shows the number of active primary shards; the primary shards are the masters of writing operations.
active_shards: This shows the number of active shards; these shards can be used for searches.
relocating_shards: This shows the number of shards that are relocating or migrating from one node to another node – this is mainly due to cluster-node balancing.
initializing_shards: This shows the number of shards that are in the initializing status. The initializing process is done at shard startup. It's a transient state before becoming active and it's composed of several steps; the most important steps are as follows:
- Copy the shard data from a primary one if its transaction log is too old or a new replica is needed
- Check the Lucene indices
- Process the transaction log as needed

unassigned_shards: This shows the number of shards that are not assigned to a node. This is usually due to having set a replica number that is larger than the number of nodes. During startup, shards that are not already initialized or initializing will be counted here.
delayed_unassigned_shards: This shows the number of shards that will be assigned, but their nodes are configured for a delayed assignment. You can find more information about delayed shard assignments at https://www.elastic.co/guide/en/elasticsearch/reference/5.0/delayed-allocation.html.
number_of_pending_tasks: This is the number of pending tasks at the cluster level, such as updates to the cluster state, the creation of indices, and shard relocations. It should rarely be anything other than 0.
number_of_in_flight_fetch: This is the number of cluster updates that must be executed in the shards. As the cluster updates are asynchronous, this number tracks how many updates still have to be executed in the shards.
task_max_waiting_in_queue_millis: This is the maximum amount of time that some cluster tasks have been waiting in the queue. It should rarely be anything other than 0. If the value is different to 0, then it means that there is some kind of cluster saturation of resources or a similar problem.
active_shards_percent_as_number: This is the percentage of active shards that are required by the cluster. In a production environment, it should rarely differ from 100 percent – apart from some relocations and shard initializations.
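As a rough illustration of how these fields relate to one another, the following Python sketch works on an already-parsed health response (a plain dictionary). The function name summarize_health and the rule used to recompute the percentage are assumptions made for illustration; the recomputed value agrees with the sample response shown earlier, but the rule is not taken from the Elasticsearch source.

```python
def summarize_health(health):
    """Return a small summary of a parsed GET /_cluster/health response."""
    active = health["active_shards"]
    pending = health["initializing_shards"] + health["unassigned_shards"]
    total = active + pending
    # Assumed relationship: active shards as a share of all known shards.
    percent = 100.0 * active / total if total else 100.0
    return {
        "status": health["status"],
        "needs_attention": health["status"] != "green" or health["timed_out"],
        "shards_not_active": pending,
        "active_percent": percent,
    }


# Values copied from the sample response of this recipe.
sample = {
    "status": "yellow",
    "timed_out": False,
    "active_shards": 17,
    "initializing_shards": 0,
    "unassigned_shards": 15,
}
summary = summarize_health(sample)
print(summary)
```

With the sample values, 17 active shards out of 32 known shards gives the same 53.125 reported in active_shards_percent_as_number.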

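Installed plugins play an important role in shard initialization; for example, if you use a mapping type that is provided by a native plugin and you remove the plugin (or if the plugin cannot be initialized due to API changes), then the shard initialization will fail. These issues are easily detected by reading the Elasticsearch log file.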

When upgrading your cluster to a new Elasticsearch release, make sure that you upgrade your mapping plugins or, at the very least, check that they work with the new Elasticsearch release. If you don't do this, you risk your shards failing to initialize and giving a red status to your cluster.

There's more...

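This API call is very useful; it can be executed against one or more indices to obtain their health in the cluster. This approach makes it possible to isolate the indices that have problems; the API call to execute this is as follows: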

GET /_cluster/health/index1,index2,indexN

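The previous call also accepts additional request parameters to control the health of the cluster. These additional parameters are as follows: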

level: This controls the level of the health information that is returned. This parameter accepts only cluster, indices, and shards.
timeout: This is the wait time for a wait_for_* parameter (the default is 30s).
wait_for_status: This allows the server to wait for the provided status (green, yellow, or red) until timeout.
wait_for_no_relocating_shards: This Boolean makes the server wait until the cluster has no relocating shards, or until the timeout period has been reached (the default is false). It replaces the wait_for_relocating_shards parameter of older releases.
wait_for_nodes: This waits until the defined number of nodes are available in the cluster. The value for this parameter can also be an expression, such as >N, >=N, <N, <=N, ge(N), gt(N), le(N), and lt(N).
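To make the combination of index names and request parameters concrete, here is a minimal Python sketch that only builds the request path; it does not contact a cluster. The helper name health_path is hypothetical, and the parameter values in the example call (yellow, 50s) are arbitrary.

```python
from urllib.parse import quote, urlencode


def health_path(indices=(), **params):
    """Build the path and query string for a cluster health request."""
    path = "/_cluster/health"
    if indices:
        # Index names are joined with commas, as in the call shown above.
        path += "/" + ",".join(quote(name, safe="") for name in indices)
    if params:
        # Each keyword argument becomes one query parameter.
        path += "?" + urlencode(params)
    return path


print(health_path(["index1", "index2"], wait_for_status="yellow", timeout="50s"))
```

The resulting string can be appended to the server address and sent with any HTTP client using the GET method.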

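If the number of pending tasks is not zero, then it is good practice to investigate what these pending tasks are. They can be shown using the following API URL: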

GET /_cluster/pending_tasks

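The return value is a list of pending tasks; note that Elasticsearch applies cluster changes very quickly, so many of these tasks live for only a few milliseconds and may already have been applied by the time the list is shown to you.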

Controlling the cluster state using an API

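The previous recipe returns information only about the health of the cluster. If you need more details about your cluster, then you need to query its state.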

Getting ready

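You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can use the Kibana console, as it provides code completion and better character escaping for Elasticsearch.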

How to do it...

To check the cluster state, we will perform the following steps:

In order to view the cluster state, the HTTP method that you can use is GET, and the request is as follows:

GET /_cluster/state

The result will contain the following data sections.
The general cluster information is as follows:

{
  "cluster_name" : "elastic-cookbook",
  "compressed_size_in_bytes" : 4714,
  "cluster_uuid" : "02UhFNltQXOqtz1JH6ec8w",
  "version" : 9,
  "state_uuid" : "LZcYMc3PRdKSJ9MMAyM-ew",
  "master_node" : "-IFjP29_TOGQF-1axtNMSg",
  "blocks" : { },

The node address information is as follows:

 "nodes" : {
    "-IFjP29_TOGQF-1axtNMSg" : {
      "name" : "5863a2552d84",
      "ephemeral_id" : "o6xo1mowRIGVZ7ZfXkClww",
      "transport_address" : "172.18.0.2:9300",
      "attributes" : {
        "xpack.installed" : "true"
      }
    }
  },

The cluster metadata information (such as templates, indices with mappings, and the aliases) is as follows:

"metadata" : {
    "cluster_uuid" : "02UhFNltQXOqtz1JH6ec8w",
    "cluster_coordination" : {
      "term" : 0,
      "last_committed_config" : [ ],
      "last_accepted_config" : [ ],
      "voting_config_exclusions" : [ ]
    },
    "templates" : {
      "kibana_index_template:.kibana" : {
        "index_patterns" : [
          ".kibana"
        ],
        "order" : 0,
        "settings" : {
          "index" : {
            "number_of_shards" : "1",
            "auto_expand_replicas" : "0-1"
          }
        },
        "mappings" : {
          "doc" : {
            "dynamic" : "strict",
            "properties" : {
              "server" : {
                "properties" : {
                  "uuid" : {
                    "type" : "keyword"
                  }
                }
              },
              ... truncated ...
    },
    "index-graveyard" : {
      "tombstones" : [ ]
    }
  },

The routing table, which is used to find the shards, is as follows:

  "routing_table" : {
    "indices" : {
      ".kibana_1" : {
        "shards" : {
          "0" : [
            {
              "state" : "STARTED",
              "primary" : true,
              "node" : "-IFjP29_TOGQF-1axtNMSg",
              "relocating_node" : null,
              "shard" : 0,
              "index" : ".kibana_1",
              "allocation_id" : {
                "id" : "QjMusIOIRRqOIsL8kEdudQ"
              }
            }
          ]
        }
      }
    }
  },

The routing information organized by node is as follows:

"routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "-IFjP29_TOGQF-1axtNMSg" : [
        {
          "state" : "STARTED",
          "primary" : true,
          "node" : "-IFjP29_TOGQF-1axtNMSg",
          "relocating_node" : null,
          "shard" : 0,
          "index" : ".kibana_1",
          "allocation_id" : {
            "id" : "QjMusIOIRRqOIsL8kEdudQ"
          }
        }
      ]
    }
  }

How it works...

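The cluster state contains information about the whole cluster; it is normal for its output to be very large.

The call output also contains common fields, as follows: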

cluster_name: This is the name of the cluster.
master_node: This is the identifier of the master node. The master node is the primary node that is used for cluster management.
blocks: This section shows the active blocks in a cluster.
nodes: This shows the list of nodes in the cluster. For each node, we have the following information:
- id: This is the hash that is used to identify the node in Elasticsearch (for example, 7NwnFF1JTPOPhOYuP1AVN).
- name: This is the name of the node.
- transport_address: This is the IP address and port used to connect to this node.
- attributes: These are additional node attributes.
metadata: This is the definition of the indices (including their settings and mappings), ingest pipelines, and stored_scripts.
routing_table: These are the indices or shards routing tables, which are used to select primary and secondary shards and their nodes.
routing_nodes: This is the routing for the nodes.

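The metadata section is the most used field because it contains all the information related to the indices and their mappings. It is a convenient way to gather all the index mappings in one go; otherwise, you need to call the get mapping API for every index.

The metadata section is composed of several sections, as follows: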

templates: These are the templates that control the dynamic mapping for your created indices.
indices: These are the indices that exist in the cluster.
ingest: This stores all the ingest pipelines that are defined in the system.
stored_scripts: This stores the scripts, which are usually in the form of language#script_name.

The indices subsection returns a full representation of all the metadata descriptions for every index; it contains the following:

state (open or closed): This describes whether an index is open (that is, it can be searched and can index data) or closed (you can refer to the Opening/closing an index recipe in Chapter 3, Basic Operations).
settings: These are the index settings. The most important ones are as follows:
- index.number_of_replicas: This is the number of replicas of this index; it can be changed using an update index settings call.
- index.number_of_shards: This is the number of shards in this index. This value cannot be changed in an index.
- index.codec: This is the codec that is used to store index data; default is not shown, but the LZ4 algorithm is used. If you want a high compression rate, then use best_compression and the DEFLATE algorithm (this will slow down the writing performances slightly).
- index.version.created: This is the index version.
mappings: These are defined in the index. This section is similar to the get mapping response (you can refer to the Getting a Mapping recipe in Chapter 3, Basic Operations).
aliases: This is a list of index aliases, which allows the aggregation of indices in a single name or the definition of alternative names for an index.

The routing records for indices and shards have similar fields, as follows:

state (UNASSIGNED, INITIALIZING, STARTED, RELOCATING): This shows the state of the shard or the index.
primary (true/false): This shows whether the shard is a primary.
node: This shows the ID of the node.
relocating_node: This field, if set, shows the ID of the node to which the shard is being relocated.
shard: This shows the number of the shard.
index: This shows the name of the index in which the shard is contained.
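The routing records lend themselves to simple checks. The following Python sketch, written under the assumption that the cluster state has already been parsed into a dictionary, walks the routing_table section and lists every shard copy whose state is not STARTED. The function name shards_not_started and the logs index in the sample data are hypothetical.

```python
def shards_not_started(state):
    """List (index, shard, state) for every shard copy that is not STARTED."""
    problems = []
    indices = state.get("routing_table", {}).get("indices", {})
    for index_name, index_routing in indices.items():
        for copies in index_routing.get("shards", {}).values():
            for copy in copies:
                if copy["state"] != "STARTED":
                    problems.append((index_name, copy["shard"], copy["state"]))
    return problems


# A hand-written fragment shaped like the routing_table section of this recipe;
# the "logs" index and its unassigned replica are made up for the example.
state = {
    "routing_table": {
        "indices": {
            ".kibana_1": {"shards": {"0": [
                {"state": "STARTED", "primary": True, "shard": 0,
                 "node": "-IFjP29_TOGQF-1axtNMSg", "relocating_node": None},
            ]}},
            "logs": {"shards": {"0": [
                {"state": "STARTED", "primary": True, "shard": 0,
                 "node": "-IFjP29_TOGQF-1axtNMSg", "relocating_node": None},
                {"state": "UNASSIGNED", "primary": False, "shard": 0,
                 "node": None, "relocating_node": None},
            ]}},
        }
    }
}
not_started = shards_not_started(state)
print(not_started)
```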

There's more...

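The cluster state call returns a lot of information, and it is possible to filter out the different sections via the URL.

The complete URL of the cluster state API is as follows: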

http://{elasticsearch_server}/_cluster/state/{metrics}/{indices}

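The metrics value can be used to return only parts of the response; it consists of a comma-separated list that can include the following values: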

version: This is used to show the version part of the response.
blocks: This is used to show the blocks part of the response.
master_node: This is used to show the master node part of the response.
nodes: This is used to show the node part of the response.
metadata: This is used to show the metadata part of the response.
routing_table: This is used to show the routing_table part of the response.

The indices value is a comma-separated list of the index names to include in the metadata.
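The URL pattern above can be assembled mechanically. This short Python sketch builds only the path part; the helper name cluster_state_path is hypothetical, and the use of _all as a placeholder when indices are given without metrics is an assumption that should be checked against the documentation of your Elasticsearch release.

```python
def cluster_state_path(metrics=(), indices=()):
    """Build a /_cluster/state path restricted to some metrics and indices."""
    path = "/_cluster/state"
    if metrics or indices:
        # Assumption: "_all" stands in for the metrics when only indices are given.
        path += "/" + (",".join(metrics) if metrics else "_all")
    if indices:
        path += "/" + ",".join(indices)
    return path


print(cluster_state_path(["metadata", "routing_table"], ["index1"]))
```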

Getting cluster node information using an API

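The previous recipes return information at the cluster level; Elasticsearch also provides calls to gather information at the node level. In production clusters, it is very important to monitor nodes using this API in order to detect misconfigurations and any problems relating to different plugins and modules.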

Getting ready

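You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can use the Kibana console, as it provides code completion and better character escaping for Elasticsearch.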

How to do it...

To get cluster node information, we will perform the following steps:

To retrieve the node information, the HTTP method you can use is GET, and the requests are as follows:

GET /_nodes

GET /_nodes/<nodeId1>,<nodeId2>

The result will contain a lot of information about the node; it's huge, so the repetitive parts have been truncated:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "elastic-cookbook",
  "nodes" : {
    "-IFjP29_TOGQF-1axtNMSg" : {
      "name" : "5863a2552d84",
      "transport_address" : "172.18.0.2:9300",
      "host" : "172.18.0.2",
      "ip" : "172.18.0.2",
      "version" : "7.0.0",
      "build_flavor" : "default",
      "build_type" : "tar",
      "build_hash" : "a30e8c2",
      "total_indexing_buffer" : 103887667,
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "settings" : {
        "cluster" : {
          "name" : "elastic-cookbook"
        },
        "node" : {
          "attr" : {
            "xpack" : {
              "installed" : "true"
            }
          },
          "name" : "5863a2552d84"
        },
        "path" : {
          "logs" : "/usr/share/elasticsearch/logs",
          "home" : "/usr/share/elasticsearch"
        },
        "discovery" : {
          "type" : "single-node",
          "zen" : {
            "minimum_master_nodes" : "1"
          }
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "type" : {
            "default" : "netty4"
          }
        },
        "transport" : {
          "type" : {
            "default" : "netty4"
          },
          "features" : {
            "x-pack" : "true"
          }
        },
        "xpack" : ... truncated ...
        "network" : {
          "host" : "0.0.0.0"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "name" : "Linux",
        "pretty_name" : "CentOS Linux 7 (Core)",
        "arch" : "amd64",
        "version" : "4.9.125-linuxkit",
        "available_processors" : 4,
        "allocated_processors" : 4
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 1,
        "mlockall" : false
      },
      "jvm" :... truncated ...
      },
      "thread_pool" : {
        "force_merge" : {
          "type" : "fixed",
          "size" : 1,
          "queue_size" : -1
        },
... truncated ...
      },
      "transport" : {
        "bound_address" : [
          "0.0.0.0:9300"
        ],
        "publish_address" : "172.18.0.2:9300",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : [
          "0.0.0.0:9200"
        ],
        "publish_address" : "172.18.0.2:9200",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [
        {
          "name" : "analysis-icu",
          "version" : "7.0.0-alpha2",
          "elasticsearch_version" : "7.0.0",
          "java_version" : "1.8",
          "description" : "The ICU Analysis plugin integrates Lucene ICU module into elasticsearch, adding ICU relates analysis components.",
          "classname" : "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
          "extended_plugins" : [ ],
          "has_native_controller" : false
        },
... truncated ...
      ],
      "ingest" : {
        "processors" : [
          {
            "type" : "append"
          },
... truncated...
      }
    }
  }
}

How it works...

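The cluster node information call provides an overview of the node configuration. It covers a lot of information; the most important sections are as follows: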

host: This is the name of the host.
ip: This is the IP address of the host.
version: This is the Elasticsearch version; it's best practice for all the nodes of a cluster to have the same Elasticsearch version.
roles: This is a list of roles that this node can cover. The developer nodes usually support three roles: master, data, and ingest.
settings: This section contains information about the current cluster and the path of the Elasticsearch node. The most important fields are as follows:
- cluster.name: This is the name of the cluster.
- node.name: This is the name of the node.
- path.*: This is the configured path of this Elasticsearch instance.
- script: This section is useful to check the script configuration of the node.
os: This section provides the operating system (OS) information about the node that is running Elasticsearch, including the processors that are available or allocated, and the OS version.
process: This section contains information about the currently running Elasticsearch process:
- id: This is the PID of the process.
- mlockall: This flag indicates whether the memory of the Elasticsearch process is locked in RAM so that it cannot be swapped out; in production, this should be active.
- max_file_descriptors: This is the max file descriptor number.
jvm: This section contains information about the node's Java Virtual Machine (JVM); this includes the version, vendor, name, PID, and memory (heaps and non-heaps).

It's highly recommended to run all the nodes on the same JVM version and type.

thread_pool: This section contains information about several types of thread pools running in a node.
transport: This section contains information about the transport protocol. The transport protocol is used for intra-cluster communication, or by the native client in order to communicate with a cluster. The response format is similar to the HTTP one, as follows:
- bound_address: If a specific IP is not set in the configuration, then Elasticsearch binds to all the interfaces.
- publish_address: This is the address that is used for publishing the native transport protocol.
http: This section gives information about the HTTP configuration, as follows:
- max_content_length_in_bytes (the default is 104857600, or 100 MB): This is the maximum size of HTTP content that Elasticsearch will allow to be received; HTTP payloads that are bigger than this size are rejected.

The default 100 MB HTTP limit, which can be changed in elasticsearch.yml, can lead to failures with large payloads (often in conjunction with the attachment mapper plugin), so it is important to keep this limit in mind when doing bulk operations or working with attachments.

- publish_address: This is the address that is used to publish the Elasticsearch node.
plugins: This section lists every plugin installed in the node and provides information about the following:
- name: This is the plugin name.
- description: This is the plugin description.
- version: This is the plugin version.
- classname: This is the Java class used to load the plugin.
  All the nodes must have the same plugin version; different plugin versions across the nodes of a cluster can cause unexpected failures.
modules: This section lists every module installed in the node. The structure is the same as the plugin section.
ingest: This section contains the list of active processors in the ingest node.
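The recommendations about running the same Elasticsearch version and the same plugin versions on every node can be checked from a parsed node information response. The Python sketch below is one possible approach; the function name version_groups and the two-node sample (node-a, node-b, and their version numbers) are made up for illustration.

```python
from collections import defaultdict


def version_groups(nodes_info):
    """Group node names by Elasticsearch version and by plugin version."""
    by_version = defaultdict(list)
    by_plugin = defaultdict(lambda: defaultdict(list))
    for node in nodes_info["nodes"].values():
        by_version[node["version"]].append(node["name"])
        for plugin in node.get("plugins", []):
            by_plugin[plugin["name"]][plugin["version"]].append(node["name"])
    # Keep only the plugins that are installed in more than one version.
    mismatched = {name: dict(v) for name, v in by_plugin.items() if len(v) > 1}
    return dict(by_version), mismatched


# Hypothetical two-node response, reduced to the fields used above.
nodes_info = {
    "nodes": {
        "id-a": {"name": "node-a", "version": "7.0.0",
                 "plugins": [{"name": "analysis-icu", "version": "7.0.0"}]},
        "id-b": {"name": "node-b", "version": "7.0.1",
                 "plugins": [{"name": "analysis-icu", "version": "7.0.1"}]},
    }
}
versions, mismatched = version_groups(nodes_info)
print(versions)
print(mismatched)
```

More than one key in the first dictionary, or a non-empty second dictionary, signals a cluster whose nodes are not aligned.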

There's more...

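The API call allows you to filter the sections that must be returned. In this example, we returned all of the sections. Alternatively, we can select one or more of the following sections: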

http
thread_pool
transport
jvm
os
process
plugins
modules
ingest
settings

For example, if you only need the os and plugins information, then the call is as follows:

GET /_nodes/os,plugins

Getting node statistics using an API

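The node statistics API is used to collect real-time metrics about your nodes, such as memory usage, thread usage, the number of indices, and the number of searches.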

Getting ready

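You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.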

How to do it...

To get node statistics, we will perform the following steps:

To retrieve the node statistics, the HTTP method that we will use is GET, and the command is as follows:

GET /_nodes/stats
GET /_nodes/<nodeId1>,<nodeId2>/stats

The result will be a long list of all the node statistics. The most significant parts of the results can be broken up as follows.

First, there is a header describing the cluster name and the nodes section, as follows:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "elastic-cookbook",
  "nodes" : {
    "-IFjP29_TOGQF-1axtNMSg" : {
      "timestamp" : 1545580226575,
      "name" : "5863a2552d84",
      "transport_address" : "172.18.0.2:9300",
      "host" : "172.18.0.2",
      "ip" : "172.18.0.2:9300",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },

The following are the statistics related to indices:

"indices" : {
        "docs" : {
          "count" : 3,
          "deleted" : 0
        },
        "store" : {
          "size_in_bytes" : 12311
        },
        ... truncated...
      },

The following are the statistics related to the operating system:

      "os" : {
        "timestamp" : 1545580226579,
        "cpu" : {
          "percent" : 0,
          "load_average" : {
            "1m" : 0.0,
            "5m" : 0.02,
            "15m" : 0.0
          }
        },
        "mem" : {
          "total_in_bytes" : 2095869952,
          "free_in_bytes" : 87678976,
          "used_in_bytes" : 2008190976,
          "free_percent" : 4,
          "used_percent" : 96
        },
...truncated ...
          "memory" : {
            "control_group" : "/",
            "limit_in_bytes" : "9223372036854771712",
            "usage_in_bytes" : "1360773120"
          }
        }
      },

The following are the statistics related to the current Elasticsearch process:

"process" : {
        "timestamp" : 1545580226580,
        "open_file_descriptors" : 257,
        "max_file_descriptors" : 1048576,
        "cpu" : {
          "percent" : 0,
          "total_in_millis" : 50380
        },
        "mem" : {
          "total_virtual_in_bytes" : 4881367040
        }
      },

The following are the statistics related to the current JVM:

      "jvm" : {
        "timestamp" : 1545580226581,
        "uptime_in_millis" : 3224543,
        "mem" : {
          "heap_used_in_bytes" : 245981600,
          "heap_used_percent" : 23,
          "heap_committed_in_bytes" : 1038876672,
          "heap_max_in_bytes" : 1038876672,
          "non_heap_used_in_bytes" : 109403072,
          "non_heap_committed_in_bytes" : 119635968,
... truncated ...
      },

The following are the statistics related to the thread pools:

      "thread_pool" : {
        "analyze" : {
          "threads" : 0,
          "queue" : 0,
          "active" : 0,
          "rejected" : 0,
          "largest" : 0,
          "completed" : 0
        },
 ... truncated ...
      },

The following are the node filesystem statistics:

      "fs" : {
        "timestamp" : 1545580226582,
        "total" : {
          "total_in_bytes" : 62725623808,
          "free_in_bytes" : 59856470016,
          "available_in_bytes" : 56639754240
        },
        ... truncated ...
      },

The following are the statistics about the communication between nodes:

      "transport" : {
        "server_open" : 0,
        "rx_count" : 0,
        "rx_size_in_bytes" : 0,
        "tx_count" : 0,
        "tx_size_in_bytes" : 0
      },

The following are the statistics related to the HTTP connections:

      "http" : {
        "current_open" : 4,
        "total_opened" : 175
      },

The following are the statistics related to the circuit breakers:

      "breakers" : {
        "request" : {
          "limit_size_in_bytes" : 623326003,
          "limit_size" : "594.4mb",
          "estimated_size_in_bytes" : 0,
          "estimated_size" : "0b",
          "overhead" : 1.0,
          "tripped" : 0
        },
... truncated ...
      },

The following are the statistics related to scripting:

      "script" : {
        "compilations" : 0,
        "cache_evictions" : 0,
        "compilation_limit_triggered" : 0
      },

The following is the cluster state queue:

"discovery" : { },

The following are the ingest statistics:

      "ingest" : {
        "total" : {
          "count" : 0,
          "time_in_millis" : 0,
          "current" : 0,
          "failed" : 0
        },
        "pipelines" : { }
      },

The following are the adaptive_selection statistics:

      "adaptive_selection" : {
        "-IFjP29_TOGQF-1axtNMSg" : {
          "outgoing_searches" : 0,
          "avg_queue_size" : 0,
          "avg_service_time_ns" : 7479391,
          "avg_response_time_ns" : 13218805,
          "rank" : "13.2"
        }
      }

How it works...

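During execution, each Elasticsearch node collects statistics about several aspects of node management; these statistics are accessible via the statistics API call. Later on, we will see an example of a monitoring application that uses this information to provide the real-time status of a node or a cluster.

The main statistics collected by this API are as follows: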

fs: This section contains statistics about the filesystem; this includes the free space that is on devices, the mount points, and reads and writes. It can also be used to remotely control the disk usage of your nodes.
http: This gives the number of current open sockets and their maximum number.
indices: This section contains statistics about several indexing aspects:
- The use of fields and caches.
- Statistics about operations such as get, indexing, flush, merges, refresh, and warmer.
jvm: This section provides statistics about buffers, pools, garbage collectors (this refers to the creation or destruction of objects and their memory management), memory (such as used memory, heaps, and pools), threads, and uptime. You should check to see whether the node is running out of memory.
network: This section provides statistics about transmission control protocol (TCP) traffic, such as open connections, closed connections, and data I/O.
os: This section collects statistics about the OS, such as the following:
- CPU usage
- Node load
- Virtual and swap memory
- Uptime
process: This section contains statistics about the CPU that is used by Elasticsearch, memory, and open file descriptors.
It's very important to monitor the open file descriptors; this is because if you run out of them, then the indices may be corrupted.
thread_pool: This section monitors all the thread pools that are available in Elasticsearch. It's important, in the case of low performance, to check whether there are pools that have an excessive overhead. Some of them can be configured to a new maximum value.
transport: This section contains statistics about the transport layer and, in particular, the bytes that are read and transmitted.
breakers: This section monitors the circuit breakers. This must be checked to see whether it's necessary to optimize resources, queries, or aggregations to prevent the breakers from being tripped.
adaptive_selection: This section contains information about the adaptive node selection that is used for executing searches. Adaptive selection allows a coordinator node to choose the best replica node to execute searches.
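As an example of how these statistics can be turned into simple checks, the Python sketch below computes the heap usage and the share of file descriptors in use for each node of a parsed statistics response. The function name node_pressure is hypothetical; the sample values are copied from the jvm and process sections shown earlier in this recipe.

```python
def node_pressure(stats):
    """Return heap and file-descriptor usage ratios for each node."""
    report = {}
    for node_id, node in stats["nodes"].items():
        mem = node["jvm"]["mem"]
        proc = node["process"]
        report[node_id] = {
            "name": node["name"],
            "heap_ratio": mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"],
            "fd_ratio": proc["open_file_descriptors"] / proc["max_file_descriptors"],
        }
    return report


# Sample values taken from the response sections of this recipe.
stats = {
    "nodes": {
        "-IFjP29_TOGQF-1axtNMSg": {
            "name": "5863a2552d84",
            "jvm": {"mem": {"heap_used_in_bytes": 245981600,
                            "heap_max_in_bytes": 1038876672}},
            "process": {"open_file_descriptors": 257,
                        "max_file_descriptors": 1048576},
        }
    }
}
report = node_pressure(stats)
for node_id, entry in report.items():
    print(node_id, entry["name"],
          f"heap={entry['heap_ratio']:.1%}", f"fds={entry['fd_ratio']:.3%}")
```

Any alerting thresholds applied to these ratios are a matter of local policy and are deliberately left out of the sketch.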

There's more...

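The API response is very large. It is possible to limit it by requesting only the needed parts. To do this, you need to specify the required sections in the API call, choosing from the following: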

fs
http
indices
jvm
network
os
process
thread_pool
transport
breaker
discovery
script
ingest
breakers
adaptive_selection

For example, to request only the os and http statistics, the call is as follows:

GET /_nodes/stats/os,http

Using the task management API

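Elasticsearch 5.x and later versions allow you to define actions that are executed on the server side. These actions can take some time to complete and can use a lot of cluster resources. The most common ones are as follows: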

delete_by_query
update_by_query
reindex

When these actions are called, they create a server-side task that executes the job; the task management API allows you to control these jobs.

Getting ready

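You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.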

How to do it...

To get task information, we will perform the following steps:

Retrieve the task information using the HTTP GET method; the commands are as follows:

GET /_tasks
GET /_tasks?nodes=nodeId1,nodeId2
GET /_tasks?nodes=nodeId1,nodeId2&actions=cluster:*

The result will be as follows:

{
  "nodes" : {
    "-IFjP29_TOGQF-1axtNMSg" : {
      "name" : "5863a2552d84",
      "transport_address" : "172.18.0.2:9300",
      "host" : "172.18.0.2",
      "ip" : "172.18.0.2:9300",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "tasks" : {
        "-IFjP29_TOGQF-1axtNMSg:92797" : {
          "node" : "-IFjP29_TOGQF-1axtNMSg",
          "id" : 92797,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists",
          "start_time_in_millis" : 1545642518460,
          "running_time_in_nanos" : 7937700,
          "cancellable" : false,
          "headers" : { }
        },
        "-IFjP29_TOGQF-1axtNMSg:92798" : {
          "node" : "-IFjP29_TOGQF-1axtNMSg",
          "id" : 92798,
          "type" : "direct",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis" : 1545642518462,
          "running_time_in_nanos" : 5701400,
          "cancellable" : false,
          "parent_task_id" : "-IFjP29_TOGQF-1axtNMSg:92797",
          "headers" : { }
        }
      }
    }
  }
}

How it works...

Every task that is executed in Elasticsearch is available in the task list.

The most important properties of a task are as follows:

node: This defines the node that is executing the task.
id: This defines the unique ID of the task.
action: This is the name of the action; it's generally composed of an action type, the : separator, and the detailed action.
cancellable: This defines whether the task can be canceled; some tasks, such as delete/update by query or reindex, can be canceled; however, other tasks are mainly management tasks and so they cannot be canceled.
parent_task_id: This defines the group of tasks; some tasks can be split and executed in several subtasks. This value can be used to group these tasks by the parent.

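The ID of a task, in the node_id:task_number form used as the key in the tasks section of the response, can be used in the API call to return only that task: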

GET /_tasks/-IFjP29_TOGQF-1axtNMSg:92797

If you need to monitor a group of tasks, they can be filtered by their parent_task_id property via the API call, as follows:

GET /_tasks?parent_task_id=-IFjP29_TOGQF-1axtNMSg:92797
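The same grouping can be done on the client side from a full task list. The following Python sketch, which assumes the response has already been parsed into a dictionary, maps each parent task to its subtasks; the function name group_by_parent is hypothetical, and the sample data is reduced from the response shown in this recipe.

```python
def group_by_parent(tasks_response):
    """Map each parent task ID to the IDs of its child tasks."""
    groups = {}
    for node in tasks_response["nodes"].values():
        for task_id, task in node["tasks"].items():
            parent = task.get("parent_task_id")
            if parent is None:
                # A task without a parent starts its own (possibly empty) group.
                groups.setdefault(task_id, [])
            else:
                groups.setdefault(parent, []).append(task_id)
    return groups


# Reduced from the sample response of this recipe.
tasks_response = {
    "nodes": {
        "-IFjP29_TOGQF-1axtNMSg": {
            "tasks": {
                "-IFjP29_TOGQF-1axtNMSg:92797": {
                    "action": "cluster:monitor/tasks/lists",
                    "cancellable": False,
                },
                "-IFjP29_TOGQF-1axtNMSg:92798": {
                    "action": "cluster:monitor/tasks/lists[n]",
                    "cancellable": False,
                    "parent_task_id": "-IFjP29_TOGQF-1axtNMSg:92797",
                },
            }
        }
    }
}
groups = group_by_parent(tasks_response)
print(groups)
```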

There's more...

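In general, canceling a task can produce some data inconsistencies in Elasticsearch due to the partial update or deletion of documents; however, it can make good sense when reindexing. When you are reindexing a large amount of data, it is common to want to change the mapping or the reindex script partway through. So, in order to avoid wasting time and CPU, canceling the reindex is a sensible solution.

To cancel a task, the API URL is as follows: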

POST /_tasks/task_id:1/_cancel

In the case of a group of tasks, they can be stopped with a single cancel call, using query parameters to select them, as follows:

POST /_tasks/_cancel?nodes=nodeId1,nodeId2&actions=*reindex
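Before canceling, it can be useful to see which tasks a selection would touch. The Python sketch below filters a parsed task list on the client side, keeping only the tasks that are marked as cancellable and whose action matches a wildcard pattern. The function name cancellable_tasks, the node ID nodeA, and the sample tasks are hypothetical, and the shell-style matching done by fnmatchcase is an assumption that may not be identical to the matching performed by the server.

```python
from fnmatch import fnmatchcase


def cancellable_tasks(tasks_response, action_pattern="*"):
    """Return the IDs of cancellable tasks whose action matches the pattern."""
    ids = []
    for node in tasks_response["nodes"].values():
        for task_id, task in node["tasks"].items():
            if task.get("cancellable") and fnmatchcase(task["action"], action_pattern):
                ids.append(task_id)
    return ids


# Hypothetical task list with one reindex task and one management task.
running = {
    "nodes": {
        "nodeA": {
            "tasks": {
                "nodeA:1": {"action": "indices:data/write/reindex",
                            "cancellable": True},
                "nodeA:2": {"action": "cluster:monitor/tasks/lists",
                            "cancellable": False},
            }
        }
    }
}
to_cancel = cancellable_tasks(running, "*reindex")
for task_id in to_cancel:
    # Each path would be sent with the POST method.
    print(f"/_tasks/{task_id}/_cancel")
```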

Using the hot threads API

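Sometimes, your cluster slows down due to high CPU usage and you need to understand why. Elasticsearch provides the ability to monitor hot threads so that you can understand where the problem is.

In Java, hot threads are threads that use a lot of CPU and take a long time to execute.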

Getting ready

You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

How to do it...

To get the hot threads information, we will perform the following steps:

To retrieve the hot threads information, the HTTP method that we use is GET, and the command is as follows:

GET /_nodes/hot_threads
GET /_nodes/{nodesIds}/hot_threads

The result will be as follows:

::: {5863a2552d84}{-IFjP29_TOGQF-1axtNMSg}{o6xo1mowRIGVZ7ZfXkClww}{172.18.0.2}{172.18.0.2:9300}{xpack.installed=true}
   Hot threads at 2018-12-24T09:22:30.481, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   16.1% (80.6ms out of 500ms) cpu usage by thread 'elasticsearch[5863a2552d84][write][T#2]'
     10/10 snapshots sharing following 2 elements
       [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       [email protected]/java.lang.Thread.run(Thread.java:834)
   
    8.7% (43.3ms out of 500ms) cpu usage by thread 'elasticsearch[5863a2552d84][write][T#3]'
     2/10 snapshots sharing following 35 elements
       app//org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:264)
       app//org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:733)
       app//org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:710)
       app//org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:691)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$executeIndexRequestOnPrimary$3(TransportShardBulkAction.java:462)
 ... truncated ...

How it works...

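The hot threads API is quite particular: it returns a text representation of the currently running hot threads, so you can use the stack traces to check what is slowing down each thread.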
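Because the response is plain text rather than JSON, it has to be parsed by hand if you want to process it automatically. The following sketch assumes the header format seen in the preceding output; the regular expression and the parse_hot_threads name are illustrative and may need adjusting for other Elasticsearch versions:

```python
import re

# Header lines look like:
#   16.1% (80.6ms out of 500ms) cpu usage by thread 'name'
HEADER = re.compile(
    r"^\s*(?P<pct>\d+(?:\.\d+)?)% \((?P<busy>\S+) out of (?P<interval>\S+)\) "
    r"(?P<kind>\w+) usage by thread '(?P<name>[^']+)'"
)

# Shortened copy of the output shown earlier (stack frames removed).
sample_output = """\
   Hot threads at 2018-12-24T09:22:30.481, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   16.1% (80.6ms out of 500ms) cpu usage by thread 'elasticsearch[5863a2552d84][write][T#2]'
     10/10 snapshots sharing following 2 elements

    8.7% (43.3ms out of 500ms) cpu usage by thread 'elasticsearch[5863a2552d84][write][T#3]'
     2/10 snapshots sharing following 35 elements
"""


def parse_hot_threads(text):
    """Return (percent, thread_name) pairs for every thread header found."""
    found = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            found.append((float(match.group("pct")), match.group("name")))
    return found


threads = parse_hot_threads(sample_output)
for pct, name in sorted(threads, reverse=True):
    print(f"{pct:5.1f}%  {name}")
```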

To control the returned values, you can provide additional query parameters:

threads: This is the number of hot threads to provide (the default is 3).
interval: This is the interval for the sampling of threads (the default is 500ms).
type: This selects the kind of hot threads to report: threads that consume CPU, threads that are waiting, or threads that are blocked (the default is cpu; the possible values are cpu, wait, and block).
ignore_idle_threads: This is used to filter out any known idle threads (the default is true).
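As a small illustration of how these parameters combine, the following sketch builds a hot threads path with a query string. The hot_threads_path helper is hypothetical and only formats the URL; the parameter names are the ones listed above, and the values are arbitrary examples:

```python
from urllib.parse import urlencode


def hot_threads_path(node_ids=None, **params):
    """Build a hot threads path, optionally scoped to specific nodes."""
    base = "/_nodes"
    if node_ids:
        base += "/" + ",".join(node_ids)
    base += "/hot_threads"
    query = urlencode(params)
    return base + ("?" + query if query else "")


# Ask for the 5 busiest waiting threads, sampled over one second.
print(hot_threads_path(threads=5, interval="1s", type="wait"))
```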

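The hot threads API is an advanced monitoring feature provided by Elasticsearch; it is very handy for debugging slowness in a production cluster because it can be used as a runtime debugger. If your node or cluster has performance problems, the hot threads API is the only call that can help you understand how the CPU is being used.

It is quite common for high computational overhead to be caused by badly used regular expressions or by scripting problems.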

Managing the shard allocation

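During normal Elasticsearch usage, it is not generally necessary to change the shard allocation, because the default settings work well with all standard scenarios. However, sometimes, due to massive relocations, node restarts, or other cluster issues, it is necessary to monitor or define a custom shard allocation.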

Getting ready

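You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, as it provides code completion and better character escaping for Elasticsearch.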

How to do it...

To get information about the current state of unassigned shard allocation, we will perform the following steps:

To retrieve the cluster allocation information, the HTTP method that we use is GET, and the command is as follows:

GET /_cluster/allocation/explain

The result will be as follows:

{
  "index" : "mybooks",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2018-12-24T09:47:23.192Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "-IFjP29_TOGQF-1axtNMSg",
      "node_name" : "5863a2552d84",
      "transport_address" : "172.18.0.2:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[mybooks][0], node[-IFjP29_TOGQF-1axtNMSg], [P], s[STARTED], a[id=4IEkiR-JS7adyFCHN_GGTw]]"
        }
      ]
    }
  ]
}

How it works...

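Elasticsearch allows different shard allocation mechanisms. Sometimes, your shards are not assigned to nodes, and it is useful to investigate why Elasticsearch has not allocated them by querying the cluster allocation explain API.

The call returns a lot of information about the unassigned shard, but the most important part is the decisions (node_allocation_decisions in the response). This is a list of objects that explain why the shard cannot be allocated to a node. In the preceding example, the result says that the shard cannot be allocated to the same node, [-IFjP29_TOGQF-1axtNMSg], on which a copy of it already exists. This is returned because the index requires a replica; however, the cluster consists of only one node, so the replica shard cannot be initialized.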
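On a cluster with many nodes, the list of decisions can be long, so it helps to extract only the deciders that said NO. The following is a minimal sketch that works on a trimmed copy of the preceding response (the explanation text is shortened); blocking_deciders is a hypothetical helper name:

```python
# Trimmed copy of the allocation explain response shown earlier.
explain = {
    "index": "mybooks",
    "shard": 0,
    "primary": False,
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_name": "5863a2552d84",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "the shard cannot be allocated to the same "
                                   "node on which a copy of the shard already exists",
                }
            ],
        }
    ],
}


def blocking_deciders(response):
    """List (node_name, decider, explanation) for every NO decision."""
    blockers = []
    for node in response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blockers.append(
                    (node.get("node_name"), decider["decider"], decider["explanation"])
                )
    return blockers


blockers = blocking_deciders(explain)
for node_name, decider, why in blockers:
    print(f"{node_name}: {decider}: {why}")
```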

There's more...

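The cluster allocation explain API can filter its result to a particular shard; this is very handy if your cluster has a lot of shards. This is done by adding parameters to the body of the GET request that act as filters; these parameters are as follows: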

index: This is the index that the shard belongs to.
shard: This is the number of the shard; shard numbers start from 0.
primary (true or false): This indicates whether the shard to be checked is the primary one or not.

The shard from the preceding example can be selected with a call like the following:

GET /_cluster/allocation/explain
{
  "index": "mybooks",
  "shard": 0,
  "primary": false
}

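To manually relocate shards, Elasticsearch provides the cluster reroute API, which allows the migration of shards between nodes. The following is an example of this API: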

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "test-index",
        "shard": 0,
        "from_node": "node1",
        "to_node": "node2"
      }
    }
  ]
}

In this case, shard 0 of the test-index index is migrated from node1 to node2. If you force a shard migration, the cluster starts moving other shards to rebalance itself.
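If several shards have to be moved, the request body can be generated instead of written by hand. The following sketch only builds the JSON body for POST /_cluster/reroute using the move command from the preceding example; move_command and reroute_body are hypothetical helper names, and nothing is sent to the cluster:

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build one 'move' entry for the 'commands' array."""
    return {
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }


def reroute_body(*commands):
    """Wrap commands in the body expected by POST /_cluster/reroute."""
    return {"commands": list(commands)}


body = reroute_body(move_command("test-index", 0, "node1", "node2"))
print(json.dumps(body, indent=2))
```

Printing the body before sending it is a cheap way to double-check the source and target nodes, since a forced move triggers further rebalancing.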

Monitoring segments with the segment API

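Monitoring the index segments means monitoring the health of an index. The segments API returns information about the number of segments and the data that is stored in them.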

Getting ready

您将需要一个正常运行的 Elasticsearch 安装 - 类似于我们在 the 下载和安装 Elasticsearch 第 1 章中的配方，< /span> 开始。

How to do it...

To get information about the index segments, we will perform the following steps:

To retrieve the index segments, the HTTP method that we use is GET, and the command is as follows:

GET /mybooks/_segments

The result will be as follows:

{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "indices" : {
    "mybooks" : {
      "shards" : {
        "0" : [
          {
            "routing" : {
              "state" : "STARTED",
              "primary" : true,
              "node" : "-IFjP29_TOGQF-1axtNMSg"
            },
            "num_committed_segments" : 1,
            "num_search_segments" : 1,
            "segments" : {
              "_0" : {
                "generation" : 0,
                "num_docs" : 3,
                "deleted_docs" : 0,
                "size_in_bytes" : 5688,
                "memory_in_bytes" : 2137,
                "committed" : true,
                "search" : true,
                "version" : "8.0.0",
                "compound" : true,
                "attributes" : {
                  "Lucene50StoredFieldsFormat.mode" : "BEST_SPEED"
                }
              }
            }
          }, ...truncated...
        ]
      }
    }
  }
}

Elasticsearch has a special _all alias, which stands for all the indices. It can be used in all the APIs that require a list of index names.

How it works...

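The index segments API returns statistics about the segments in an index. This is an important indicator of the health of an index. It returns the following information: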

num_docs: The number of non-deleted documents that are stored in the segment.
deleted_docs: The number of deleted documents in the segment. If this value is high, then a lot of space is wasted on tombstoned documents.
size_in_bytes: The size of the segments in bytes. If this value is too high, the writing speed will be very low.
memory_in_bytes: The memory taken up, in bytes, by the segment.
committed: This indicates whether the segment is committed to the disk.
search: This indicates whether the segment is used for searching. During force merge or index optimization, new segments are created and returned by the API, but they are not available for searching until the end of the optimization.
version: The Lucene version that was used to write the segment.
compound: This indicates whether the segment is stored in a compound file.
attributes: This is a key-value list of attributes about the current segment.

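The most important elements to monitor for segments are deleted_docs and size_in_bytes, because they point to either wasted disk space or a shard that is too large. If the shard is too large (that is, above 10 GB), then the best way to improve write performance is to reindex the index with a larger number of shards.

Large shards also cause relocation problems because of the massive amount of data that has to move between nodes.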

It's impossible to define the perfect size for a shard. In general, a good size for a shard that doesn't need to be frequently updated is between 10 GB and 25 GB.
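To keep an eye on deleted_docs and size_in_bytes, the per-segment values can be rolled up into one summary per shard. The following is a sketch under the assumption that the response has the structure shown in the preceding output; the sample is a trimmed copy of that output, summarize_shards is a hypothetical helper, and only primary copies are counted:

```python
# Trimmed copy of the GET /mybooks/_segments response shown earlier.
segments_response = {
    "indices": {
        "mybooks": {
            "shards": {
                "0": [
                    {
                        "routing": {"state": "STARTED", "primary": True},
                        "segments": {
                            "_0": {
                                "num_docs": 3,
                                "deleted_docs": 0,
                                "size_in_bytes": 5688,
                            }
                        },
                    }
                ]
            }
        }
    }
}


def summarize_shards(response, index):
    """Return per-shard totals, counting primary copies only."""
    summary = {}
    shards = response["indices"][index]["shards"]
    for shard_id, copies in shards.items():
        for copy in copies:
            if not copy["routing"]["primary"]:
                continue
            segs = list(copy["segments"].values())
            docs = sum(s["num_docs"] for s in segs)
            deleted = sum(s["deleted_docs"] for s in segs)
            total = docs + deleted
            summary[shard_id] = {
                "segments": len(segs),
                "size_in_bytes": sum(s["size_in_bytes"] for s in segs),
                # Share of documents in the shard that are tombstones.
                "deleted_ratio": deleted / total if total else 0.0,
            }
    return summary


summary = summarize_shards(segments_response, "mybooks")
print(summary)
```

A shard whose deleted_ratio keeps growing, or whose size_in_bytes approaches the limits discussed above, is a candidate for closer inspection.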

Cleaning the cache

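During its execution, Elasticsearch caches data to speed up searching, such as results, items, and filter results.

Elasticsearch frees memory automatically by following internal indicators, such as the size of the cache as a percentage of memory (for example, 20%). If you want to start some performance tests or free memory manually, you need to call the cache API.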

Getting ready

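You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, as it provides code completion and better character escaping for Elasticsearch.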

How to do it...

To clean the cache, we will perform the following steps:

We call the _cache/clear API on an index as follows:

POST /mybooks/_cache/clear

If everything is okay, then the result that is returned by Elasticsearch will be as follows:

{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  }
}

How it works...

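The cache clean API frees the memory that Elasticsearch uses to cache values for queries and aggregations.

Generally, it is not a good idea to clean the cache, because Elasticsearch manages the cache internally by itself and cleans obsolete values. However, it can be very handy if your node is running out of memory or if you want to force a complete cache cleanup.

If you are running performance tests, you can call the cache clean API before firing a query to get a realistic sample of query execution time that is not boosted by caching.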
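Before starting a benchmark, it is worth confirming that the clear call did not fail on any shard. The following sketch reads the _shards section of the preceding response; shard_report is a hypothetical helper, and the reading of the gap between total and successful as shard copies that were not reached (presumably the unassigned replica on this single-node cluster) is an assumption:

```python
def shard_report(response):
    """Summarize the _shards header of a response."""
    shards = response["_shards"]
    # Copies that neither succeeded nor failed were not reached at all
    # (assumed here to be the unassigned replica).
    not_reached = shards["total"] - shards["successful"] - shards["failed"]
    return {"ok": shards["failed"] == 0, "not_reached": not_reached}


result = {"_shards": {"total": 2, "successful": 1, "failed": 0}}
print(shard_report(result))
```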

Reading notes on "elasticsearch-7-0-cookbook-fourth-edition": Managing Clusters