
Reading notes on Elasticsearch 7.0 Cookbook, Fourth Edition: Basic Operations

Basic Operations

Before we start indexing and searching in Elasticsearch, we need to know how to manage indices and perform operations on documents. In this chapter, we will start by discussing different operations on indices, such as create, delete, update, and open/close. These operations are very important because they allow you to define the container (the index) that will store your documents. The index create/delete operations are similar to the SQL create/delete database commands.

After the index management sections, we will learn how to manage mappings, to complete the discussion we started in the previous chapter and to lay the groundwork for the next chapter, which is mainly centered on search.

A large part of this chapter is dedicated to create-read-update-delete (CRUD) operations on records, which are at the core of record storage and management in Elasticsearch.

To improve indexing performance, it's also important to understand bulk operations and avoid their common pitfalls.

This chapter doesn't cover operations involving queries, as these are the focus of Chapter 4, Exploring Search Capabilities, Chapter 5, Text and Numeric Queries, and Chapter 6, Relations and Geo Queries. Cluster operations will be discussed in Chapter 9, Managing Clusters and Nodes, as they are mostly related to controlling and monitoring the cluster.

In this chapter, we will cover the following recipes:

  • Creating an index
  • Deleting an index
  • Opening or closing an index
  • Putting a mapping in an index
  • Getting a mapping
  • Reindexing an index
  • Refreshing an index
  • Flushing an index
  • ForceMerge an index
  • Shrinking an index
  • Checking if an index exists
  • Managing index settings
  • Using index aliases
  • Rolling over an index
  • Indexing a document
  • Getting a document
  • Deleting a document
  • Updating a document
  • Speeding up atomic operations (bulk operations)
  • Speeding up GET operations (multi GET)

Creating an index

The first operation to perform before starting to index data in Elasticsearch is to create an index—the main container for our data.

An index is similar to the concept of a database in SQL; it is a container for types (tables in SQL) and documents (records in SQL).

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

How to do it...

The HTTP method for creating an index is PUT (although POST also works); the REST URL contains the index name:

http://<server>/<index_name>

To create an index, we will perform the following steps:

  1. From the command line, we can execute a PUT call:
PUT /myindex
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}
  2. The result returned by Elasticsearch should be like the following:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "myindex"
}
  3. If the index already exists, a 400 error will be returned:
{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [myindex/xaXAnnwcTUiTePcKGWJw3Q] already exists",
        "index_uuid": "xaXAnnwcTUiTePcKGWJw3Q",
        "index": "myindex"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [myindex/xaXAnnwcTUiTePcKGWJw3Q] already exists",
    "index_uuid": "xaXAnnwcTUiTePcKGWJw3Q",
    "index": "myindex"
  },
  "status": 400
}

How it works...

During index creation, the replication can be set with two parameters in the settings/index object:

  • The number_of_shards, which controls the number of shards that compose the index (every shard can store up to about 2^31 documents, the Lucene per-shard limit)
  • The number_of_replicas, which controls the number of replications (how many times your data is replicated in the cluster for high availability)
  • A good practice is to set this value to at least 1

The API call initializes a new index, which means the following:

  • The index is created in a primary node first and then its status is propagated to all the nodes at the cluster level
  • A default mapping (empty) is created
  • All the shards required by the index are initialized and ready to accept data

The index creation API allows a mapping to be defined during creation. The parameter required to define it is mappings, which accepts one or more mappings. So, in a single call, it is possible to create an index and put the required mappings.

There are also some limitations on the index name; the only accepted characters are as follows:

  • ASCII letters [a-z]
  • Numbers [0-9]
  • Point ., minus -, &, and _
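As a quick illustration, the character restrictions above can be encoded in a short validation sketch. This is a hypothetical helper, not an Elasticsearch API; the regular expression simply mirrors the character list given here:

```python
import re

# Hypothetical validator for the accepted characters listed above:
# lowercase ASCII letters, digits, '.', '-', '&', and '_'.
VALID_INDEX_NAME = re.compile(r"^[a-z0-9.\-&_]+$")

def is_valid_index_name(name: str) -> bool:
    """Return True if the name uses only the accepted characters."""
    return bool(VALID_INDEX_NAME.match(name))

print(is_valid_index_name("myindex"))       # True
print(is_valid_index_name("MyIndex"))       # False: uppercase is not allowed
print(is_valid_index_name("logs-2019.04"))  # True
```

Elasticsearch itself rejects invalid names server-side with a 400 error, so a client-side check like this is only a convenience.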

There's more...

The create index command also allows the passing of a mappings section, which contains the mapping definitions. It is a shortcut for creating an index with mappings, without executing an extra PUT mapping call.

A common example of this call, using the mapping from the Putting a mapping in an index recipe, is as follows:

PUT /myindex
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword",
        "store": true
      },
      "date": {
        "type": "date",
        "store": false
      },
      "customer_id": {
        "type": "keyword",
        "store": true
      },
      "sent": {
        "type": "boolean"
      },
      "name": {
        "type": "text"
      },
      "quantity": {
        "type": "integer"
      },
      "vat": {
        "type": "double",
        "index": true
      }
    }
  }
}

See also

You can check the following recipes, which are related to this one, for further reference:

  • All the main concepts related to indexing are discussed in the Understanding clusters, replication, and sharding recipe in Chapter 1, Getting Started
  • After creating an index, you generally need to add a mapping, as described in the Putting a mapping in an index recipe in this chapter

Deleting an index

The counterpart of creating an index is deleting one. Deleting an index means deleting its shards, mappings, and data. There are many common scenarios in which we need to delete an index, such as the following:

  • Removing the index to clean unwanted or obsolete data (for example, old Logstash indices).
  • Resetting an index for a scratch restart.
  • Deleting an index that has some missing shards, mainly due to some failures, to bring the cluster back in a valid state. (If a node dies and it's storing a single replica shard of an index, this index will be missing a shard, and so the cluster state becomes red. In this case, you'll bring back the cluster to a green status, but you will lose the data contained in the deleted index.)

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

The index created in the previous recipe is required, as it will be deleted in this one.

How to do it...

The HTTP method used to delete an index is DELETE. The URL contains only the index name:

http://<server>/<index_name>

To delete an index, we will perform the following steps:

  1. Execute a DELETE call by writing the following command:
DELETE /myindex
  2. We then check the result returned by Elasticsearch. If everything is all right, it should be as follows:
{
  "acknowledged" : true
}
  3. If the index doesn't exist, a 404 error is returned:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "index_not_found_exception",
        "reason" : "no such index [myindex]",
        "resource.type" : "index_or_alias",
        "resource.id" : "myindex",
        "index_uuid" : "_na_",
        "index" : "myindex"
      }
    ],
    "type" : "index_not_found_exception",
    "reason" : "no such index [myindex]",
    "resource.type" : "index_or_alias",
    "resource.id" : "myindex",
    "index_uuid" : "_na_",
    "index" : "myindex"
  },
  "status" : 404
}

How it works...

When an index is deleted, all the data related to that index is removed from disk and is lost.

The delete process consists of two steps: first the cluster state is updated, and then the shards are deleted from the storage. This operation is very fast; in a traditional filesystem, it is implemented as a recursive delete.

It is not possible to restore a deleted index if there is no backup.

Also, calling the delete API with the special _all index name can be used to remove all the indices. In production, it is good practice to disable all-indices deletion by adding the following line to elasticsearch.yml:

action.destructive_requires_name: true

See also

The previous recipe, Creating an index, is strictly related to this one.

Opening or closing an index

If you want to keep your data but save resources (memory or CPU), a good alternative to deleting indices is to close them.

Elasticsearch allows you to open and close an index, putting it into online or offline mode.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, the index created in the Creating an index recipe is required.

How to do it...

To open and close an index, we will perform the following steps:

  1. From the command line, we can execute a POST call to close an index using the following command:
POST /myindex/_close
  2. If the call is successful, the result returned by Elasticsearch should be as follows:
{
  "acknowledged" : true
}
  3. To open an index from the command line, type the following command:
POST /myindex/_open
  4. If the call is successful, the result returned by Elasticsearch should be:
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}

How it works...

When an index is closed, there is no overhead on the cluster (except for metadata state): the index shards are shut down and they don't use file descriptors, memory, or threads.

There are many use cases for closing an index:

  • It can disable date-based indices (indices that store their records by date)–for example, when you keep an index for a week, month, or day and you want to keep a fixed number of old indices (that is, 2 months old) online and some offline (that is, from 2 months to 6 months old).
  • When you do searches on all the active indices of a cluster and don't want to search in some indices (in this case, using an alias is the best solution, but you can achieve the same concept with an alias with closed indices).
An alias cannot have the same name as an index.
When an index is closed, calling open restores its state.

See also

In the Using index aliases recipe in this chapter, we will discuss the advanced usage of index references for time-based indices, to simplify the management of open indices.

Putting a mapping in an index

In the previous chapter, we saw how to build a mapping by indexing documents. This recipe shows how to put a type mapping in an index. This kind of operation can be considered the Elasticsearch version of an SQL create table.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, the index created in the Creating an index recipe is required.

How to do it...

The HTTP method for putting a mapping is PUT (POST also works). The URL format for putting a mapping is as follows:

http://<server>/<index_name>/_mapping

To put a mapping in an index, we will perform the following steps:

  1. If we consider a possible order data model to be used as a mapping, the call will be as follows:
PUT /myindex/_mapping
{
  "properties": {
    "id": {
      "type": "keyword",
      "store": true
    },
    "date": {
      "type": "date",
      "store": false
    },
    "customer_id": {
      "type": "keyword",
      "store": true
    },
    "sent": {
      "type": "boolean"
    },
    "name": {
      "type": "text"
    },
    "quantity": {
      "type": "integer"
    },
    "vat": {
      "type": "double",
      "index": false
    }
  }
}
  2. In the case of success, the result returned by Elasticsearch should be like this:
{
  "acknowledged" : true
}

How it works...

This call checks whether the index exists and then creates one or more type mappings, as described in the definition. To learn how to define a mapping description, refer to Chapter 3, Managing Mapping.

During mapping insertion, if there is an existing mapping for this type, it is merged with the new one. If there is a field with a different type and the type cannot be updated by expanding the fields property, an exception is raised. To prevent an exception during the merging-mapping phase, the ignore_conflicts parameter can be set to true (default is false).

The PUT mapping call allows you to set the type for several indices in one shot; that is, listing the indices separated by commas or, to apply to all indices, using the _all alias.

There's more...

There is no delete operation for mappings. It is not possible to delete a single mapping from an index. To remove or change a mapping, you need to manage the following steps:

  1. Create a new index with the new or modified mapping.
  2. Reindex all the records.
  3. Delete the old index with an incorrect mapping.

In Elasticsearch 5.x and later, there is also an operation to speed up this process: the reindex command, which we will see in the Reindexing an index recipe in this chapter.

See also

The Getting a mapping recipe, which is strictly related to this one, allows you to check the exact result of the put mapping command.

Getting a mapping

After having set our mappings for processing types, we sometimes need to check or analyze a mapping to prevent issues. The operation of getting the mapping for a type helps us understand its structure, or its evolution due to merges and implicit type guessing.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, the mapping created in the Putting a mapping in an index recipe is required.

How to do it...

The HTTP method for getting a mapping is GET. The URL formats for getting a mapping are as follows:

  • http://<server>/_mapping
  • http://<server>/<index_name>/_mapping

To get a mapping from an index, we will perform the following steps:

  1. If we consider the mapping for the previous recipe, the call will be as follows:
GET /myindex/_mapping?pretty
The pretty argument in the URL is optional, but very handy for pretty-printing the response output.
  2. The result returned by Elasticsearch should be as follows:
{
  "myindex" : {
    "mappings" : {
      "properties" : {
        "customer_id" : {
          "type" : "keyword",
          "store" : true
        },
        "date" : {
          "type" : "date"
        },
        ... truncated ...,
        "vat" : {
          "type" : "double",
          "index" : false
        }
      }
    }
  }
}

How it works...

The mapping is stored at the cluster level in Elasticsearch. The call checks that the index exists and then returns the stored mapping.

The returned mapping is in a reduced form, which means that the default values for a field are not returned. To reduce network and memory consumption, Elasticsearch returns only non-default values.

Retrieving a mapping is very useful for several purposes:
  • Debugging template-level mapping
  • Checking whether an implicit mapping was derived correctly by guessing fields
  • Retrieving mapping metadata, which can be used to store type-related information
  • Simply checking whether the mapping is correct
If you need to fetch several mappings, it is better to do so at the index or cluster level to reduce the number of API calls.

See also

You can check the following recipes, which are related to this one, for further reference:

  • To insert a mapping in an index, refer to the Putting a mapping in an index recipe in this chapter.
  • To manage dynamic mapping in an index, refer to the Using dynamic templates in document mapping recipe in Chapter 3, Managing Mapping

Reindexing an index

There are many common scenarios that involve changing a mapping. Due to the limitations of Elasticsearch mappings—it is not possible to delete a defined one—you often need to reindex your index data. The most common scenarios are as follows:

  • Changing an analyzer for a mapping
  • Adding a new subfield to a mapping, whereupon you need to reprocess all the records to search for the new subfield
  • Removing an unused mapping
  • Changing a record structure that requires a new mapping

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, the index created in the Creating an index recipe is required.

How to do it...

The HTTP method for reindexing an index is POST. The URL format for this operation is http://<server>/_reindex.

To reindex data between two indices, we will perform the following steps:

  1. If we want to reindex data from myindex to the myindex2 index, the call will be as follows:
POST /_reindex?pretty=true
{
  "source": {
    "index": "myindex"
  },
  "dest": {
    "index": "myindex2"
  }
}
  2. The result returned by Elasticsearch should be as follows:
{
  "took" : 20,
  "timed_out" : false,
  "total" : 0,
  "updated" : 0,
  "created" : 0,
  "deleted" : 0,
  "batches" : 0,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

How it works...

The reindex functionality, introduced in Elasticsearch 5.x, provides an efficient way to reindex documents.

In previous Elasticsearch versions, this functionality had to be implemented at the client level. The advantages of the native Elasticsearch implementation are as follows:

  • Fast copying of data because it is completely managed on the server side.
  • Better management of the operation due to the new task API.
  • Better error-handling support as it is done at the server level. This allows us to manage failovers better during reindex operations.

At the server level, this operation is composed of the following steps:

  1. Initialization of an Elasticsearch task to manage the operation
  2. Creation of the target index and copying the source mappings, if required
  3. Executing a query to collect the documents to be reindexed
  4. Reindexing all the documents using bulk operations until all documents are reindexed

The main parameters that can be provided to this action are as follows:

  • The source section manages how to select source documents. The most important subsections are as follows:
    • index, which is the source index to be used. It can also be a list of indices.
    • query (optional), which is an Elasticsearch query to be used to select parts of the document.
    • sort (optional), which can be used to provide a way of sorting the documents.
  • The dest section manages how to control target written documents. The most important parameters in this section are as follows:
    • index, which is the target index to be used. If it is not available, it is to be created.
    • version_type (optional), if it is set to external, the external version is preserved.
    • routing (optional), which controls the routing in the destination index. It can be any of the following:
      • keep (the default), which preserves the original routing
      • discard, which discards the original routing
      • =<text>, which uses the text value for the routing
    • pipeline (optional), which allows you to define a custom pipeline for ingestion. We will see more about the ingestion pipeline in Chapter 12, Using the Ingest Module.
    • size (optional), the number of documents to be reindexed.
    • script (optional), which allows you to define scripting for document manipulation. This case will be discussed in the Reindex with a custom script recipe in Chapter 8, Scripting in Elasticsearch.

See also

You can check the following recipes, which are related to this one, for further reference:

  • In this chapter, check out the Speeding up atomic operation recipe, which will talk about using the bulk operation to ingest data quickly. The bulk actions are used under the hood by the reindex functionality.
  • To manage task execution, please refer to the Using the task management API recipe in Chapter 9, Managing Clusters.
  • The Reindex with a custom script recipe in Chapter 8, Scripting in Elasticsearch, will show several common scenarios for reindexing documents with a custom script.
  • Chapter 12, Using the Ingest module, will discuss how to use the ingestion pipeline.

Refreshing an index

Elasticsearch allows the user to control the state of the searcher with a forced refresh on an index. If not forced, newly indexed documents will only become searchable after a fixed time interval (usually 1 second).

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index created in the Creating an index recipe.

How to do it...

The HTTP method used for both operations is POST. The URL format for refreshing an index is as follows:

http://<server>/<index_name(s)>/_refresh

The URL format for refreshing all the indices in a cluster is as follows:

http://<server>/_refresh

To refresh an index, we will perform the following steps:

  1. If we consider the type order of the previous chapter, the call will be as follows:
POST /myindex/_refresh
  2. The result returned by Elasticsearch should be as follows:
{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  }
}

How it works...

The near real-time (NRT) capability is automatically managed by Elasticsearch, which refreshes the indices every second if data has changed in them.

To force a refresh before the internal Elasticsearch interval elapses, you can call the refresh API on one or more indices (several indices separated by commas) or on all the indices.

Elasticsearch doesn't refresh the state of the index on every inserted document, to prevent poor performance due to the excessive I/O required in closing and reopening file descriptors.

You must force the refresh to have your latest indexed data available for searching.

Generally, the best time to call a refresh is after indexing a lot of data, to ensure that your records are instantly available for searching. It is also possible to force a refresh during a document indexing operation by adding refresh=true as a query parameter. For example:

POST /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?refresh=true
{
  "id": "1234",
  "date": "2013-06-07T12:14:54",
  "customer_id": "customer1",
  "sent": true,
  "in_stock_items": 0,
  "items": [
    {
      "name": "item1",
      "quantity": 3,
      "vat": 20
    },
    {
      "name": "item2",
      "quantity": 2,
      "vat": 20
    },
    {
      "name": "item3",
      "quantity": 1,
      "vat": 10
    }
  ]
}

See also

Refer to the Flushing an index recipe in this chapter to force the index data to be written to disk, and the ForceMerge an index recipe to optimize an index for searching.

Flushing an index

For performance reasons, Elasticsearch stores some data in memory and in a transaction log. If we want to free the memory, we need to empty the transaction log, and to be sure that our data is safely written on disk, we need to flush the index.

Elasticsearch automatically provides periodic flushes to disk, but forcing a flush can be useful, for example:

  • When we need to shut down a node to prevent stale data
  • To have all the data in a safe state (for example, after a big indexing operation to have all the data flushed and refreshed)

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index created in the Creating an index recipe.

How to do it...

The HTTP method used for this operation is POST. The URL format for flushing an index is as follows:

http://<server>/<index_name(s)>/_flush[?refresh=True]

The URL format for flushing all the indices in a cluster is as follows:

http://<server>/_flush[?refresh=True]

To flush an index, we will perform the following steps:

  1. If we consider the index used in the previous recipes, the call will be as follows:
POST /myindex/_flush
  2. If everything is fine, the result returned by Elasticsearch should be as follows:
{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  }
}

The result contains the shard operation status.

How it works...

To reduce writes, Elasticsearch tries not to add overhead to I/O operations; it caches some data in memory until a flush occurs, so that it can perform a single multi-document write to improve performance.

To clean up the memory and force this data onto disk, the flush operation is required.

In the flush call, an extra request parameter, refresh, can be given to also force a refresh of the index.

Flushing too often affects index performance. Use it wisely!

See also

In this chapter, refer to the Refreshing an index recipe to search for recently indexed data, and the ForceMerge an index recipe to optimize an index for searching.

ForceMerge an index

The Elasticsearch core is based on Lucene, which stores data in segments on disk. During the life of an index, a lot of segments are created and changed. As the number of segments increases, search speed decreases, due to the time required to read all of them. The ForceMerge operation allows us to consolidate an index for faster search performance by reducing segments.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index created in the Creating an index recipe.

How to do it...

The HTTP method used is POST. The URL format for optimizing one or more indices is as follows:

http://<server>/<index_name(s)>/_forcemerge

The URL format for optimizing all the indices in a cluster is as follows:

http://<server>/_forcemerge

To optimize (force-merge) an index, we will perform the following steps:

  1. If we consider the index we created in the Creating an index recipe, the call will be as follows:
POST /myindex/_forcemerge
  2. The result returned by Elasticsearch should be as follows:
{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  }
}

The result contains the shard operation status.

How it works...

Lucene stores your data on disk in several segments. These segments are created when you index a new document or record, and also when you delete documents.

In Elasticsearch, deleted documents are not removed from disk; instead, they are marked as deleted (and called tombstones). To free up space, you need to force-merge to purge the deleted documents.

Due to all these factors, the number of segments can be large. (For this reason, during setup, we increased the number of file descriptors for the Elasticsearch process.)

Internally, Elasticsearch has a merger that tries to reduce the number of segments, but it is designed to improve indexing performance rather than search performance. The forcemerge operation in Lucene tries to reduce the segments in an I/O-heavy way, removing unused ones, purging deleted documents, and rebuilding the index with the minimum number of segments.

The main advantages are as follows:

  • Reducing the number of file descriptors
  • Freeing memory used by the segment readers
  • Improving performance during searches due to less segment management
ForceMerge is a very I/O-heavy operation. The index can be unresponsive during this optimization. It is generally executed on indices that are rarely modified, such as Logstash indices for previous days.

There's more...

You can pass several additional parameters to the ForceMerge call, such as the following:

  • max_num_segments: The default value is autodetect. For full optimization, set this value to 1.
  • only_expunge_deletes: The default value is false. Lucene does not delete documents from segments, but it marks them as deleted. This flag only merges segments that have been deleted.
  • flush: The default value is true. Elasticsearch performs a flush after a ForceMerge.
  • wait_for_merge: The default value is true. This controls whether the request waits until the merge ends.

See also

In this chapter, refer to the Refreshing an index recipe to search for updated index data, and the Flushing an index recipe to force the index data to be written to disk.

Shrinking an index

The latest versions of Elasticsearch provide another way to optimize an index. Using the shrink API, it's possible to reduce the number of shards of an index.

This feature targets several common scenarios:

  • Choosing the wrong number of shards during the initial design sizing. Often, sizing the shards without knowing the correct data or text distribution tends to oversize the number of shards.
  • Reducing the number of shards to reduce memory and resource usage.
  • Reducing the number of shards to speed up searching.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index created in the Creating an index recipe.

How to do it...

The HTTP method used is POST. The URL format for shrinking an index is as follows:

http://<server>/<source_index_name>/_shrink/<target_index_name>

To shrink an index, we will perform the following steps:

  1. We need all the primary shards of the index to be on the same node. We need the name of the node that will contain the shrunken index. We can retrieve it using the _nodes API:
GET /_nodes?pretty

In the result, there will be a section similar to the following:

....
"cluster_name" : "elastic-cookbook",
  "nodes" : {
    "9TiCStQuTDaTyMb4LgWDsg" : {
      "name" : "1e9840cf42df",
      "transport_address" : "172.18.0.2:9300",
      "host" : "172.18.0.2",
      "ip" : "172.18.0.2",
      "version" : "7.0.0",
      "build_flavor" : "default",
      "build_type" : "docker",
      "build_hash" : "f076a79",
      "total_indexing_buffer" : 103795916,

....

The name of my node is 1e9840cf42df.

  2. Now, we can change the index settings, forcing allocation to a single node for our index, and disabling writing for the index. This can be done using the following code:
PUT /myindex/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "1e9840cf42df",
    "index.blocks.write": true
  }
}
  3. We need to check if all the shards are relocated. We can check for their green status:
GET /_cluster/health?pretty

The result will be as follows:

{
  "cluster_name" : "elastic-cookbook",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 2,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 66.66666666666666
}
  4. The index should be in a read-only state to shrink. We need to disable writing for the index using this code snippet:
PUT /myindex/_settings
{"index.blocks.write": true}

  5. If we consider the index we created in the Creating an index recipe, the shrink call for creating the reduced_index will be as follows:
POST /myindex/_shrink/reduced_index
{
  "settings": {
    "index.number_of_replicas": 1,
    "index.number_of_shards": 1,
    "index.codec": "best_compression"
  },
  "aliases": {
    "my_search_indices": {}
  }
}
  6. The result returned by Elasticsearch should be as follows:
{"acknowledged":true}
  7. We can also wait for a yellow status, so that we know when the index is ready to work:
GET /_cluster/health?wait_for_status=yellow
  8. Now, we can remove the read-only setting by changing the index settings:
PUT /myindex/_settings
{"index.blocks.write":false}

How it works...

The shrink API reduces the number of shards by executing the following steps:

  1. Elasticsearch creates a new target index with the same definition as the source index, but with a smaller number of primary shards.
  2. Elasticsearch hard-links (or copies) segments from the source index into the target index.
If the filesystem doesn't support hard links, then all the segments are copied into the new index, which is a much more time-consuming process. Elasticsearch then recovers the target index as though it were a closed index that had just been reopened. On Linux systems, the process is very fast due to hard links.

The prerequisites for executing a shrink are as follows:

  • All the primary shards must be on the same node
  • The target index must not exist
  • The target number of shards must be a factor of the number of shards in the source index
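The last prerequisite—the target shard count must be a factor of the source count—can be sketched as a small helper. This is illustrative only; Elasticsearch performs this validation server-side:

```python
def valid_shrink_targets(source_shards: int) -> list:
    """Return the shard counts that an index with source_shards
    primaries can legally be shrunk to (every divisor of the count)."""
    return [n for n in range(1, source_shards + 1)
            if source_shards % n == 0]

# An index with 8 primary shards can be shrunk to 1, 2, 4, or 8 shards:
print(valid_shrink_targets(8))  # [1, 2, 4, 8]
```

For example, a 15-shard index can be shrunk to 1, 3, or 5 shards, but not to 2; requesting an invalid target makes the shrink call fail.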

There's more...

This Elasticsearch functionality provides support for new scenarios in Elasticsearch usage.

The first scenario is when you overestimate the number of shards. If you don't know your data, it's difficult to choose the correct number of shards to use. So, often, Elasticsearch users tend to oversize the number of shards.

Another interesting scenario is using shrinking to improve indexing speed. The main way to speed up Elasticsearch's write capability for a massive number of documents is to create an index with a lot of shards (in general, the ingestion speed is roughly equal to the number of shards multiplied by the documents per second that a single shard can ingest). The standard allocation moves the shards onto different nodes, so, usually, the more shards you have, the faster the writing speed: thus, to achieve a fast writing speed, you might create an index with 15 or 30 shards. After the indexing phase, the index no longer receives new records (as with time-based indices): the index is only searched, so to speed up searching, you can shrink the number of shards.

See also

In this chapter, refer to the ForceMerge an index recipe to optimize your index for searching.

Checking if an index exists

A common pitfall is to query an index that doesn't exist. To prevent this problem, Elasticsearch gives the user the ability to check whether an index exists.

This check is often used during an application's startup to create the indices that are required for it to work correctly.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index created in the Creating an index recipe.

How to do it...

The HTTP method for checking index existence is HEAD. The URL format for checking an index is as follows:

http://<server>/<index_name>/

To check whether an index exists, we will perform the following steps:

  1. If we consider the index we created in the Creating an index recipe, the call will be as follows:
HEAD /myindex/
  2. If the index exists, an HTTP status code of 200 is returned; if it is missing, a 404 code will be returned.

How it works...

This is a typical HEAD REST call used to check for existence. It doesn't return a body response, only a status code, which is the result status of the operation.

The most common status codes are as follows:

  • 20X family, if everything is okay
  • 404, if the resource is not available
  • 50X family, if there are server errors
Before every action involved in indexing, generally on an application's startup, it's good practice to check if an index exists to prevent future failures.

Managing index settings

Index settings are very important because they allow you to control several key Elasticsearch functionalities, such as sharding and replication, caching, term management, routing, and analysis.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index created in the Creating an index recipe.

How to do it...

To manage index settings, we will perform the following steps:

  1. To retrieve the settings of your current index, use the following URL format: http://<server>/<index_name>/_settings
  2. We are reading information using the REST API, so the method will be GET. An example of a call, using the index we created in the Creating an index recipe, is as follows:
GET /myindex/_settings?pretty=true
  3. The response will be something similar to this:
{
  "myindex" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "require" : {
              "_name" : "1e9840cf42df"
            }
          }
        },
        "number_of_shards" : "1",
        "blocks" : {
          "write" : "true"
        },
        "provided_name" : "myindex",
        "creation_date" : "1554578317870",
        "number_of_replicas" : "1",
        "uuid" : "sDzB7n80SFi8Of99IgLYtA",
        "version" : {
          "created" : "7000099"
        }
      }
    }
  }
}
  4. The response attributes depend on the index settings. In this case, the response shows the number of replicas (1), the number of shards (1), and the index creation version (7000099). The UUID represents the unique ID of the index.
  5. To modify the index settings, we need to use the PUT method. A typical settings change is to increase the replica number:
PUT /myindex/_settings
{"index":{ "number_of_replicas": "2"}}

How it works...

Elasticsearch provides a lot of options to tune index behavior, such as the following:

  • Replica management:
    • index.number_of_replicas: This is the number of replicas each shard has
    • index.auto_expand_replicas: This allows you to define a dynamic number of replicas related to the number of shards
Setting index.auto_expand_replicas to 0-all allows the creation of an index that is replicated on every node. (This is very useful for settings or cluster-propagated data, such as language options or stopwords.)
  • Refresh interval (default: 1s): In the Refreshing an index recipe, we saw how to refresh an index manually. The index.refresh_interval setting controls the rate of automatic refreshing.
  • Write management: Elasticsearch provides several settings to block read or write operations in the index and to change metadata. They live in the index.blocks settings.
  • Shard allocation management: These settings control how the shards must be allocated. They live in the index.routing.allocation.* namespace.

There are other index settings that can be configured for very specific needs. In every new version of Elasticsearch, the community extends these settings to cover new scenarios and requirements.

There's more...

The refresh_interval parameter allows several tricks to optimize indexing speed. It controls the refresh rate, and refreshing reduces index performance due to the opening and closing of files. A good practice is to disable the refresh interval (set it to -1) during a big bulk indexing operation and restore the default behavior afterward. This can be done with the following steps:

  1. Disable the refresh:
PUT /myindex/_settings
{"index":{"refresh_interval": "-1"}}
  2. Bulk-index millions of documents.
  3. Restore the refresh:
PUT /myindex/_settings
{"index":{"refresh_interval": "1s"}}
  4. Optionally, you can optimize the index for search performance:
POST /myindex/_forcemerge

See also

In this chapter, refer to the Refreshing an index recipe to search for updated index data, and the ForceMerge an index recipe to optimize an index for searching.

Using index aliases

Real-world applications have a lot of indices and queries that span several of them. This scenario requires defining all the index names that the queries are based on; aliases allow them to be grouped under a common name.

Some common scenarios of this usage are as follows:

  • Log indices divided by date (that is, log_YYMMDD) for which we want to create an alias for the last week, the last month, today, yesterday, and so on. This pattern is commonly used in log applications such as Logstash (https://www.elastic.co/products/logstash).
  • Collecting website contents in several indices (New York Times, The Guardian, ...) for those we want to be referred to by the index alias sites.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

How to do it...

The URL formats for controlling aliases are as follows:

http://<server>/_aliases
http://<server>/<index>/_alias/<alias_name>

To manage index aliases, we will perform the following steps:

  1. We are reading the aliases and statuses for all indices using the REST API, so the method will be GET. An example of a call is as follows:
GET /_aliases
  2. This gives a response similar to this one:
{
  ".monitoring-es-7-2019.04.06" : {
    "aliases" : { }
  },
  "myindex" : {
    "aliases" : { }
  }
}
  3. Aliases can be changed with add and delete commands.
  4. To read an alias for a single index, we use the _alias endpoint:
GET /myindex/_alias

The result should be as follows:

{
  "myindex" : {
    "aliases" : { }
  }
}
  5. To add an alias, type the following command:
PUT /myindex/_alias/myalias1

The result should be as follows:

{
  "acknowledged" : true
}

This operation adds the myindex index to the myalias1 alias.

  6. To delete an alias, type the following command:
DELETE /myindex/_alias/myalias1

The result should be as follows:

{
  "acknowledged" : true
}

The delete operation removes myindex from the myalias1 alias.

How it works...

Elasticsearch automatically expands the alias during search operations, so the required indices are selected.

The alias metadata is kept in the cluster state. When an alias is added or deleted, all the changes are propagated to all the cluster nodes.

Aliases are mainly functional structures that simplify the management of indices when data is stored in multiple indices.

There's more...

Aliases can also be used to define filter and routing parameters.

Filters are automatically added to the query to filter out data. Routing via an alias allows us to control which shards are hit during searching and indexing.

An example of this call is as follows:

POST /myindex/_aliases/user1alias
{
  "filter": {
    "term": {
      "user": "user_1"
    }
  },
  "search_routing": "1,2",
  "index_routing": "2"
}

In this case, we are adding a new alias, user1alias, to the myindex index, and also adding the following:

  • A filter to select only documents that match a field user with a user_1 term.
  • A list and a routing key to select the shards to be used during a search.
  • A routing key to be used during indexing. The routing value is used to modify the destination shard of the document.
The search_routing parameter allows multi-value routing keys. The index_routing parameter accepts only a single value.
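The difference between the two parameters can be sketched in a few lines. This is a hypothetical parser, not Elasticsearch code; it simply mirrors how the alias definition above splits routing values:

```python
def parse_search_routing(value: str) -> list:
    """search_routing accepts several comma-separated routing keys."""
    return [key.strip() for key in value.split(",")]

def parse_index_routing(value: str) -> str:
    """index_routing must resolve to exactly one routing key."""
    keys = parse_search_routing(value)
    if len(keys) != 1:
        raise ValueError("index_routing accepts a single value only")
    return keys[0]

print(parse_search_routing("1,2"))  # ['1', '2']
print(parse_index_routing("2"))     # '2'
```

A single routing key is required at index time because a document must be written to exactly one shard, while a search can fan out to several shards at once.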

Rolling over an index

When using a system that manages logs, it is common to use rolling files for the log entries. Applying this idea, we can have indices that are similar to rolling files.

We can define some conditions to be checked and leave it to Elasticsearch to roll over to a new index automatically, referring to the indices only via an alias, as if they were a single virtual index.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

How to do it…

To enable a rolling index, we need an index with an alias that points to it alone. For example, to set up a rolling index for logs, we will follow these steps:

  1. We need an index with a logs_write alias that points to it alone:
PUT /mylogs-000001
{
  "aliases": {
    "logs_write": {}
  }
}

The result will be an acknowledgement, as follows:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "mylogs-000001"
}
  2. We can add the rolling to the logs_write alias in this way:
POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 100000
  },
  "settings": {
    "index.number_of_shards": 3
  }
}

The result will be as follows:

{
  "acknowledged" : false,
  "shards_acknowledged" : false,
  "old_index" : "mylogs-000001",
  "new_index" : "mylogs-000002",
  "rolled_over" : false,
  "dry_run" : false,
  "conditions" : {
    "[max_docs: 100000]" : false,
    "[max_age: 7d]" : false
  }
}
  3. In case your alias doesn't point to a single index, a similar error is returned:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "source alias maps to multiple indices"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "source alias maps to multiple indices"
  },
  "status" : 400
}

How it works...

A rolling index is a special alias that manages the automatic creation of a new index when one of its conditions matches.

This is a very handy functionality because it is fully managed by Elasticsearch, reducing the need for a lot of custom backend user code.

The information for creating the new index is taken from the source index, but custom settings can also be applied at index creation time.

The naming convention is managed by Elasticsearch, which automatically increments the numeric part of the index name (by default, it uses six trailing digits).
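As a sketch of this naming convention (our own helper, not the server's actual implementation), incrementing the trailing numeric suffix while preserving its zero-padding looks like this:

```python
import re

def next_rollover_name(index_name: str) -> str:
    """Compute the name a rollover would produce: the trailing numeric
    suffix is incremented and kept left-padded with zeros."""
    match = re.search(r"^(.*-)(\d+)$", index_name)
    if not match:
        raise ValueError("index name must end in '-<number>', e.g. 'mylogs-000001'")
    prefix, digits = match.groups()
    return f"{prefix}{int(digits) + 1:0{len(digits)}d}"

print(next_rollover_name("mylogs-000001"))  # mylogs-000002
```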

See also

See the Using index aliases recipe in this chapter to manage the aliases of your indices.

Indexing a document

In Elasticsearch, there are two vital operations: index and search.

Indexing means storing one or more documents in an index: a concept similar to inserting records in a relational database.

In Lucene, the core engine of Elasticsearch, inserting or updating a document has the same cost: in Lucene and Elasticsearch, updating means replacing.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or others. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the index and the mapping created in the Putting a mapping in an index recipe.

How to do it...

To index a document, several REST entry points can be used:

Method      URL
POST        http://<server>/<index_name>/_doc
PUT/POST    http://<server>/<index_name>/_doc/<id>
PUT/POST    http://<server>/<index_name>/_doc/<id>/_create

To index a document, we need to perform the following steps:

  1. If we consider the order type of the previous chapter, the call to index a document will be as follows:
POST /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw
{
  "id": "1234",
  "date": "2013-06-07T12:14:54",
  "customer_id": "customer1",
  "sent": true,
  "in_stock_items": 0,
  "items": [
    {
      "name": "item1",
      "quantity": 3,
      "vat": 20
    },
    {
      "name": "item2",
      "quantity": 2,
      "vat": 20
    },
    {
      "name": "item3",
      "quantity": 1,
      "vat": 10
    }
  ]
}
  2. If the index operation was successful, the result returned by Elasticsearch should be as follows:
{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

The index operation returns some additional information, such as the following:

  • An auto-generated ID if it's not specified (in this example: 2qLrAfPVQvCRMe7Ku8r0Tw)
  • The version of the indexed document as per the optimistic concurrency control (the version is 1 because it was the document's first time of saving or updating)
  • Whether the record has been created ("result": "created" in this example)

How it works...

One of the most used APIs in Elasticsearch is the index API. Basically, indexing a JSON document internally consists of the following steps:

  1. Routing the call to the correct shard based on the ID, or routing, or parent metadata. If the ID is not supplied by the client, a new one is created (see the Managing your data recipe in Chapter 1, Getting Started, for details).
  2. Validating the sent JSON.
  3. Processing the JSON according to the mapping. If new fields are present in the document (and the mapping can be updated), new fields are added in the mapping.
  4. Indexing the document in the shard. If the ID already exists, it is updated.
  5. If it contains nested documents, it extracts them, and it processes them separately.
  6. Returning information about the saved document (ID and versioning).

Choosing the right ID for indexing your data is important. If you don't provide one, Elasticsearch will automatically associate a new ID with your document during the indexing phase. For better performance, IDs should generally have the same character length, to improve the balance of the data tree that stores them.
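One common way to obtain IDs of uniform length (similar in spirit to the URL-safe, fixed-width IDs Elasticsearch auto-generates, though not the same algorithm) is to base64-encode random UUID bytes; fixed_length_id is a hypothetical helper:

```python
import base64
import uuid

def fixed_length_id() -> str:
    # URL-safe base64 of 16 random bytes is always 24 characters,
    # or 22 once the trailing '==' padding is stripped.
    return base64.urlsafe_b64encode(uuid.uuid4().bytes).decode().rstrip("=")

print(len(fixed_length_id()))  # 22
```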

Due to the nature of REST calls, it is better to avoid non-ASCII characters in IDs, because of URL encoding and decoding (or to make sure that the client framework you use escapes them correctly).

Depending on the mapping, other actions take place during the indexing phase: propagation to replicas, nested processing, and the percolator.

The document will be available for standard search calls after a refresh (forced with an API call, or after a time slice of 1 second, being near real time). GET APIs on a document don't require a refresh, and the document is available instantly.

It is also possible to force a refresh by specifying the refresh parameter during indexing.

There's more...

Elasticsearch allows you to pass several query parameters in the index API URL to control how the document is indexed. The most commonly used ones are as follows:

  • routing: This controls the shard to be used for indexing, that is:
POST /myindex/_doc?routing=1
  • consistency(one/quorum/all): By default, an index operation succeeds if a quorum (>replica/2+1) of active shards is available. The right consistency value can be changed for index action:
POST /myindex/_doc?consistency=one
  • replication (sync/async): Elasticsearch returns from an index operation when all the shards of the current replication group have executed the index operation. Setting the async replication allows us to execute the index action synchronously only on the primary shard and asynchronously on secondary shards. In this way, the API call returns the response action faster:
POST /myindex/_doc?replication=async

  • version: The version allows us to use optimistic concurrency control (http://en.wikipedia.org/wiki/Optimistic_concurrency_control). The first time a document is indexed, version 1 is set on the document. At every update, this value is incremented. Optimistic concurrency control is a way to manage concurrency in every insert or update operation. The passed version value is the last seen version (usually returned by a get or a search). The index happens only if the current index version value is equal to the passed one:
POST /myindex/_doc?version=2
  • op_type: This can be used to force a create on a document. If a document with the same ID exists, the index fails:
POST /myindex/_doc?op_type=create
  • refresh: This forces a refresh after having indexed the document. It allows documents to be ready for searching after their indexing:
POST /myindex/_doc?refresh=true
  • timeout: This defines a time to wait for the primary shard to be available. Sometimes, the primary shard is not in a writable status (if it's relocating or recovering from a gateway) and a timeout for the write operation is raised after 1 minute:
POST /myindex/_doc?timeout=5m
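The version parameter's behavior can be illustrated with a tiny in-memory simulation of version-based optimistic concurrency control (store, index_doc, and VersionConflict are our own names for this sketch, not Elasticsearch APIs):

```python
class VersionConflict(Exception):
    pass

# In-memory stand-in for a shard: id -> (version, document)
store = {}

def index_doc(doc_id, doc, version=None):
    """Index a document, enforcing version-based optimistic concurrency:
    the write succeeds only if the passed version matches the stored one."""
    current = store.get(doc_id)
    if current is not None and version is not None and version != current[0]:
        raise VersionConflict(f"expected version {current[0]}, got {version}")
    new_version = 1 if current is None else current[0] + 1
    store[doc_id] = (new_version, doc)
    return new_version

assert index_doc("1", {"qty": 1}) == 1              # first index: version 1
assert index_doc("1", {"qty": 2}, version=1) == 2   # matching version: accepted
try:
    index_doc("1", {"qty": 9}, version=1)           # stale version: rejected
except VersionConflict:
    print("conflict")
```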

See also

You can refer to the following recipes in this chapter for further reference:

  • The Getting a document recipe in this chapter to learn how to retrieve a stored document
  • The Deleting a document recipe in this chapter to learn how to delete a document
  • The Updating a document recipe in this chapter to learn how to update fields in a document
  • For optimistic concurrency control, that is, the Elasticsearch way to manage concurrency on a document, a good reference place can be found at http://en.wikipedia.org/wiki/Optimistic_concurrency_control.

Getting a document

After a document is indexed, it will probably need to be retrieved at some point during the life of your application.

The GET REST call allows us to get a document in real time, without the need for a refresh.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or others. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the document indexed in the Indexing a document recipe.

How to do it...

The GET method allows us to return a document given its index, type, and ID.

The REST API URL is as follows:

http://<server>/<index_name>/_doc/<id>

To get a document, we will perform the following steps:

  1. If we consider the document that we indexed in the previous recipe, the call will be as follows:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw
  2. The result returned by Elasticsearch should be the indexed document:
{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "id" : "1234",
    "date" : "2013-06-07T12:14:54",
    "customer_id" : "customer1",
    ... truncated ...
  }
}
  3. Our indexed data is contained in the _source parameter, but other information is returned:
    • _index: The index that stores the document
    • _type: The type of the document
    • _id: The ID of the document
    • _version: The version of the document
    • found: Whether the document has been found

If the record is missing, a 404 error is returned as the status code, and the returned JSON will be as follows:

{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Kud8r0Tw",
  "found" : false
}

How it works...

The Elasticsearch GET API on a document doesn't require a refresh: all GET calls are in real time.

This call is very fast, because Elasticsearch redirects the search only to the shard that contains the document, without any other overhead, and document IDs are often cached in memory for fast lookup.

The source of the document is only available if the _source field is stored (the default setting in Elasticsearch).

There are several additional parameters that can be used to control the get call:

  • _source allows us to retrieve only a subset of fields. This is very useful for reducing bandwidth or for retrieving calculated fields such as the attachment-mapping ones:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?_source=date,sent
  • stored_fields, similar to source, allows us to retrieve only a subset of fields that are marked as stored in the mapping. Stored fields are kept in a separated memory portion of the index, and they can be retrieved without parsing the JSON source:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?stored_fields=date,sent
  • routing allows us to specify the shard to be used for the get operation. To retrieve a document, the routing used in indexing time must be the same as the search time:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?routing=customer_id
  • refresh allows us to refresh the current shard before performing the get operation (it must be used with care because it slows down indexing and introduces some overhead):
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?refresh=true
  • preference allows us to control which shard replica is chosen to execute the GET method. Generally, Elasticsearch chooses a random shard for the GET call. The possible values are as follows:
    • _primary for the primary shard.
    • _local, first trying the local shard and then falling back to a random choice. Using the local shard reduces the bandwidth usage and should generally be used with auto-replicating shards (replica set to 0-all).
    • custom value for selecting a shard-related value, such as customer_id and username.

There's more...

The GET API is very fast, so a good practice when developing applications is to use it as much as possible. Choosing the right form of ID during application development can lead to big performance gains.

If the shard that contains the document is not bound to its ID, fetching the document requires a query with an IDs query (which we will see in Chapter 6, Text and Numeric Queries, in the Using an IDs query recipe).

If you don't need to fetch the record, but only want to check its existence, you can replace GET with HEAD; the response will be status code 200 if the document exists, or 404 if it is missing.

The GET call also has a special endpoint, _source, that allows you to fetch only the source of the document.

The GET source REST API URL is as follows:

http://<server>/<index_name>/_doc/<id>/_source

To fetch the source of the previous order, we will call the following:

GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw/_source

See also

See the Speeding up GET operations recipe in this chapter to learn how to execute multiple GET operations in one shot to reduce fetch time.

Deleting a document

There are two ways to delete a document in Elasticsearch: using the DELETE call, or using the delete_by_query call, which we will look at in the next chapter.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or others. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the document indexed in the Indexing a document recipe.

How to do it...

The REST API URL is the same as that of the GET call, but the HTTP method is DELETE:

http://<server>/<index_name>/_doc/<id>

To delete a document, we will perform the following steps:

  1. If we consider the order indexed in the Indexing a document recipe, the call to delete a document will be as follows:
DELETE /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw
  2. The result returned by Elasticsearch should be as follows:
{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}
  3. If the record is missing, a 404 is returned as the status code, and the return JSON will be as follows:
{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 3,
  "result" : "not_found",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 4,
  "_primary_term" : 1
}

How it works...

Deleting a record hits only the shards that contain the document, so there is no overhead. If the document is a child document, the parent must be set in order to look up the correct shard.

There are several additional parameters that can be used to control the delete call. The most important ones are as follows:

  • routing, which allows you to specify the shard to be used for the delete operation
  • version, which allows you to define a version of the document to be deleted to prevent modification of that document
The DELETE operation has no restore functionality. Every document that is deleted is lost forever.

Deleting a record is a fast operation and very easy to use if the IDs of the documents to delete are available. Otherwise, we must use the delete_by_query call, which we will look at in the next chapter.

See also

Refer to Chapter 4, Exploring Search Capabilities, to learn how to delete a bunch of documents that match a query.

Updating a document

Documents stored in Elasticsearch can be updated during their lifetime. There are two available solutions for performing this operation in Elasticsearch: adding a new document, or using the update call.

The update call can work in two ways:

  • By providing a script that uses the update strategy
  • By providing a document that must be merged with the original one

The main advantage of an update over an index operation is the reduction in networking.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or others. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the document indexed in the Indexing a document recipe.

To use dynamic scripting languages, they must be enabled. See Chapter 9, Managing Clusters, for more details.

How to do it...

As we are changing the state of the data, the HTTP method is POST, and the REST URL is as follows:

http://<server>/<index_name>/_update/<id>

The REST format has changed from previous versions of Elasticsearch.

To update a document, we will perform the following steps:

  1. If we consider the type order of the previous recipe, the call to update a document will be as follows:
POST /myindex/_update/2qLrAfPVQvCRMe7Ku8r0Tw
{
  "script": {
    "source": "ctx._source.in_stock_items += params.count",
    "params": {
      "count": 4
    }
  }
}
  2. If the request is successful, the result returned by Elasticsearch should be as follows:
{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 4,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 8,
  "_primary_term" : 1
}
  3. The record will be as follows:
{
  "_index" : "myindex",
  "_type" : "_doc",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 8,
  "_seq_no" : 12,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "id" : "1234",
    "date" : "2013-06-07T12:14:54",
    "customer_id" : "customer1",
    "sent" : true,
    "in_stock_items" : 4,
... truncated ...
  }
}

The visible changes are as follows:

  • The scripted field has been changed
  • The version has been incremented

How it works...

The update operation takes a document, applies to it the changes required by the script or by the update document, and reindexes the changed document. In Chapter 8, Scripting in Elasticsearch, we will explore the scripting capabilities of Elasticsearch.

The standard language for scripting in Elasticsearch is Painless, and it is used in these examples.

The script can operate on ctx._source: the source of the document (it must be stored for this to work), and it can change the document in place. It is possible to pass parameters to a script via a JSON object; these parameters are available in the execution context.

A script can control the Elasticsearch behavior after the script's execution by setting the ctx.op value of the context. The available values are as follows:

  • ctx.op="delete" by which the document will be deleted after the script's execution.
  • ctx.op="none" by which the document will skip the indexing process. A good practice to improve performance is to set ctx.op="none" so that the script doesn't update the document, thus preventing a reindexing overhead.

ctx also manages the timestamp of the record in ctx._timestamp. It is possible to pass an additional object in the upsert property, which will be used if the document is not available in the index:

POST /myindex/_update/2qLrAfPVQvCRMe7Ku8r0Tw
{
  "script": {
    "source": "ctx._source.in_stock_items += params.count",
    "params": {
      "count": 4
    }
  },
  "upsert": {
    "in_stock_items": 4
  }
}

If you need to replace some field values, a good solution is not to write a complex update script, but to use the special doc property, which allows us to overwrite the values of an object. The document provided in the doc parameter will be merged with the original one. This approach is easier to use, but it cannot set ctx.op, so if the update doesn't change the values of the original document, the subsequent phases will always be executed:

POST /myindex/_update/2qLrAfPVQvCRMe7Ku8r0Tw
{
  "doc": {
    "in_stock_items": 10
  }
}

If the original document is missing, the provided doc value can be used for the upsert (the document to be created) by passing the doc_as_upsert parameter:

POST /myindex/_update/2qLrAfPVQvCRMe7Ku8r0Tw
{
  "doc": {
    "in_stock_items": 10
  },
  "doc_as_upsert": true
}

Using Painless scripting, it is possible to apply advanced operations to fields, such as the following:

  • Remove a field, that is:
"script" : {"inline": "ctx._source.remove("myfield"}}
  • Add a new field, that is:
"script" : {"inline": "ctx._source.myfield=myvalue"}}

The update REST call is very useful because of the following advantages:

  • It reduces bandwidth usage, because the update operation doesn't require a round trip of the data to the client
  • It's safer, because it automatically manages optimistic concurrency control: if a change happens during script execution, the script is re-executed with the updated data
  • It can be bulk-executed

See also

See the next recipe, Speeding up atomic operations, to learn how to use bulk operations to reduce network load and speed up ingestion.

Speeding up atomic operations (bulk operations)

When we are inserting, deleting, or updating a large number of documents, the HTTP overhead is significant. To speed up this process, Elasticsearch allows the execution of bulk CRUD calls.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or others. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

How to do it...

As we are changing the state of the data, the HTTP method is POST, and the REST URL is as follows:

http://<server>/<index_name>/_bulk

To execute a bulk action, we will perform the following steps via curl (as it is very common to prepare data files and send them to Elasticsearch via the command line):

  1. We need to collect the create/index/delete/update commands in a structure made of bulk JSON lines, composed of a line of action with metadata, and another optional line of data related to the action. Every line must end with a new line \n. A bulk data file should be presented like this:
{ "index":{ "_index":"myindex", "_id":"1" } }
{ "field1" : "value1", "field2" : "value2" }
{ "delete":{ "_index":"myindex", "_id":"2" } }
{ "create":{ "_index":"myindex", "_id":"3" } }
{ "field1" : "value1", "field2" : "value2" }
{ "update":{ "_index":"myindex", "_id":"3" } }
{ "doc":{"field1" : "value1", "field2" : "value2" }}
  2. This file can be sent with the following POST:
curl -s -XPOST localhost:9200/_bulk --data-binary @bulkdata;
  3. The result returned by Elasticsearch should collect all the responses of the actions.

You can execute the previous commands in Kibana with the following call:

POST /_bulk
{ "index":{ "_index":"myindex", "_id":"1" } }
{ "field1" : "value1", "field2" : "value2" }
{ "delete":{ "_index":"myindex", "_id":"2" } }
{ "create":{ "_index":"myindex", "_id":"3" } }
{ "field1" : "value1", "field2" : "value2" }
{ "update":{ "_index":"myindex", "_id":"3" } }
{ "doc":{"field1" : "value1", "field2" : "value2" }}

How it works...

The bulk operation allows different calls to be aggregated into a single one: each action consists of a header part containing the operation to be executed, and a body part with the data for actions such as index, create, and update.

The header is composed of the action name and an object of its parameters. Looking at the previous index example, we have the following:

{ "index":{ "_index":"myindex", "_id":"1" } }

For index and create actions, an extra body with the data is required:

{ "field1" : "value1", "field2" : "value2" }

The delete action doesn't require optional data, so it is composed only of the header:

{ "delete":{ "_index":"myindex", "_id":"1" } }

Finally, a bulk update action can be used with a format similar to the index one:

{ "update":{ "_index":"myindex", "_id":"3" } }

The header accepts all the common parameters of the update action, such as doc, upsert, doc_as_upsert, lang, script, and params. To control the number of retries in the case of concurrency, the bulk update defines the _retry_on_conflict parameter, which can be set to the number of retries to be performed before raising an exception.

So, a possible body for an update is as follows:

{ "doc":{"field1" : "value1", "field2" : "value2" }}

Bulk items can accept several parameters, such as the following:

  • routing, to control the routing shard.
  • parent, to select a parent item shard. This is required if you are indexing some child documents.

Global bulk parameters that can be passed using query arguments are as follows:

  • consistency (one, quorum, all) (default quorum), which controls the number of active shards required before executing write operations.
  • refresh (default false), which forces a refresh in the shards involved in the bulk operations. The newly indexed documents will be available immediately, without having to wait for the standard refresh interval (1s).
  • pipeline, which forces indexing using the provided ingest pipeline.

Previous versions of Elasticsearch required the user to pass the _type value, but it was removed in version 7.x due to type removal.

Generally, Elasticsearch client libraries that use the Elasticsearch REST API automatically implement the serialization of bulk commands.

The correct number of commands to serialize in a bulk execution is a user choice, but there are some things to consider:

  • In standard configuration, Elasticsearch limits the HTTP call to 100 MB in size. If the size is over that limit, the call is rejected.
  • Multiple complex commands take a lot of time to be processed, so pay attention to client timeout.
  • A bulk with only a small number of commands doesn't improve performance.

If the documents are not too large, a bulk of 500 commands can be a good starting point, and it can be tuned depending on the data structure (the number of fields, the number of nested objects, field complexity, and so on).
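The NDJSON layout described above (a header line, an optional body line, and a terminating newline) can be produced with a small helper; build_bulk_body is our own sketch, not a client-library function:

```python
import json

def build_bulk_body(actions):
    """Serialize (action_header, optional_body) pairs into the NDJSON
    format expected by the _bulk endpoint: one JSON object per line,
    with the whole payload terminated by a newline."""
    lines = []
    for header, body in actions:
        lines.append(json.dumps(header))
        if body is not None:  # delete actions carry no body line
            lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"

payload = build_bulk_body([
    ({"index": {"_index": "myindex", "_id": "1"}}, {"field1": "value1"}),
    ({"delete": {"_index": "myindex", "_id": "2"}}, None),
])
print(payload)
```

The resulting string can be posted to /_bulk with any HTTP client (with curl, via --data-binary, so that the newlines are preserved).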

Speeding up GET operations (multi GET)

The standard GET operation is very fast, but if you need to fetch a lot of documents by ID, Elasticsearch provides the multi GET operation.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To execute the following commands correctly, use the document that we indexed in the Indexing a document recipe.

How to do it...

The multi GET REST URLs are as follows:

http://<server>/_mget
http://<server>/<index_name>/_mget

To execute a multi GET action, we will perform the following steps:

  1. The method is POST with a body that contains a list of document IDs and the index or type if they are missing. As an example, using the first URL, we need to provide the index, type, and ID:
POST /_mget
{
  "docs": [
    {
      "_index": "myindex",
      "_id": "2qLrAfPVQvCRMe7Ku8r0Tw"
    },
    {
      "_index": "myindex",
      "_id": "2"
    }
  ]
}

This kind of call allows us to fetch documents from several different indices and types.

  2. If the index and the type are fixed, a call should also be in the following form:
GET /myindex/_mget
{
  "ids" : ["1", "2"]
}

The multi GET result is an array of documents.

How it works...

The multi GET call is a shortcut for executing many get commands in one shot.

Internally, Elasticsearch spreads the get operations in parallel over several shards and collects the results to return to the user.

The get object can contain the following parameters:

  • _index: The index that contains the document. It can be omitted if passed in the URL.
  • _id: The document ID.
  • stored_fields: (optional) A list of fields to retrieve.
  • _source: (optional) Source filter object.
  • routing: (optional) The shard routing parameter.

The advantages of multi GET are as follows:

  • Reduced networking traffic, both internally and externally for Elasticsearch
  • Increased speed if used in an application: the time for processing a multi GET is quite similar to that of a standard get
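To keep multi GET request sizes under control, a client can split a long ID list into several _mget bodies; mget_bodies is a hypothetical helper illustrating the simple {"ids": [...]} form used against /<index>/_mget:

```python
def mget_bodies(index, ids, chunk_size=100):
    """Split a large list of document IDs into _mget request bodies,
    yielding (index, body) pairs of at most chunk_size IDs each."""
    for start in range(0, len(ids), chunk_size):
        yield index, {"ids": ids[start:start + chunk_size]}

bodies = list(mget_bodies("myindex", [str(i) for i in range(250)], chunk_size=100))
print(len(bodies))  # 3 request bodies: 100 + 100 + 50 IDs
```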

See also

See the Getting a document recipe in this chapter to learn how to execute a simple get and the general parameters of GET calls.