Reading notes on "Elasticsearch 7.0 Cookbook, Fourth Edition": Exploring Search Capabilities

Exploring Search Capabilities

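Now that we have set up the mappings and put the data into the indices, we can start exploring the search capabilities in Elasticsearch. In this chapter, we will cover searching using different factors: sorting, highlighting, scrolling, suggesting, counting, and deleting. These actions are the core part of Elasticsearch; ultimately, everything in Elasticsearch is about serving the query and returning good-quality results.

This chapter is divided into two parts: the first part shows how to perform API calls related to search, and the second part covers two special query operators that are the basis for building complex queries in the following chapters.

In this chapter, we will cover the following recipes: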

  • Executing a search
  • Sorting results
  • Highlighting results
  • Executing a scrolling query
  • Using the search_after functionality
  • Returning inner hits in results
  • Suggesting a correct query
  • Counting matched results
  • Explaining a query
  • Query profiling
  • Deleting by query
  • Updating by query
  • Matching all the documents
  • Using a Boolean query
  • Using the search template

Technical requirements

Executing a search

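Elasticsearch was born as a search engine; its main purpose is to process queries and return results as fast as possible. In this recipe, we'll see that a search in Elasticsearch is not limited to matching documents: it can also compute the additional information required to improve search quality.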

Getting ready

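You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

The mapping used for all the queries and searches in this chapter is similar to the following book representation: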

[Figure: book representation used for the mapping]

The command to create the mapping is as follows:

PUT /mybooks
 {
   "mappings": {
     "properties": {
       "join_field": {
         "type": "join",
         "relations": {
           "order": "item"
...
...
       "title": {
         "term_vector": "with_positions_offsets",
         "store": true,
         "type": "text",
         "fielddata": true,
         "fields": {
           "keyword": {
             "type": "keyword",
             "ignore_above": 256
           }
         }
       }
     }
   }
 }

How to do it...

To execute a search and view the results, we will perform the following steps:

  1. From the command line, we can execute a search as follows:
GET /mybooks/_search
 {
   "query": {
     "match_all": {}
   }
 }

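In this case, we have used a match_all query that returns all the documents. We'll discuss this kind of query in the Matching all the documents recipe in this chapter.

  2. If everything works, the command will return the following: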
{
   "took" : 0,
   "timed_out" : false,
   "_shards" : {
     "total" : 1,
     "successful" : 1,
     "skipped" : 0,
     "failed" : 0
   },
   "hits" : {
     "total" : 3,
     "max_score" : 1.0,
     "hits" : [
...
...
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "3",
         "_score" : 1.0,
         "_source" : {...truncated...}
       }
      ]
   }
 }

These results contain the following information:

  • took is the time, in milliseconds, required to execute the query.
  • timed_out indicates whether a timeout occurred during the search. This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results.
  • _shards is the status of shards divided into the following sections:
    • total, which is the number of shards
    • successful, which is the number of shards in which the query was successful
    • skipped, which is the number of shards that are skipped during the search (for example, if you are searching more than 720 shards simultaneously)
    • failed, which is the number of shards in which the query failed, because some error or exception occurred during the query
  •  hits are the results, and are composed of the following:
    • total is the number of documents that match the query.
    • max_score is the match score of the first document. It is usually 1.0 if no match scoring was computed, for example, in sorting or filtering.
    • hits, which is a list of result documents.

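The resulting documents have a lot of fields that are always available and others that depend on the search parameters. The most important fields are as follows: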

  • _index: The index that contains the document.
  • _type: The type of the document (that is, _doc). It will disappear in future ES versions.
  • _id: The ID of the document.
  • _source: The document source, that is, the original JSON sent to Elasticsearch.
  • _score: Query score of the document (if the query doesn't require a score, it's 1.0).
  • sort: If the document is sorted, values that are used for sorting.
  • highlight: Highlighted segments if highlighting was requested.
  • stored_fields: Some fields can be retrieved without needing to fetch the source object.
  • script_fields: Some fields that can be computed using scripting.

How it works...

A search in Elasticsearch is a distributed computation composed of many steps; the main ones are as follows:

  1. The query body is validated on the coordinating node (the node that receives the request)
  2. The indices to be used in the query are selected, and one copy of each shard is chosen to serve the request
  3. The query is executed on the data nodes, each of which collects the top hits of the query
  4. The results are aggregated and scored on the coordinating node
  5. The results are returned to the user
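
The scatter and gather steps above can be sketched in a few lines of Python. This is only a toy model and not how Elasticsearch is implemented: each "shard" is an in-memory list of (score, doc_id) pairs, and the function names `shard_top_hits` and `merge_top_hits` are hypothetical.

```python
import heapq

def shard_top_hits(shard_docs, size):
    """Query phase on one shard: return its `size` best (score, doc_id) pairs."""
    return heapq.nlargest(size, shard_docs)

def merge_top_hits(per_shard_hits, size):
    """Reduce phase on the coordinating node: keep the global top `size` hits."""
    all_hits = [hit for hits in per_shard_hits for hit in hits]
    return heapq.nlargest(size, all_hits)

# Hypothetical shards, each holding (score, doc_id) pairs.
shards = [
    [(1.2, "a"), (0.4, "b"), (0.9, "c")],
    [(2.0, "d"), (0.1, "e")],
    [(0.7, "f"), (1.5, "g")],
]
per_shard = [shard_top_hits(docs, 2) for docs in shards]
top = merge_top_hits(per_shard, 2)
print(top)
```

Each shard only needs to return as many hits as the request asks for, which is why the coordinating node can merge small lists instead of full result sets.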

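The following diagram shows how a query is distributed in the cluster:

[Figure: distribution of a query across the cluster]

The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows:

http://<server>/_search
http://<server>/<index_name(s)>/_search
Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use the POST call.

Multiple indices and types are comma-separated. If an index or a type is defined, the search is limited to them. One or more aliases can be used as index names.

The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following: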

  • q: This is the query string to perform simple string queries, which can be done as follows:
GET /mybooks/_search?q=uuid:11111
  • df: This is the default field to be used within the query and can be done as follows:
GET /mybooks/_search?df=uuid&q=11111
  • from (the default value is 0): The start index of the hits.
  • size (the default value is 10): The number of hits to be returned.
  • analyzer: The default analyzer to be used.
  • default_operator (the default value is OR): This can be set to AND or OR.
  • explain: This allows the user to return information about how the score is calculated. It can be used as follows:
GET /mybooks/_search?q=title:joe&explain=true
  • stored_fields: These allow the user to define fields that must be returned, and can be done as follows:
GET /mybooks/_search?q=title:joe&stored_fields=title
  • sort (the default value is score): This allows the user to change the order of the documents. Sorting is ascending by default; if you need to change the order, add desc to the field, as follows:
GET /mybooks/_search?sort=title.keyword:desc
  • timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until a timeout. If a timeout is fired, all the hits that have been accumulated are returned.
  • search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html.
  • track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It's used in conjunction with sort because sorting by default prevents the return of a match score.
  • pretty (the default value is false): If true, the results will be pretty-printed.
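
As an illustration of the URI parameters listed above, the following sketch assembles a search URL in Python. The `search_url` helper and the `http://localhost:9200` address are assumptions made for this example; only the parameter names come from the list.

```python
from urllib.parse import urlencode

def search_url(server, index, **params):
    """Build a _search URL; booleans are lowercased to true/false."""
    query = {k: str(v).lower() if isinstance(v, bool) else v
             for k, v in params.items()}
    return f"{server}/{index}/_search?{urlencode(query)}"

# Assumed local node address; adjust to your own cluster.
url = search_url("http://localhost:9200", "mybooks",
                 q="title:joe", explain=True, size=5)
print(url)
```

Note that `urlencode` percent-encodes the colon in `title:joe`, which is the kind of escaping the Kibana console otherwise handles for you.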

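Generally, the query contained in the body of the search is a JSON object. The body of the search is the core of Elasticsearch's search functionality; the list of search capabilities grows with every release. For the current version (7.x) of Elasticsearch, the available parameters are as follows: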

  • query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.
  • from/size: These allow the user to control pagination. The from parameter defines the start position of the hits to be returned (default 0), and size defines how many hits are returned (default 10).
The pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score, or a new document is ingested. If you need to process all the result documents without repetition, you need to execute a scroll query.
  • sort: This allows the user to change the order of the matched documents. This option is fully covered in the Sorting results recipe.
  • post_filter: This allows the user to filter out the query results without affecting the aggregation count. It's usually used for filtering by facet values.
  • _source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*), or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).
  • fielddata_fields: This allows the user to return a field data representation of the field.
  • stored_fields: This controls the fields to be returned.
Returning only the required fields reduces the network and memory usage, thus improving performance. The suggested way to retrieve custom fields is to use the _source filtering function because it doesn't need to use Elasticsearch's extra resources.
  • aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.
  • index_boost: This allows the user to define the per-index boost value. It is used to increase/decrease the score of results in boosted indices.
  • highlight: This allows the user to define fields and settings to be used for calculating a query abstract (see the Highlighting results recipe in this chapter).
  • version (the default value is false): This adds the version of a document in the results.
  • rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.
  • min_score: If this is given, all the result documents that have a score lower than this value are rejected.
  • explain: This returns information on how the TF/IDF score is calculated for a particular document.
  • script_fields: This defines a script that computes extra fields via scripting to be returned with a hit. We'll look at Elasticsearch scripting in Chapter 8, Scripting in Elasticsearch.
  • suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement a Google-like "did you mean" functionality (see the Suggesting a correct query recipe).
  • search_type: This defines how Elasticsearch should process a query. We'll see the scrolling query in the Executing a scrolling query recipe in this chapter.
  • scroll: This controls the scrolling in scroll/scan queries. scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.
  • _name: This allows the user to return, for every hit, the names of the named queries that it matches. It's very useful if you have a Boolean query and you want to know which of its subqueries matched.
  • search_after: This allows the user to skip results using the most efficient way of scrolling. We'll see this functionality in the Using search_after functionality recipe in this chapter.
  • preference: This allows the user to select which shard/s to use for executing the query.

There's more...

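To improve the quality of the results score, Elasticsearch provides the rescore functionality. This capability allows the user to reorder the top documents with another query that's generally much more expensive (in CPU or time), for example, if the query contains a lot of match queries or scripting. This approach allows the user to execute the rescore query on just a small subset of results, reducing the overall computation time and resources.

The rescore query, like every query, is executed at the shard level, so it's automatically distributed.

The best candidates to be executed in the rescore query are complex queries with a lot of nested options, and anything that uses scripting (due to the massive overhead of scripting languages).

The following example shows how to execute a fast query (a match query) in the first phase and then rescore it with a match_phrase query in the rescore section:

POST /mybooks/_search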
 {
   "query": {
     "match": {
       "description": {
         "operator": "or",
         "query": "nice guy joe"
       }
     }
   },
   "rescore": {
     "window_size": 100,
     "query": {
       "rescore_query": {
         "match_phrase": {
           "description": {
             "query": "joe nice guy",
             "slop": 2
           }
         }
       },
       "query_weight": 0.8,
       "rescore_query_weight": 1.5
     }
   }
 }

The rescore parameters are as follows:

  • window_size: The example is 100. This controls how many results per shard must be considered in the rescore functionality.
  • query_weight: The default value is 1.0, and the rescore_query_weight default value is 1.0. These are used to compute the final score using the following formula:

final_score = query_weight * query_score + rescore_query_weight * rescore_score

If the user wants to keep only the rescore score, they can set query_weight to 0.
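
A small worked example may help with the weights. The sketch below assumes the default way of combining the two scores, a weighted sum, and uses made-up scores of 2.0 for the first-pass query and 1.0 for the rescore query; the `rescored` function is hypothetical and only mirrors the arithmetic.

```python
def rescored(query_score, rescore_score,
             query_weight=1.0, rescore_query_weight=1.0):
    """Weighted sum of the first-pass score and the rescore-pass score."""
    return query_weight * query_score + rescore_query_weight * rescore_score

# Weights taken from the example request; the scores are invented.
final = rescored(2.0, 1.0, query_weight=0.8, rescore_query_weight=1.5)

# With query_weight set to 0, only the weighted rescore score remains.
only_rescore = rescored(2.0, 1.0, query_weight=0, rescore_query_weight=1.5)
print(final, only_rescore)
```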

See also

You can refer to the following recipes, which are related to this recipe, for further reference:

  • Executing an aggregation recipe in Chapter 7, Aggregations, explains how to use the aggregation framework during queries
  • Highlighting results recipe in this chapter explains how to use the highlighting functionality for improving the user experience in results
  • Executing a scrolling query recipe in this chapter covers how to efficiently paginate results
  • Suggesting terms for a query recipe in this chapter helps to correct text queries

Sorting results

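When searching for results, the standard criterion for sorting in Elasticsearch is the relevance to a text query. Real-world applications often need to control the sorting criteria in scenarios such as the following: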

  • Sorting a user by last name and first name
  • Sorting items by stock symbols, price (ascending and descending)
  • Sorting documents by size, file type, source, and so on
  • Sorting items by the maximum, minimum, or average of some child fields

Getting ready

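You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To sort the results, we will perform the following steps: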

  1. Add a sort section to your query, as follows:
GET /mybooks/_search
{
   "query": {
     "match_all": {}
   },
   "sort": [
     {
       "price": {
         "order": "asc",
         "mode": "avg",
         "unmapped_type": "double",
         "missing": "_last"
       }
     },
   "_score"
   ]
 }
  2. The returned result should be similar to the following:
...truncated...
   "hits" : {
   "total" : 3,
   "max_score" : null,
   "hits" : [
     {
       "_index" : "mybooks",
       "_type" : "_doc",
       "_id" : "1",
       "_score" : 1.0,
       "_source" : {
         ...truncated...
         "price" : 4.3,
         "quantity" : 50
       },
       "sort" : [
         4.3,
         1.0
       ]

...truncated...

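The sort result is very special: an extra sort field is created to collect the values used for sorting.

How it works...

The sort parameter can be defined as a list that can contain both simple strings and JSON objects. A sort string is the name of the field (such as field1, field2, field3, field4, and so on) used for sorting, similar to the ORDER BY clause in SQL.

The JSON object allows users to use extra parameters, as follows:

  • order (asc or desc): This defines whether the order must be considered ascending (default) or descending.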
  • unmapped_type (long or int or double or string, and so on): This defines the type of the sort parameter if the value is missing. It's a best practice to define it to prevent sorting errors due to missing values.
  • missing (_last or _first): This defines how to manage missing values—whether to put them at the end (_last) of the results or at the start (_first).
  • mode: This defines how to manage multi-value fields. The possible values are as follows:
    • min: The minimum value is chosen (that is to say that in the case of multi-price on an item, it chooses the lowest for comparison).
    • max: The maximum value is chosen.
    • sum: The sort value will be computed as the sum of all the values. This mode is only available on numeric array fields.
    • avg: The sort value will be the average of all the values. This mode is only available on numeric array fields.
    • median: The sort value will be the median of all the values. This mode is only available on numeric array fields.
If we want to add the relevance score value to the sort list, we must use the special _score sort field.

If you are sorting on nested objects, two extra parameters can be used, as follows:

  • nested_path: This defines the nested object to be used for sorting. The field defined for sorting will be relative to the nested_path. If not defined, then the sorting field is related to the document root.
  • nested_filter: This defines a filter that is used to remove nested documents that don't match from the sorting value extraction. This filter allows for a better selection of values to be used in sorting.

For example, if we have an address object nested in a person document and we want to sort by city.name, we can use the following:

  • address.city.name without defining the nested_path
  • city.name if we define a nested_path address
The sorting process requires that the sorting fields of all the matched query documents are fetched to be compared. To prevent high memory usage, it's better to sort on numeric fields and, in the case of string sorting, to choose short text fields processed with an analyzer that doesn't tokenize the text.
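
To make the mode and missing options more concrete, here is a toy Python model of sorting documents on a multi-valued field. It is not Elasticsearch code: the `SORT_MODES` table, the `sort_docs` function, and the sample documents are all hypothetical, and documents without the field are simply placed last, as with "missing": "_last".

```python
from statistics import mean, median

# Hypothetical table mapping each sort mode to a reducer over the values.
SORT_MODES = {"min": min, "max": max, "sum": sum, "avg": mean, "median": median}

def sort_docs(docs, field, mode="min", order="asc"):
    """Order docs by one value per document; docs without the field go last."""
    present = [d for d in docs if d.get(field)]
    missing = [d for d in docs if not d.get(field)]
    present.sort(key=lambda d: SORT_MODES[mode](d[field]),
                 reverse=(order == "desc"))
    return present + missing

docs = [
    {"id": 1, "price": [4.0, 10.0]},
    {"id": 2, "price": [5.0, 6.0]},
    {"id": 3},  # no price: behaves like "missing": "_last"
]
by_min = [d["id"] for d in sort_docs(docs, "price", mode="min")]
by_avg = [d["id"] for d in sort_docs(docs, "price", mode="avg")]
print(by_min, by_avg)
```

The same documents come back in a different order depending on the mode, which is why it is worth setting mode explicitly on multi-valued fields.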

There's more...

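If you are using sort, pay attention to tokenized fields, because the sort order depends on the lowest-order token if ascending and on the highest-order token if descending. In the case of tokenized fields, this behavior differs from a common sort because it is executed at the term level.

For example, if we sort by the title field in descending order, we use the following:

GET /mybooks/_search?sort=title:desc

In the preceding example, the results are as follows: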

{
  ...truncated...
   "hits" : {
     "total" : 3,
     "max_score" : null,
     "hits" : [
       {
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "1",
         "_score" : null,
         "_source" : {
          ...truncated...
           "title" : "Joe Tester",
          ...truncated...
...
...
         "sort" : [
           "bill"
         ]
       }
     ]
   }
 }

The expected SQL-like result can be obtained using a non-tokenized keyword field, in this case title.keyword, as follows:

GET /mybooks/_search?sort=title.keyword:desc

The results are as follows:

{
  ...truncated...
   "hits" : {
     "total" : 3,
     "max_score" : null,
     "hits" : [
...
...
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "2",
        ...truncated...
         "sort" : [
           "Bill Baloney"
         ]
       }
     ]
   }
 }
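
The difference between the two sort values shown above ("bill" versus "Bill Baloney") can be imitated in plain Python. This is a rough sketch under simplifying assumptions: lowercasing and splitting on whitespace stands in for the analyzer, the `text_sort_value` function is hypothetical, and the titles in the second part are invented to show a case where the two orderings differ.

```python
def text_sort_value(text, order):
    """Sort value of a tokenized field: lowest term if asc, highest if desc."""
    terms = text.lower().split()  # crude stand-in for the analyzer
    return min(terms) if order == "asc" else max(terms)

# Titles from the sample data, sorted descending on the tokenized field.
bill = text_sort_value("Bill Baloney", "desc")
joe = text_sort_value("Joe Tester", "desc")

# Invented titles where term-level and whole-value ordering disagree.
titles = ["Zebra Adventures", "Moon Base"]
by_terms = sorted(titles, key=lambda t: text_sort_value(t, "asc"))
by_keyword = sorted(titles)
print(bill, joe, by_terms, by_keyword)
```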

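There are two special sort types: geo distance and scripting.

Geo distance sorting uses the distance from a GeoPoint (location) as the metric to compute the ordering. A sorting example could be as follows: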

...truncated...
  "sort" : [
    {
      "_geo_distance" : {
        "pin.location" : [-70, 40],
        "order" : "asc",
        "unit" : "km"
      }
    }
  ],
...truncated...

It accepts special parameters, such as the following:

  • unit: This defines the metric to be used to compute the distance.
  • distance_type (sloppy_arc or arc or plane): This defines the type of distance to be computed. The _geo_distance name for the field is mandatory.

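See Chapter 2, Managing Mapping, for how GeoPoint fields are defined.

Sorting with scripts will be discussed in Chapter 8, Scripting in Elasticsearch, after we have introduced the scripting capabilities of Elasticsearch.

See also

You can refer to the following recipes, which are related to this recipe, for further reference:

  • The Mapping a GeoPoint field recipe in Chapter 2, Managing Mapping, explains how to correctly create a mapping for a GeoPoint field
  • The Sorting with scripts recipe in Chapter 8, Scripting in Elasticsearch, will explain the use of custom scripts for computing values to sort on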

Highlighting results

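Elasticsearch performs a good job of finding matching results in big text documents. It's useful for searching text in very large blocks, but to improve the user experience, you need to show users the abstract: a small portion of the document text that matches the query. The abstract is a common way to help users understand how the matched document is relevant to them.

The highlight functionality in Elasticsearch is designed to do this job.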

Getting ready

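You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To search and highlight the results, we will need to perform the following steps: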

  1. From the command line, we can execute a search with a highlight parameter, as follows:
GET /mybooks/_search?from=0&size=10
 {
   "query": {
     "query_string": {
       "query": "joe"
     }
   },
   "highlight": {
     "pre_tags": [
       "<b>"
     ],
     "fields": {
       "description": {
         "order": "score"
       },
       "title": {
         "order": "score"
       }
     },
     "post_tags": [
       "</b>"
     ]
   }
 }
  2. If everything works, the command will return the following result:
{
  ...truncated...
   "hits" : {
     "total" : 1,
     "max_score" : 1.0126973,
     "hits" : [
       {
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "1",
         "_score" : 1.0126973,
        ...truncated...
         "highlight" : {
           "description" : [
             "<b>Joe</b> Testere nice guy"
           ],
           "title" : [
             "<b>Joe</b> Tester"
           ]
         }
       }
     ]
   }
 }

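As you can see, in the standard results, there is a new highlight field, which contains the highlighted fields in an array of fragments.

How it works...

When the highlight parameter is passed to the search object, Elasticsearch tries to execute the highlight on the document results.

The highlighting phase, which comes after the document fetch phase, tries to extract the highlight using the following steps: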

  1. It collects the terms that are available in the query
  2. It initializes the highlighter with the parameters given during the query
  3. It extracts the fields of interest and tries to load them if they are stored; otherwise, they are taken from the source
  4. It executes the query on single fields to detect the more relevant parts
  5. It adds the found highlighted fragments to the hit

Using the highlighting functionality is very easy, but there are some important factors to pay attention to:

  • The field that must be used for highlighting must be available in one of these forms: stored, in source, or in stored term vector
The Elasticsearch highlighter checks the presence of the data field first as the term vector (this is the fastest way to execute the highlighting). If the field does not use the term vector (a special indexing parameter that allows you to store additional positional text data in the index), it tries to load the field value from the stored fields. If the field is not stored, it finally loads the JSON source, interprets it, and extracts the data value, if available. Obviously, the last approach is the slowest and most resource-intensive.
  • If a special analyzer is used in the search, it should also be passed to the highlighter (this is often automatically managed)

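When executing highlighting on a large number of fields, you can use a wildcard to select multiple fields (for example, title*).

The common properties used to control highlighting on a field are as follows: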

  • order: This defines the matched fragments selection order.
  • force_source: This skips the term vector or stored field and takes the field from the source (false default).
  • type (optional, valid values are unified, plain, and fvh): This is used to force a specific highlight type.
  • number_of_fragments: The default value is 5. This parameter controls how many fragments are returned. It can be configured globally or for a field.
  • fragment_size: The default value is 100. This is the number of characters that the fragments must contain. It can be configured globally or for a field.

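There are several optional parameters that can be passed in the highlight object to control the highlighting markup, and they are as follows: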

  • pre_tags/post_tags: A list of tags to be used for marking the highlighted text.
  • tags_schema="styled": This allows the user to define a tag schema that marks highlighting with different tags with ordered importance. This is a helper to reduce the definition of a lot of pre_tags/post_tags tags.
  • encoder: The default value is default, which applies no encoding. If this is set to html, it will escape HTML tags in the fragments.
  • require_field_match: The default value is true. If this is set to false, it also allows highlighting on fields that don't match the query.
  • boundary_chars: This is a list of characters that are used for phrase boundaries (for example, ,;:/).
  • boundary_max_scan: The default value is 20. This controls how many characters the highlighting must scan for boundaries in a match. It's used to provide better fragment extraction.
  • matched_fields: This allows the user to combine multi-fields to execute the highlighting. This is very useful if the field that you use for highlighting is a multi-field that's been analyzed with different analyzers (such as standard, linguistic, and so on). It can only be used when the highlighter is a Fast Vector Highlighter (FVH). An example of this usage could be as follows:
{
  "query": {
    "query_string": {
       "query": "content.plain:some text",
       "fields": [
         "content"
       ]
     }
   },
   "highlight": {
     "order": "score",
     "fields": {
       "content": {
         "matched_fields": [
           "content",
           "content.plain"
         ],
         "type": "fvh"
       }
     }
   }
 }

See also

Executing a scrolling query

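Every time a query is executed, the results are calculated and returned to the user in real time. In Elasticsearch, there is no deterministic order for records: paginating over a big block of values can lead to inconsistencies between results, due to added and deleted documents and to documents that have the same score.

The scrolling query tries to resolve this kind of problem by providing a special cursor that allows the user to iterate over all the documents exactly once.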

Getting ready

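You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a scrolling query, we will perform the following steps:

  1. From the command line, we can execute a search with the scroll parameter, as follows: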
GET /mybooks/_search?scroll=10m&size=1
 {
   "query": {
     "match_all": {}
   }
 }
  2. If everything works, the command will return the following result:
{
   "_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAHdMUWNHBwdFp4NGpTTS14Y3BpVlRfZDdSdw==",
  ...truncated...
   "hits" : {
     "total" : 3,
     "max_score" : 1.0,
     "hits" : [
       {
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "1",
         "_score" : 1.0,
        ...truncated...
       }
     ]
   }
 }
  3. The result is composed of the following:
  • _scroll_id: The value to be used for scrolling records
  • took: The time required to execute the query
  • timed_out: Whether the query timed out
  • _shards: Information about the status of the shards during the query
  • hits: An object that contains the total count and the result hits
  4. With a scroll_id, you can use scroll to get the results, as follows:
POST /_search/scroll
 {
     "scroll" : "10m",
     "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAHdMUWNHBwdFp4NGpTTS14Y3BpVlRfZDdSdw=="
 }
  5. The result should be something similar to the following:
{
   "_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAHdMUWNHBwdFp4NGpTTS14Y3BpVlRfZDdSdw==",
  ...truncated...
   "hits" : {
     "total" : 3,
     "max_score" : 1.0,
     "hits" : [
       {
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "2",
         "_score" : 1.0,
        ...truncated...
       }
     ]
   }
 }
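For the most curious readers, the scroll_id is a Base64-encoded value that contains information about the query type and an internal ID. In our case, DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAHdMUWNHBwdFp4NGpTTS14Y3BpVlRfZDdSdw== contains the strings queryAndFetch and 4pptZx4jSM-xcpiVT_d7Rw.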

How it works...

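The scrolling query is interpreted as a standard search. This kind of search is designed to iterate over a large set of results, so the score and the order are not computed.

During the query phase, every shard stores the state of the IDs in memory until the timeout. A scrolling query is processed in the following way:

  1. The first part executes a query and returns a scroll_id used to fetch the results.
  2. The second part executes the document scrolling. You iterate the second step, getting the new scroll_id, and fetch other documents.
If you need to iterate over a big set of records, the scrolling query must be used; otherwise, you could get duplicated results.

A scrolling query is executed like any standard query, but there is a special parameter that must be passed in the query string.

The scroll=(your timeout) parameter allows the user to define how long the hits should live. The time can be expressed in seconds using the s suffix (that is, 5s, 10s, 15s, and so on) or in minutes using the m suffix (that is, 5m, 10m, and so on). If you are using a long timeout, you must be sure that your nodes have a lot of RAM to keep the resulting IDs live. This parameter is mandatory and must always be provided.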
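
The two-part flow described above can be sketched as a small Python generator. To keep the example independent of any particular HTTP client, it takes a hypothetical `send(method, path, body)` callable that is assumed to perform the request and return the parsed JSON response; the `scroll_all` name is also an assumption of this sketch.

```python
def scroll_all(send, index, query, keep_alive="10m", size=100):
    """Yield every hit of `query`, following _scroll_id until a page is empty.

    `send(method, path, body)` is a hypothetical transport function that
    returns the response as a dict.
    """
    # First part: open the scroll and get the first page of hits.
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": query})
    # Second part: keep fetching pages with the latest scroll ID.
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = send("POST", "/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})
```

A caller would typically loop with `for hit in scroll_all(send, "mybooks", {"match_all": {}})` and, once done, release the scroll context rather than wait for the timeout.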

There's more...

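Scrolling is very useful for executing reindexing actions or iterating over very large result sets. The best approach for this kind of action is to sort on the special _doc field, which fetches all the matched documents in the most efficient way.

So, if you need to iterate over a large bucket of documents for reindexing, you should execute a query similar to the following: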

GET /mybooks/_search?scroll=10m&size=1
 {
   "query": {
     "match_all": {}
   },
   "sort": [
     "_doc"
   ]
 }

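The scroll result values are kept in memory until the scroll timeout. It's good practice to clean up this memory if you no longer use the scroll; to delete a scroll from Elasticsearch memory, the commands are as follows: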

  • If you know your scroll ID or IDs, you can provide them to the DELETE scroll API call, as follows:
DELETE /_search/scroll
 {
   "scroll_id": [
     "DnF1ZXJ5VGhlbkZldGNoBQAA..."
   ]
 }
  • If you want to clean all the scrolls, you can use the special _all keyword, as follows:
DELETE /_search/scroll/_all

See also

Using the search_after functionality

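Elasticsearch standard pagination using from and size performs very poorly on large datasets, because for every query you need to compute and discard all the results before the from value. Scrolling doesn't have this problem, but it consumes a lot of memory due to its search contexts, so it cannot be used for frequent user queries.

To bypass these problems, Elasticsearch 5.x and later provides the search_after functionality, which allows fast skipping while scrolling through results.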

Getting ready

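You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To use the search_after functionality, we will perform the following steps:

  1. From the command line, we can execute a search that sorts on your values and uses the _doc or _id of the document as the last sort parameter, as follows: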
GET /mybooks/_search
{
   "size": 1,
   "query": {
     "match_all": {}
   },
   "sort": [
     {
       "price": "asc"
     },
     {
       "_doc": "desc"
     }
   ]
 }
  2. If everything works, the command will return the following result:
{
  ...truncated...
   "hits" : {
     "total" : 3,
     "max_score" : null,
     "hits" : [
       {
         "_index" : "mybooks",
         "_type" : "_doc",
         "_id" : "1",
         "_score" : null,
         "_source" : {
           "uuid" : "11111",
           "position" : 1,
           "title" : "Joe Tester",
           "description" : "Joe Testere nice guy",
           "date" : "2015-10-22",
           "price" : 4.3,
           "quantity" : 50
         },
         "sort" : [
           4.3,
           0
         ]
       }
     ]
   }
 }
  3. To use the search_after functionality, you need to keep track of your last sort result, which in this case is [4.3, 0].
  4. To fetch the next result, you must provide the search_after functionality with the last sort value of your last record, as follows:
GET /mybooks/_search
 {
   "size": 1,
   "query": {
     "match_all": {}
   },
   "search_after": [
     4.3,
     0
   ],
   "sort": [
     {
       "price": "asc"
     },
     {
       "_doc": "desc"
     }
   ]
 }

How it works...

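Elasticsearch uses Lucene for indexing data. In Lucene indices, all the terms are sorted and stored in an ordered way, so it's natural for Lucene to be extremely fast at skipping to a term value. This operation is managed in the Lucene core with the skipTo method. This operation doesn't consume memory and, in the case of search_after, a query is built using the search_after values to quickly skip ahead in the Lucene search and speed up result pagination.

The search_after functionality was introduced in Elasticsearch 5.x, and it should be kept in mind as an important way to improve the user experience when scrolling or paginating search results.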
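
The idea of jumping straight past the last seen sort values can be illustrated with a sorted Python list. This is only an analogy, not what Lucene does internally: `bisect_right` stands in for the skipping step, the `page_after` function is hypothetical, and every sort key other than (4.3, 0) is invented.

```python
from bisect import bisect_right

def page_after(sorted_keys, search_after=None, size=1):
    """Return the next `size` sort keys strictly after `search_after`."""
    start = 0 if search_after is None else bisect_right(sorted_keys, search_after)
    return sorted_keys[start:start + size]

# (price, tiebreaker) pairs kept in sort order; a simple ascending
# tiebreaker is used here to keep the sketch short.
keys = sorted([(4.3, 0), (5.0, 1), (6.0, 2)])

first = page_after(keys)                            # no search_after yet
second = page_after(keys, search_after=first[-1])   # continue after [4.3, 0]
print(first, second)
```

Unlike from and size, the cost of fetching a page here does not grow with how deep into the results you already are, because nothing before the search_after point has to be collected and thrown away.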

See also

  • Refer to the Executing a search recipe in this chapter to learn how to structure a search for size pagination and the Executing a scrolling query recipe for scrolling values in a query.

Returning inner hits in results

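In Elasticsearch, when using nested and child documents, we can have complex data models. By default, Elasticsearch returns only the documents that match the searched type, not the nested or child documents that match the query.

The inner_hits function was introduced in Elasticsearch 5.x to provide this functionality.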

Getting ready

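You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To return inner hits during a query, we will perform the following steps: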

  1. From the command line, we can execute a call by adding inner_hits, as follows:
POST /mybooks-join/_search
{
  "query": {
    "has_child": {
      "type": "author",
      "query": {
        "term": {
          "name": "peter"
        }
      },
      "inner_hits": {}
    }
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  ...truncated...
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks-join",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : ...truncated...,
        "inner_hits" : {
          "author" : {
            "hits" : {
              "total" : 1,
              "max_score" : 1.2039728,
              "hits" : [
                {
                  "_index" : "mybooks-join",
                  "_type" : "_doc",
                  "_id" : "a1",
                  "_score" : 1.2039728,
                  "_routing" : "1",
                  "_source" : {
                    "name" : "Peter",
                    "surname" : "Doyle",
                    "join" : {
                      "name" : "author",
                      "parent" : "1"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

How it works...

When executing nested or child queries, Elasticsearch executes a two-step query, as follows:

  1. It executes the nested or children query and returns the IDs of the referred values
  2. It executes the other part of the query filtering by the returned IDs of Step 1

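Generally, the results of the nested or child query are not kept, because they require memory. With inner_hits, the intermediate hits of the nested or child query are kept and returned to the user.

To control the documents returned by inner_hits, the standard search parameters are available, such as from, size, sort, highlight, _source, explain, script_fields, docvalue_fields, and version.

There is also a special name property that is used to name the inner_hits, which allows the user to easily identify it when there are multiple inner_hits sections in the response.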

See also

You can refer to the following points, which are related to this recipe:

  • The Executing a search recipe in this chapter for all the standard parameters in searches for controlling returned hits
  • The Using a has_child query, Using a top_children query, Using a has_parent query, and Using a nested query recipes in Chapter 6, Relationships and Geo Queries, are useful when using queries that can be used for inner hits

Suggesting a correct query

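It is very common for users to make typing errors or to want suggestions for the words they are writing. Elasticsearch addresses these needs with the suggest functionality.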

Getting ready

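You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.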

How to do it...

To suggest relevant terms for a query, we will perform the following steps:

  1. From the command line, we can execute a suggest call, as follows:
GET /mybooks/_search
 {
   "suggest": {
     "suggest1": {
       "text": "we find tester",
       "term": {
         "field": "description"
       }
     }
   }
 }
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  ...truncated...
   "suggest" : {
     "suggest1" : [
       {
         "text" : "we",
         "offset" : 0,
         "length" : 2,
         "options" : [ ]
       },
...
...           {
             "text" : "testere",
             "score" : 0.8333333,
             "freq" : 2
           }
         ]
       }
     ]
   }
 }

The result is composed of the following:

  • The shards' status at the time of the query
  • The list of tokens with their available candidates

How it works...

The suggest section works by collecting term statistics from all the index shards. Using Lucene field statistics, it is possible to detect the correct term or the complete term. This is a statistical approach!

There are two types of suggester, term and phrase:

  • The simplest suggester to use is the term suggester. It requires only the text and the field to work. It also allows the user to set a lot of parameters, such as the minimum size for a word, how to sort the results, and the suggester strategy. A complete reference is available on the Elasticsearch website at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-term.html.
  • The phrase suggester is able to keep the relations between the terms that it needs to suggest. The phrase suggester is less efficient than the term suggester, but it provides better results.

The suggest API's features, parameters, and options often change between releases.

New suggesters can be added using plugins.
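To make the statistical idea concrete, here is a toy term suggester in Python. It is not Lucene's implementation: it simply assumes that candidates are indexed terms within a small edit distance of the input, ranked by distance and then by frequency. The term frequencies are made up for the example.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(token, term_freqs, max_edits=2):
    """Return candidate terms close to `token`, nearest and most frequent first."""
    if token in term_freqs:
        return []  # the token is already an indexed term: nothing to suggest
    candidates = []
    for term, freq in term_freqs.items():
        distance = edit_distance(token, term)
        if distance <= max_edits:
            candidates.append((distance, -freq, term))
    return [term for _, _, term in sorted(candidates)]


# Hypothetical term frequencies for a "description" field.
freqs = {"testere": 2, "nice": 3, "guy": 3, "joe": 1}
print(suggest("tester", freqs))
```

The real term suggester exposes far more control (minimum word length, sort order, suggest mode), which is why short or already-indexed tokens can return an empty options list.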

See also

You can refer to the following points related to this recipe:

Counting matched results

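It is often required to return only the count of the matched results, and not the results themselves.

There are a lot of scenarios involving counting, such as the following: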

  • To return the number of something (how many posts for a blog, how many comments for a post).
  • Validating whether some items are available. Are there posts? Are there comments?

Getting ready

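You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.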

How to do it...

To execute a counting query, we will perform the following steps:

  1. From the command line, we will execute a count query, as follows:
GET /mybooks/_count
{
  "query": {
    "match_all": {}
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  "count" : 3,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

The result is composed of the count result (a long type) and the shard status at the time of the query.

How it works...

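The query is interpreted in the same way as for a search. The count action is processed and distributed to all the shards, where it is executed as a low-level Lucene count call. Every hit shard returns a count, and these counts are aggregated and returned to the user.

In Elasticsearch, counting is faster than searching. When the source hits of the results are not required, it is good practice to use the count API, because it is faster and requires fewer resources.

The HTTP method to execute a count is GET (but POST also works), and the REST endpoints are as follows: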

http://<server>/_count
http://<server>/<index_name(s)>/_count

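Multiple indices are separated by commas. If indices are defined, the count is limited to them. An alias can be used as an index name.

Typically, a body is used to express the query, but for simple queries, the q (query argument) parameter can be used. For example, look at the following code: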

GET /mybooks/_count?q=uuid:11111
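As a small sketch of the endpoint forms above, the following Python helper assembles a _count path from optional index names and an optional q parameter. The function name is hypothetical and no request is sent; the helper only builds the path.

```python
from urllib.parse import quote, urlencode


def count_path(indices=None, q=None):
    """Return the _count path for zero or more indices and an optional q parameter."""
    parts = []
    if indices:
        # Several indices (or aliases) are joined with commas.
        parts.append(",".join(quote(name, safe="*") for name in indices))
    parts.append("_count")
    path = "/" + "/".join(parts)
    if q is not None:
        path += "?" + urlencode({"q": q})  # percent-encodes characters such as ":"
    return path


print(count_path())
print(count_path(["mybooks"], q="uuid:11111"))
print(count_path(["mybooks", "mybooks-join"]))
```

The colon in the q value is percent-encoded as %3A, which is equivalent to the unencoded form typed in the Kibana console.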

There's more...

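In previous Elasticsearch versions, the count API call (the _count REST entry point) was implemented as a custom action, but this was removed in Elasticsearch 5.x and later. Internally, the count API is implemented as a standard search with size set to 0.

This trick not only speeds up the search, but also reduces network traffic. You can use the same approach to execute aggregations (we will see them in Chapter 7, Aggregations) without returning hits.

The previous query can also be executed as follows: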

GET /mybooks/_count?q=uuid:11111

The result returned by Elasticsearch, if everything works, should be as follows:

{
  "count" : 1,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

The count result (a long type) is also available in the standard _search result as hits.total.
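A minimal Python sketch of reading that value from a search response follows. It assumes two possible shapes for hits.total: the plain integer shown in this chapter's outputs, and the object form with value and relation keys that Elasticsearch 7 returns by default. The helper name is hypothetical.

```python
def total_hits(search_response):
    """Return the hit count from either the integer or the object form of hits.total."""
    total = search_response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


# Body for a search that returns only the count (and any aggregations), no hits.
count_only_search = {"size": 0, "query": {"match_all": {}}}

print(total_hits({"hits": {"total": 3, "hits": []}}))
print(total_hits({"hits": {"total": {"value": 3, "relation": "eq"}, "hits": []}}))
```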

See also

You can refer to the following points related to this recipe:

  • The Executing a search recipe in this chapter on using size to paginate
  • Chapter 7, Aggregations, on how to use the aggregations

Explaining a query

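When executing searches, it is very common that documents do not match the query as expected. To easily debug these scenarios, Elasticsearch provides the explain query call, which allows you to check how the scores are computed against a document.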

Getting ready

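You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.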

How to do it...

The steps required to execute the explain query call are as follows:

  1. From the command line, we will execute an explain query against a document, as follows:
GET /mybooks/_doc/1/_explain?pretty
{
  "query": {
    "term": {
      "uuid": "11111"
    }
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  "_index" : "mybooks",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 0.9808292,
    "description" : "weight(uuid:11111 in 0) [PerFieldSimilarity], result of:",
    "details" : [
...
...
              {
                "value" : 3,
                "description" : "N, total number of documents with field",
                "details" : [ ]
              }
            ]
          },
         ...truncated...
}

The important parts of the result are the following:

  • matched: Whether the document matches the query
  • explanation: This section is composed of objects made of the following:
    • value: A double score of that query section
    • description: A string representation of the matching token (in case of wildcards or multi-terms, it can give information about the matched token)
    • details: An optional list of explanation objects

How it works...

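The explain call is a view of how Lucene computes the results. In the description section of each explanation object, there is the Lucene representation of that part of the query.

A user doesn't need to be a Lucene expert to understand the explain descriptions; they give a high-level view of how the query was executed and how the terms were matched.

More complex queries with many sub-queries are harder to debug, mainly if you need to boost some special fields to obtain the desired document ordering. In these cases, the explain API helps you to manage field boosting, because it lets you easily debug how the boosts interact in your query or document.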
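Because an explanation is a tree of value, description, and details objects, it can be flattened with a short recursive walk. The Python sketch below uses a hand-written explanation shaped like the response above; the values are illustrative, not real output.

```python
def flatten_explanation(node, depth=0):
    """Yield (depth, value, description) for a node and all of its details."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)


# Hand-written explanation with the same shape as an _explain response.
explanation = {
    "value": 0.98,
    "description": "weight(uuid:11111 in 0) [PerFieldSimilarity], result of:",
    "details": [
        {"value": 3, "description": "N, total number of documents with field", "details": []}
    ],
}

rows = list(flatten_explanation(explanation))
for depth, value, description in rows:
    print("  " * depth + f"{value}: {description}")
```

Printing one indented line per node makes it easier to see which sub-query or boost contributes most to the final score.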

Query profiling

This feature is available from Elasticsearch 5.x onward via the profile API. It allows the user to track the time that Elasticsearch spends executing a search or an aggregation.

Getting ready

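You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.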

How to do it...

The steps to profile a query are as follows:

  1. From the command line, we will execute a search with profile set to true, as follows:
GET /mybooks/_search
{
  "profile": true,
  "query": {
    "term": {
      "uuid": "11111"
    }
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  ...truncated...
  "profile" : {
    "shards" : [
      {
        "id" : "[4pptZx4jSM-xcpiVT_d7Rw][mybooks][0]",
        "searches" : [
...
...

            ],
            "rewrite_time" : 5954,
            "collector" : [
              {
                "name" : "CancellableCollector",
                "reason" : "search_cancelled",
                "time_in_nanos" : 204857,
                "children" : [
                  {
                    "name" : "SimpleTopScoreDocCollector",
                    "reason" : "search_top_hits",
                    "time_in_nanos" : 12288
                  }
                ]
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
  }
}

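The output is very verbose. It is divided by shard and by single hit.

The result exposes the type of query (for example, TermQuery) with details of the internal Lucene parameters. For every step, the time is tracked so that the user can easily detect bottlenecks in the query time.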

How it works...

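The profile API was introduced in Elasticsearch 5.x to track the time spent executing queries and aggregations. When a query is executed with profiling activated, all the internal calls are tracked using an internal instrumentation API. For this reason, the profile API adds overhead to the computation.

The output is also very verbose and depends on the internal components of both Elasticsearch and Lucene, so the format of the result may change in the future. The typical use of this feature is to reduce execution time by tracking down the slowest steps in a query and trying to optimize them.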
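Finding the slowest step can be automated by walking the profile section. The following Python sketch assumes the usual layout (shards, then searches, then query and collector trees with time_in_nanos and optional children); the TermQuery timing in the sample is invented, because that part of the output is elided above.

```python
def iter_timings(nodes, label_key):
    """Yield (label, time_in_nanos) for each node and, recursively, its children."""
    for node in nodes:
        yield node[label_key], node["time_in_nanos"]
        yield from iter_timings(node.get("children", []), label_key)


def slowest_component(response):
    """Return the (label, nanos) pair with the largest time across all shards."""
    timings = []
    for shard in response["profile"]["shards"]:
        for search in shard["searches"]:
            timings.extend(iter_timings(search.get("query", []), "type"))
            timings.extend(iter_timings(search.get("collector", []), "name"))
    return max(timings, key=lambda item: item[1])


# Hand-written sample; the TermQuery entry is hypothetical.
sample = {"profile": {"shards": [{"searches": [{
    "query": [{"type": "TermQuery", "time_in_nanos": 50000}],
    "collector": [{
        "name": "CancellableCollector",
        "time_in_nanos": 204857,
        "children": [{"name": "SimpleTopScoreDocCollector", "time_in_nanos": 12288}],
    }],
}]}]}}

print(slowest_component(sample))
```

Note that a parent's time includes the time of its children, so the top-level entries usually win; comparing siblings at the same depth is often more informative.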

Deleting by query

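We saw how to delete a document in the Deleting a document recipe in Chapter 3, Basic Operations. Deleting a document is very fast, but you need to know the document ID for direct access and, in some cases, the routing value too.

Elasticsearch provides a call to delete all the documents that match a query, using an additional module called reindex, which is installed by default.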

Getting ready

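You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.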

How to do it...

To delete by query, we will perform the following steps:

  1. From the command line, we will execute a query, as follows:
POST /mybooks/_delete_by_query?pretty
{
  "query": {
    "match_all": {}
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  "took" : 10,
  "timed_out" : false,
  "total" : 3,
  "deleted" : 3,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

The main components of the result are as follows:

  • total: The number of documents that match the query
  • deleted: The number of documents deleted
  • batches: The number of bulk actions executed to delete the documents
  • version_conflicts: The number of documents not deleted due to a version conflict during the bulk action
  • noops: The number of documents skipped because of a noop event
  • retries.bulk: The number of bulk actions that are retried
  • retries.search: The number of searches that are retried
  • requests_per_second: The number of requests executed per second (-1.0 if this value is not set)
  • throttled_millis: The time spent sleeping to conform to the requests_per_second value
  • throttled_until_millis: This is generally 0, and it indicates the time for the next request if the requests_per_second value is set
  • failures: An array of failures

How it works...

The delete_by_query function is executed automatically using the following steps:

  1. In a master node, the query is executed and the results are scrolled.
  2. For every bulk size element (default 1,000), a bulk is executed.
  3. The bulk results are checked for conflicts. If no conflicts exist, a new bulk is executed, until all the matched documents are deleted.

The delete_by_query call automatically manages back pressure (it reduces the rate of the delete commands if the server is under high load).
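The scroll-then-bulk loop described in the steps above can be imitated with a toy, in-memory Python model. It is only a sketch of the batching logic: the index is a plain dictionary, and version conflicts, retries, and back pressure are deliberately left out.

```python
def delete_by_query(docs, matches, scroll_size=1000):
    """Delete matching documents from `docs` (a dict of id -> source) in batches."""
    matched_ids = [doc_id for doc_id, source in docs.items() if matches(source)]
    summary = {"total": len(matched_ids), "deleted": 0, "batches": 0}
    # Walk the matches one scroll page at a time; each page becomes one bulk.
    for start in range(0, len(matched_ids), scroll_size):
        page = matched_ids[start:start + scroll_size]
        for doc_id in page:
            del docs[doc_id]
        summary["deleted"] += len(page)
        summary["batches"] += 1
    return summary


# 2,500 fake documents; delete those with an even counter.
docs = {str(i): {"n": i} for i in range(2500)}
summary = delete_by_query(docs, lambda source: source["n"] % 2 == 0)
print(summary, len(docs))
```

With 1,250 matches and the default page size of 1,000, the model needs two batches, which mirrors how the batches field of the real response relates to total and scroll_size.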

When you want to remove all the documents without re-creating the index, a delete_by_query with a match_all query allows you to clean all the documents out of your index. This call is analogous to TRUNCATE TABLE in the SQL language.

The HTTP method to execute a delete_by_query command is POST; the REST endpoints are as follows:

http://<server>/_delete_by_query
http://<server>/<index_name(s)>/_delete_by_query

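Multiple indices are defined as a single comma-separated string. If indices are defined, the search is limited to them. An alias can be used as an index name.

Typically, a body is used to express the query, but for simple queries, the q (query argument) parameter can be used. For example, look at the following code:

POST /mybooks/_delete_by_query?q=uuid:11111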

There's more...

Further query arguments are as follows:

  • conflicts: If it is set to proceed, the call doesn't exit when there is a version conflict; it skips the error and finishes the execution.
  • routing: This is used to target only some shards.
  • scroll_size: This controls the size of the scrolling and the bulk (default 1000).
  • requests_per_second (default -1.0): This controls how many requests can be executed in a second. The default value is unlimited.
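As an illustration of how these arguments end up on the URL, here is a small Python helper that appends only the parameters that were set. The helper name is hypothetical, and the throttle parameter is spelled requests_per_second, matching the field in the response.

```python
from urllib.parse import urlencode


def delete_by_query_path(index, conflicts=None, routing=None,
                         scroll_size=None, requests_per_second=None):
    """Build the _delete_by_query path with the optional query arguments."""
    params = {
        "conflicts": conflicts,
        "routing": routing,
        "scroll_size": scroll_size,
        "requests_per_second": requests_per_second,
    }
    # Keep only the arguments the caller actually set.
    query = urlencode({key: value for key, value in params.items() if value is not None})
    path = f"/{index}/_delete_by_query"
    return f"{path}?{query}" if query else path


print(delete_by_query_path("mybooks"))
print(delete_by_query_path("mybooks", conflicts="proceed", scroll_size=500))
```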

See also

You can refer to the following points related to this recipe:

  • The Deleting a document recipe in Chapter 3, Basic Operations, is useful for executing a delete for a single document
  • The Delete by query task recipe in Chapter 9, Managing Clusters, is useful for monitoring asynchronous delete by query actions

Updating by query

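In the previous chapter, we saw how to update a document in the Update a document recipe.

The update_by_query API call allows the user to execute the update on all the documents that match a query. It is very useful if you need to do the following:

  • Reindex a subset of your records that match a query. This is common if you change your document mapping and need the documents to be reprocessed.
  • Update values of your records that match a query.

It is the Elasticsearch version of the SQL update command.

This functionality is provided by an additional module called reindex, which is installed by default.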

Getting ready

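You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.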

How to do it...

To execute an update by query, we will perform the following steps:

  1. From the command line, we will execute a query, as follows:
POST /mybooks/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": "ctx._source.quantity=50"
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  "took" : 7,
  "timed_out" : false,
  "total" : 3,
  "updated" : 3,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

The most important components of the result are as follows:

  • total: The number of documents that match the query
  • updated: The number of documents updated
  • batches: The number of bulk actions executed to update the documents
  • version_conflicts: The number of documents not updated due to a version conflict during the bulk action
  • noops: The number of documents not changed due to a noop event
  • retries.bulk: The number of bulk actions that are retried
  • retries.search: The number of searches that are retried
  • requests_per_second: The number of requests executed per second (-1.0 if this value is not set)
  • throttled_millis: The time spent sleeping to conform to the requests_per_second value
  • throttled_until_millis: This is generally 0, and it indicates the time for the next request if the requests_per_second value is set
  • failures: An array of failures

How it works...

The update_by_query function works in a very similar way to the delete_by_query API, and is executed automatically using the following steps:

  1. In a master node, the query is executed and the results are scrolled.
  2. For every bulk size element (default 1,000), a bulk with the update commands is executed.
  3. The bulk results are checked for conflicts. If there are no conflicts, a new search and bulk are executed, until all the matched documents are updated.

The HTTP method to execute an update_by_query is POST, and the REST endpoints are as follows:

http://<server>/_update_by_query
http://<server>/<index_name(s)>/<type_name(s)>/_update_by_query

Multiple indices are defined as a comma-separated string. If indices or types are defined, the search is limited to them. An alias can be used as an index name.

Additional query arguments are as follows:

  • conflicts: If it is set to proceed, the call doesn't exit when there is a version conflict; it skips the error and finishes the execution.
  • routing: This is used to target only some shards.
  • scroll_size: This controls the size of the scrolling and the bulk (the default size is 1000).
  • requests_per_second (default -1.0): This controls how many requests can be executed in a second. The default value is unlimited.

There's more...

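The update_by_query API accepts a script section in its body. In this way, it can be a powerful tool for executing custom updates on a subset of documents (we will look at scripting in detail in Chapter 8, Scripting in Elasticsearch). It can be considered similar to the SQL update command.

With this functionality, we can add a new field and initialize its value with a script, as follows: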

POST /mybooks/_update_by_query
{
  "script": {
    "source": "ctx._source.hit=4"
  },
  "query": {
    "match_all": {}
  }
}

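In the preceding example, we add a hit field set to 4 to every document that matches the query. This is similar to the following SQL command:

update mybooks set hit=4

The update_by_query API is one of the more powerful tools that Elasticsearch provides.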
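When the value to write changes between calls, it can be passed through the script's params section instead of being embedded in the source string. The Python sketch below builds such a body; the helper name is hypothetical, and it assumes the field name is a simple identifier that is safe to place in the Painless source.

```python
import json


def set_field_request(field, value, query=None):
    """Build an update_by_query body that sets `field` to `value` on matching documents."""
    return {
        "script": {
            "lang": "painless",
            # The value travels in params, so the source text stays the same across calls.
            "source": f"ctx._source.{field} = params.value",
            "params": {"value": value},
        },
        "query": query or {"match_all": {}},
    }


body = set_field_request("hit", 4)
print(json.dumps(body, indent=2))
```

The resulting body could be sent with POST /mybooks/_update_by_query, exactly like the hand-written example above.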

See also

Matching all the documents

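The match_all query is one of the most common queries. It allows the user to return all the documents that are available in an index. match_all and the other query operators are part of the Elasticsearch query DSL.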

Getting ready

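You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.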

How to do it...

To execute a match_all query, we will perform the following steps:

  1. From the command line, we execute the query as follows:
POST /mybooks/_search
{
  "query": {
    "match_all": {}
  }
}
  2. The result returned by Elasticsearch, if everything works, should be as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "date" : "2015-10-22",
          "hit" : 4,
          "quantity" : 50,
          "price" : 4.3,
          "description" : "Joe Testere nice guy",
          "position" : 1,
          "title" : "Joe Tester",
          "uuid" : "11111"
        }
      },
      ...truncated...
    ]
  }
}

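The result is a standard query result, as we saw in the Executing a search recipe in this chapter.

How it works...

The match_all query is one of the most common queries. It is faster because it doesn't require score calculation (it is wrapped in a Lucene ConstantScoreQuery).

If no query is defined in the search object, match_all is the default query.

See also

Refer to the Executing a search recipe in this chapter for more references.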

Using a Boolean query

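Most people who use a search engine have at some time used the minus (-) and plus (+) syntax to include or exclude query terms. The Boolean query allows the user to programmatically define queries to include, exclude, optionally include (should), or filter in the query.

This kind of query is one of the most important, because it allows the user to combine many of the simple queries and filters that we will see in this chapter to build a big, complex one.

Two main concepts are important in searches: query and filter. With a query, the matched results are scored using the internal Lucene scoring algorithm; with a filter, the matched results are not scored. Because a filter doesn't need to compute the score, it is generally faster and can be cached.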

Getting ready

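You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.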

How to do it...

To execute a Boolean query, we will perform the following steps:

  1. We can execute a Boolean query from the command line as follows:
POST /mybooks/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "description": "joe"
          }
        }
      ],
...
...
      "filter": [
        {
          "term": {
            "description": "joe"
          }
        }
      ],
      "minimum_should_match": 1,
      "boost": 1
    }
  }
}
  2. The result returned by Elasticsearch is similar to the previous recipes, but in this case, it should return one record (id:1).

How it works...

The bool query is often one of the most used, because it allows the user to compose a large query from many simpler ones. At least one of the following four sections is mandatory:

  • must: A list of queries that must be satisfied. All the must queries must be verified to return the hits. It can be seen as an AND filter with all its sub-queries.
  • must_not: A list of queries that must not be matched. It can be seen as a NOT filter applied to an AND query.
  • should: A list of queries that can be verified. The minimum number of these queries that must be verified is controlled by minimum_should_match (the default is 1 when the bool query has no must or filter section, and 0 otherwise).
  • filter: A list of queries to be used as the filter. They allow the user to filter out results without changing the score and relevance. The filter queries are faster than standard ones because they don't need to compute the score.
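The matching semantics of the four sections can be mimicked with a toy evaluator over in-memory documents. This Python sketch ignores scoring entirely and treats each clause as a predicate; the default for minimum_should_match (1 only when there is no must or filter clause) is stated here as an assumption about the real behavior, and all names are hypothetical.

```python
def has_term(field, term):
    """Return a predicate that is true when `term` is a whitespace-separated token of `field`."""
    return lambda doc: term in doc[field].split()


def bool_matches(doc, must=(), must_not=(), should=(), filters=(),
                 minimum_should_match=None):
    """Return True if `doc` satisfies the clauses, each given as a predicate."""
    if minimum_should_match is None:
        # Assumed default: 1 when should is the only positive section, else 0.
        minimum_should_match = 0 if (must or filters) else 1
    if not all(clause(doc) for clause in list(must) + list(filters)):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    if should and matched_should < minimum_should_match:
        return False
    return True


# Made-up documents, loosely modelled on the mybooks index.
docs = [
    {"uuid": "11111", "description": "joe testere nice guy"},
    {"uuid": "22222", "description": "bill testere nice guy"},
    {"uuid": "33333", "description": "nice guy"},
]
hits = [
    doc["uuid"] for doc in docs
    if bool_matches(
        doc,
        must=[has_term("description", "joe")],
        should=[has_term("uuid", "11111"), has_term("uuid", "22222")],
        filters=[has_term("description", "joe")],
        minimum_should_match=1,
    )
]
print(hits)
```

Only the first document survives: the second fails the must and filter clauses, and the third fails both those and the should requirement.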

There's more...

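If you define several sub-queries in your bool query, it can be very important at the application level to know which of them a result matched; this often helps when refining the results. To obtain this information, you can use the special _name property, which can be defined in a query component.

The previous query can be changed in this way: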

POST /mybooks/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "uuid": {
              "value": "11111",
              "_name": "uuid:11111:matched"
            }
          }
        },
        {
          "term": {
            "uuid": {
              "value": "22222",
              "_name": "uuid:22222:matched"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "description": {
              "value": "joe",
              "_name": "fiter:term:joe"
            }
          }
        }
      ],
      "minimum_should_match": 1,
      "boost": 1
    }
  }
}

For every matched document, the result will contain the queries that matched:

{
  ...truncated...
  "hits" : {
    "total" : 1,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9808292,
        ...truncated...
        "matched_queries" : [
          "uuid:11111:matched",
          "fiter:term:joe"
        ]
      }
    ]
  }
}
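On the application side, the matched_queries list can be collected per document with a few lines of Python. The response below is a trimmed, hand-copied version of the output above, and the helper name is hypothetical.

```python
def matched_by_document(response):
    """Map each hit's _id to the list of named queries that it matched."""
    return {
        hit["_id"]: hit.get("matched_queries", [])
        for hit in response["hits"]["hits"]
    }


# Trimmed copy of the response shown above.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "mybooks",
                "_id": "1",
                "matched_queries": ["uuid:11111:matched", "fiter:term:joe"],
            }
        ],
    }
}

print(matched_by_document(response))
```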

Using the search template

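Elasticsearch provides the ability to supply a template and some parameters to fill it. This functionality is very useful because it lets you manage query templates stored in the .scripts index and change them without changing the application code.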

Getting ready

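You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, because it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.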

How to do it...

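A template query is composed of two parts: the query and the parameters that must be filled in. We can execute a template query in several ways; in this recipe, we will use some query types that we will explore in the next chapter.

Using the new REST entry point, _search/template, is the best way to use templates. To use it, perform the following steps: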

  1. We execute the query as follows:
POST /_search/template
{
  "source": {
    "query": {
      "term": {
        "uuid": "{{value}}"
      }
    }
  },
  "params": {
    "value": "22222"
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9808292,
        "_source" : {
          "uuid" : "22222",
          "position" : 2,
          "title" : "Bill Baloney",
          "description" : "Bill Testere nice guy",
          "date" : "2016-06-12",
          "price" : 5,
          "quantity" : 34
        }
      }
    ]
  }
}

If we want to use an indexed, stored template, the steps are as follows:

  1. We store the template in the .scripts index:
POST _scripts/myTemplate
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "term": {
          "uuid": "{{value}}"
        }
      }
    }
  }
}
  2. Now, we can call the template with the following code:
POST /mybooks/_search/template
{
  "id": "myTemplate",
  "params": {
    "value": "22222"
  }
}

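If you have a stored template and you want to validate it, you can use the REST render entry point.

Indexed templates and scripts are stored in the .scripts index. This is a normal index and can be managed as a standard data index.

To render a query template, mainly for debugging purposes, perform the following step: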

  1. We render the template using the _render/template REST:
POST /_render/template
{
  "id": "myTemplate",
  "params": {
    "value": "22222"
  }
}

The result is as follows:

{
  "template_output" : {
    "query" : {
      "term" : {
        "uuid" : "22222"
      }
    }
  }
}

How it works...

A template query is composed of the following two components:

  • A template is a query object that is supported by Elasticsearch. The template uses the mustache (http://mustache.github.io/) syntax, a very common syntax to express templates.
  • An optional dictionary of parameters that is used to fill the template.

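When the search query is called, the template is loaded, populated with the parameter data, and executed as a normal query. The template query is a shortcut that lets you use the same query with different values.

Typically, a template is generated by executing the query in the standard way and then adding parameters where required during templating. The mustache syntax is very rich and provides default values, JSON escaping, conditional sections, and so on (the official documentation at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-template.html covers all these aspects).

It allows you to move the query out of your application code and put it on the filesystem or in an index.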
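To illustrate what populating the template means, here is a toy renderer in Python that fills simple {{name}} placeholders. It is not the mustache engine that Elasticsearch uses: it assumes placeholders appear only inside JSON string values and supports none of the richer features (defaults, sections, and so on).

```python
import json
import re


def render(template, params):
    """Fill {{name}} placeholders in a JSON-serialisable template."""
    text = json.dumps(template)

    def substitute(match):
        # json.dumps(...)[1:-1] escapes the value for use inside a JSON string.
        return json.dumps(str(params[match.group(1)]))[1:-1]

    return json.loads(re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, text))


template = {"query": {"term": {"uuid": "{{value}}"}}}
print(render(template, {"value": "22222"}))
```

Rendering the template with value set to 22222 yields the same term query that the _render/template call returns in template_output.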

See also

You can refer to the following points related to this recipe: