You need a working Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.
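To execute the commands, you can use any HTTP client, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or something similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.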
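To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.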
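The mapping used in all the queries and searches in this chapter is similar to the following book representation:
The command to create the mapping is as follows:
To execute a search and view the results, we will perform the following steps: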
- From the command line, we can execute a search as follows:
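In this case, we have used a match_all query, which returns all the documents. We'll discuss this kind of query in the Matching all the documents recipe in this chapter.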
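As a minimal sketch of what such a search call looks like from a script, the following Python snippet builds a match_all request with the standard library only. The server address (localhost:9200) and the index name (mybooks) are assumptions made for illustration, and the request is constructed but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumptions for illustration: a local node and an index called "mybooks".
ES_URL = "http://localhost:9200"
INDEX = "mybooks"


def build_search_request(body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the _search endpoint."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The match_all query matches every document in the index.
match_all_body = {"query": {"match_all": {}}}
request = build_search_request(match_all_body)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run it against a live cluster, uncomment the following lines:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

POST is used here because, as discussed later in this recipe, not every HTTP client can send a body with GET.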
- If everything works, the command will return the following:
The results contain the following information:
- took is the time, in milliseconds, required to execute the query.
- timed_out indicates whether a timeout occurred during the search. This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results.
- _shards is the status of the shards, divided into the following sections:
  - total, which is the number of shards
  - successful, which is the number of shards in which the query was successful
  - skipped, which is the number of shards that were skipped during the search (for example, because a pre-filter step determined that they could not contain matching documents, which typically matters when a search spans a very large number of shards)
  - failed, which is the number of shards in which the query failed because some error or exception occurred during the query
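- hits contains the results, and is composed of the following:
  - total is the number of documents that match the query. In Elasticsearch 7.x, it is an object with a value and a relation (eq or gte).
  - max_score is the highest match score among the hits, that is, the score of the first document when results are ordered by relevance. It is usually 1.0 if no match scoring was computed, for example, in filtering; when sorting on a field, it is null unless scores are tracked.
  - hits, which is a list of result documents.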
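The fields described above can be read with a few lines of Python. The response below is a hand-written, hypothetical example shaped like an Elasticsearch 7.x result (the index name, title, and numbers are made up), and summarize is a helper name introduced here for illustration.

```python
# Hypothetical response, shaped like an Elasticsearch 7.x search result.
sample_response = {
    "took": 3,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {
                "_index": "mybooks",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {"title": "Example title"},
            }
        ],
    },
}


def summarize(response: dict) -> dict:
    """Extract the top-level health and count information from a search response."""
    shards = response["_shards"]
    total = response["hits"]["total"]
    # 7.x returns an object; older versions returned a plain integer.
    total_value = total["value"] if isinstance(total, dict) else total
    return {
        "took_ms": response["took"],
        # Results are complete only if no timeout and no shard failure occurred.
        "complete": not response["timed_out"] and shards["failed"] == 0,
        "total_hits": total_value,
        "returned_hits": len(response["hits"]["hits"]),
    }


summary = summarize(sample_response)
print(summary)
```

Checking timed_out and _shards.failed before trusting the hit count is a cheap way to detect partial results.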
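The resulting documents have a lot of fields that are always available, and others that depend on the search parameters. The most important fields are as follows: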
- _index: The index that contains the document.
- _type: The type of the document (that is, _doc). It will disappear in future ES versions.
- _id: The ID of the document.
- _source: The document source, that is, the original JSON sent to Elasticsearch.
- _score: Query score of the document (if the query doesn't require a score, it's 1.0).
- sort: If the document is sorted, values that are used for sorting.
- highlight: Highlighted segments if highlighting was requested.
- stored_fields: Some fields can be retrieved without needing to fetch the source object.
- script_fields: Some fields that can be computed using scripting.
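Because several of these fields appear only when the corresponding search option is used, client code usually reads them defensively. The following is a small sketch under that assumption; describe_hit and the sample hit are hypothetical names and data introduced for illustration.

```python
def describe_hit(hit: dict) -> dict:
    """Collect the always-present and the optional fields of one result document."""
    return {
        "index": hit["_index"],
        "id": hit["_id"],
        # _score can be null when the results are sorted on a field.
        "score": hit.get("_score"),
        "source": hit.get("_source", {}),
        # Present only if sorting or highlighting was requested.
        "sort": hit.get("sort"),
        "highlight": hit.get("highlight", {}),
    }


# Hypothetical hit from a search sorted on a numeric field.
sample_hit = {
    "_index": "mybooks",
    "_type": "_doc",
    "_id": "1",
    "_score": None,
    "_source": {"title": "Example title"},
    "sort": [1],
}

described = describe_hit(sample_hit)
print(described)
```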
A search in Elasticsearch is a distributed computation composed of many steps; the main ones are as follows:
- Validation of the query body in the master or coordinator nodes
- Selection of the indices to be used in the query; the shards are randomly chosen
- Execution of the query part in the data nodes, which collect the top hits of the query
- Aggregation of the results in the master and coordinator nodes, as well as scoring
- Return the results to the user
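The execute-then-aggregate part of these steps follows a scatter/gather pattern, which can be illustrated with a toy model. This is not Elasticsearch's implementation: shards are plain Python lists of (score, doc_id) pairs, and shard_top_hits and coordinate are hypothetical helper names.

```python
import heapq


def shard_top_hits(shard_docs, size):
    """Query phase on one shard: return the local top `size` (score, doc_id) pairs."""
    return heapq.nlargest(size, shard_docs)


def coordinate(shards, size):
    """Coordinator: gather each shard's top hits and merge them into a global top `size`."""
    partial_results = []
    for shard_docs in shards:
        partial_results.extend(shard_top_hits(shard_docs, size))
    return heapq.nlargest(size, partial_results)


# Three toy shards holding already-scored documents.
shards = [
    [(0.9, "a"), (0.2, "b"), (0.5, "c")],
    [(0.7, "d"), (0.95, "e")],
    [(0.1, "f")],
]

top = coordinate(shards, size=2)
print(top)  # [(0.95, 'e'), (0.9, 'a')]
```

Each shard only needs to return its own top hits, because no document outside a shard's local top `size` can appear in the global top `size`.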
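The following diagram shows how a query is distributed in the cluster:
The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows: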
Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use a POST call.
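Multiple indices and types are comma-separated. If an index or a type is defined, the search is limited to them. One or more aliases can be used as index names.
The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following: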
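As a small sketch of the comma-separated convention, the helper below composes a search URL for zero, one, or several indices (or aliases). The function name search_endpoint, the server address, and the index names are assumptions made for illustration.

```python
from typing import Sequence


def search_endpoint(server: str, indices: Sequence[str] = ()) -> str:
    """Return the _search URL for all indices, or for a comma-separated subset."""
    base = server.rstrip("/")
    if not indices:
        # No index given: the search runs on all indices.
        return f"{base}/_search"
    return f"{base}/{','.join(indices)}/_search"


print(search_endpoint("http://localhost:9200"))
print(search_endpoint("http://localhost:9200", ["mybooks"]))
print(search_endpoint("http://localhost:9200", ["mybooks", "mybooks-alias"]))
```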
- q: This is the query string to perform simple string queries, which can be done as follows:
- df: This is the default field to be used within the query and can be done as follows:
- from (the default value is 0): The start index of the hits.
- size (the default value is 10): The number of hits to be returned.
- analyzer: The default analyzer to be used.
- default_operator (the default value is OR): This can be set to AND or OR.
- explain: This allows the user to return information about how the score is calculated. It can be enabled as follows:
- stored_fields: These allow the user to define fields that must be returned, and can be done as follows:
- sort (the default value is score): This allows the user to change the order of the documents. Sorting is ascending by default; if you need to change the order, add :desc to the field, as follows:
- timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until a timeout. If a timeout is fired, all the hits that have been accumulated are returned.
- search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html.
- track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It's used in conjunction with sort because sorting by default prevents the return of a match score.
- pretty (the default value is false): If true, the results will be pretty-printed.
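The URI parameters above can be assembled safely with the standard library, which also takes care of character escaping. This is a sketch only: the server address, the index name (mybooks), and the field names (description, price) are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical index and field names, used only for illustration.
params = {
    "q": "elasticsearch guide",   # simple query string
    "df": "description",          # default field for the query string
    "from": 0,                    # start index of the hits
    "size": 5,                    # number of hits to return
    "default_operator": "AND",    # all terms must match
    "sort": "price:desc",         # descending sort on a field
    "pretty": "true",             # pretty-print the JSON response
}

# urlencode escapes spaces and colons so that the URL stays valid.
url = f"http://localhost:9200/mybooks/_search?{urlencode(params)}"
print(url)
```

Building the query string from a dictionary avoids hand-escaping mistakes, which is the same reason the Kibana console is recommended earlier in this recipe.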
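Generally, the query contained in the body of the search is a JSON object. The search body is the core of Elasticsearch's search functionality; the list of search capabilities is extended with every release. For the current version of Elasticsearch (7.x), the available parameters are as follows: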
- query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.
- from and size: These allow the user to control pagination. The from parameter defines the start position of the hits to be returned (default 0), and size defines how many hits to return (default 10).
The pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score, or if a new document is ingested. If you need to process all the result documents without repetition, you need to execute scroll queries.
- sort: This allows the user to change the order of the matched documents. This option is fully covered in the Sorting results recipe.
- post_filter: This allows the user to filter out the query results without affecting the aggregation count. It's usually used for filtering by facet values.
- _source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*), or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).
- docvalue_fields (called fielddata_fields in older versions): This allows the user to return the doc value representation of a field.
- stored_fields: This controls the fields to be returned.
Returning only the required fields reduces the network and memory usage, thus improving performance. The suggested way to retrieve custom fields is to use the _source filtering function because it doesn't need to use Elasticsearch's extra resources.
- aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.
- indices_boost: This allows the user to define a per-index boost value. It is used to increase/decrease the score of results in boosted indices.
- highlight: This allows the user to define fields and settings to be used for calculating a query abstract (see the Highlighting results recipe in this chapter).
- version (the default value is false): This adds the version of a document to the results.
- rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.
- min_score: If this is given, all the result documents that have a score lower than this value are filtered out.
- explain: This returns information on how the TF/IDF score is calculated for a particular document.
- script_fields: This defines a script that computes extra fields via scripting to be returned with a hit. We'll look at Elasticsearch scripting in Chapter 8, Scripting in Elasticsearch.
- suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement "did you mean" functionality similar to Google's (see the Suggesting a correct query recipe).
- search_type: This defines how Elasticsearch should process a query. We'll see the scrolling query in the Executing a scrolling query recipe in this chapter.
- scroll: This controls the scrolling in scroll/scan queries. scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.
- _name: This returns, for every hit, the names of the named queries that it matched. It's very useful if you have a Boolean query and you want to know which of its clauses matched.
- search_after: This allows the user to skip results using the most efficient way of scrolling. We'll see this functionality in the Using search_after functionality recipe in this chapter.
- preference: This allows the user to select which shard/s to use for executing the query.
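Several of these body parameters are typically combined in a single request. The sketch below builds such a body as a Python dictionary and serializes it to JSON; the helper name paged_body and the field names (description, title, price) are hypothetical, and only a handful of the parameters listed above are shown.

```python
import json
from typing import Optional, Sequence


def paged_body(query: dict, page: int, page_size: int = 10,
               source_fields: Optional[Sequence[str]] = None) -> dict:
    """Build a search body for the given zero-based page."""
    body = {
        "query": query,
        "from": page * page_size,  # start position of the hits
        "size": page_size,         # number of hits to return
    }
    if source_fields is not None:
        # _source filtering: return only the listed fields.
        body["_source"] = list(source_fields)
    return body


body = paged_body(
    {"match": {"description": "elasticsearch"}},
    page=2,
    source_fields=["title", "price"],
)
body["min_score"] = 0.5   # drop hits scoring below this value
body["version"] = True    # include the document version in each hit

print(json.dumps(body, indent=2))
```

Keep in mind the earlier note on pagination: from/size pages are computed per request, so deep or repeatable traversal is better served by scroll or search_after.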
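To improve the quality of the result scoring, Elasticsearch provides the rescore functionality. This capability allows the user to reorder the top-scored documents with another query that is generally much more expensive (in CPU or time), for example, a query that contains a lot of match queries or scripting. This approach allows the user to execute the rescore query on only a small subset of the results, reducing the overall computation time and resources.
The rescore query, like every query, is executed at the shard level, so it's automatically distributed.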
The best candidates to be executed in the rescore query are complex queries with a lot of nested options, and anything that uses scripting (due to the massive overhead of scripting languages).
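The following example shows how to execute a fast query (a Boolean one) in the first phase, and then rescore it with a match query in the rescore section:
The rescore parameters are as follows: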
- window_size: In the example, this is set to 100. It controls how many results per shard must be considered in the rescore functionality.
- query_weight and rescore_query_weight: Both default to 1.0. These are used to compute the final score using the following formula:
If the user wants to keep only the rescore score, they can set query_weight to 0.
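The rescore request and the weighting described above can be sketched as follows. The field name (description) and the query text are hypothetical, final_score is a helper introduced here for illustration, and the formula assumes the default score mode, in which the two weighted scores are summed.

```python
import json

# Hypothetical field name and query text, used only for illustration.
rescore_body = {
    # First phase: a cheap Boolean query selects the candidates.
    "query": {
        "bool": {"should": [{"term": {"description": "nice"}}]}
    },
    # Second phase: the top window_size hits of each shard are rescored.
    "rescore": {
        "window_size": 100,
        "query": {
            "rescore_query": {
                "match": {
                    "description": {"query": "nice guide", "operator": "and"}
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 1.0,
        },
    },
}


def final_score(query_score: float, rescore_score: float,
                query_weight: float = 1.0,
                rescore_query_weight: float = 1.0) -> float:
    """Combine both scores, assuming the default score mode (weighted sum)."""
    return query_weight * query_score + rescore_query_weight * rescore_score


print(json.dumps(rescore_body, indent=2))
print(final_score(2.0, 3.0))                  # both weights at 1.0
print(final_score(2.0, 3.0, query_weight=0))  # keeps only the rescore score
```

Setting query_weight to 0 in final_score reproduces the case mentioned above, where only the rescore score is kept.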