
Reading Notes on "elasticsearch-7-0-cookbook-fourth-edition": Text and Numeric Queries

Text and Numeric Queries

In this chapter, we will look at the queries used for searching text and numeric values. They are simpler than other queries, and they are also the most commonly used ones in Elasticsearch. The first part of this chapter covers text queries, from the simple term and terms queries up to the complex query string query. We will see how queries are strictly tied to mappings, so that the correct query can be chosen based on the mapping.

In the last part of this chapter, we will look at a number of special queries that cover fields, helpers for building complex queries from strings, and query templates.

In this chapter, we will cover the following recipes:

  • Using a term query
  • Using a terms query
  • Using a prefix query
  • Using a wildcard query
  • Using a regexp query
  • Using span queries
  • Using a match query
  • Using a query string query
  • Using a simple query string query
  • Using the range query
  • The common terms query
  • Using an IDs query
  • Using the function score query
  • Using the exists query

Using a term query

Searching or filtering for a particular term happens frequently. Term queries work with exact value matches and are generally very fast.

The term query can be compared to an equality (=) query in the SQL world.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch05/kibana_commands_005.txt commands, which are available in the online code.

How to do it...

To execute a term query, we will perform the following steps:

  1. We will execute a term query from the command line, as follows:
POST /mybooks/_search
{
  "query": {
    "term": {
      "uuid": "33333"
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.9808292,
        "_source" : {
          "uuid" : "33333",
          "position" : 3,
          "title" : "Bill Klingon",
          "description" : "Bill is not\n nice guy",
          "date" : "2017-09-21",
          "price" : 6,
          "quantity" : 33
        }
      }
    ]
  }
}
  3. For executing a term query as a filter, we need to use it wrapped in a Boolean query. The preceding term query will be executed in the following way:
POST /mybooks/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "uuid": "33333"
        }
      }
    }
  }
}
  4. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.0,
        "_source" : {
          "uuid" : "33333",
          "position" : 3,
          "title" : "Bill Klingon",
          "description" : "Bill is not\n nice guy",
          "date" : "2017-09-21",
          "price" : 6,
          "quantity" : 33
        }
      }
    ]
  }
}

The result is the standard query result, as we saw in the Executing a search recipe in Chapter 4, Exploring Search Capabilities.

How it works...

Thanks to its inverted index, Lucene is one of the fastest engines for searching a term or value in a field. Every field that is indexed in Lucene is converted into a fast search structure appropriate to its type:

  • The text is split into tokens if analyzed, or saved as a single token
  • The numeric fields are converted into their fastest binary representation
  • The date and datetime fields are converted into binary forms

In Elasticsearch, all these conversion steps are managed automatically. Whatever the value, a searched term is archived by Elasticsearch in the correct format for that field.

Internally, during the execution of a term query, all the documents matching the term are collected, and then they are sorted by score (the scoring depends on the similarity algorithm chosen; by default, Lucene uses BM25).

For more details on Elasticsearch similarity algorithms, see https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-similarity.html.

If we look at the results of the previous searches, the hit score was 0.9808292 for the term query, while it was 0.0 for the filter. On a very small sample, the time required for scoring is not that important, but if you have thousands or millions of documents, it takes much more time.

If the score is not important, opt to use the term filter.

Filters are preferred over queries when the score is not important. Typical scenarios are as follows:

  • Filtering permissions
  • Filtering numerical values
  • Filtering ranges
In a filtered query, the filter is applied first, narrowing down the number of documents to be matched against the query, and then the query is applied.

There's more...

Matching a term is the basis of Lucene and Elasticsearch. To correctly use these queries, you need to pay attention to how the field is indexed.

As we saw in Chapter 2, Managing Mapping, the terms of an indexed field depend on the analyzer used to index it. To better understand this concept, the following table shows how a phrase is represented for several analyzers. For the standard string analyzer, given a phrase such as Peter's house is big, the result will be similar to the following table:

Mapping index     | Analyzer         | Tokens
"index": false    | (not indexed)    | (no tokens)
"type": "keyword" | KeywordAnalyzer  | ["Peter's house is big"]
"type": "text"    | StandardAnalyzer | ["peter", "s", "house", "is", "big"]

A common pitfall in searching is related to a misunderstanding of the analyzer or mapping configuration. The KeywordAnalyzer, used as the default for not tokenized fields, saves the string as-is as a single token.

The StandardAnalyzer, the default for type="text" fields, tokenizes on whitespace and punctuation; every token is then converted to lowercase. You should analyze the query with the same analyzer used for indexing (the default setting).

In the preceding example, if the phrase is analyzed with the StandardAnalyzer, you cannot search for the term Peter, but only for peter, because the StandardAnalyzer lowercases the terms.

When the same field requires one or more search strategies, you need to use the fields property using the different analyzers that you need.
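As a minimal sketch of this approach (the index name myindex-multi and the field title are illustrative, not part of the book's sample data), a text field can expose a keyword subfield for exact matching alongside its analyzed form:

PUT /myindex-multi
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

A term query can then target title.raw for exact matches, while match queries use the analyzed title field.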

Using a terms query

The previous type of search is very good for single-term searching. If you want to search for multiple terms, you can process the request in two ways: using a boolean query or using a multi-term query.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a terms query, we will perform the following steps:

  1. We execute a terms query from the command line, as follows:
POST /mybooks/_search
{
  "query": {
    "terms": {
      "uuid": [
        "33333",
        "32222"
      ]
    }
  }
}

  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "uuid" : "33333",
          "position" : 3,
          "title" : "Bill Klingon",
          "description" : "Bill is not\n nice guy",
          "date" : "2017-09-21",
          "price" : 6,
          "quantity" : 33
        }
      }
    ]
  }
}

How it works...

The terms query is related to the previous kind of query; it extends the term query to support multiple values. This call is very useful, because the concept of filtering on multiple values is very common. In traditional SQL, this operation is achieved with the in keyword in the where clause, that is, select * from *** where color in ("red", "green").

In the preceding example, the query searches for documents whose uuid value is 33333 or 22222. The terms query is more than just a helper for multiple term matching; it allows you to define additional parameters to control the query behavior, such as the following:

  • minimum_match/minimum_should_match: This controls how many matched terms are required to validate the query, as follows:
"terms": {
  "color": ["red", "blue", "white"],
  "minimum_should_match":2
}
  • The preceding query matches all the documents where the color field has at least two values among red, blue, and white.
  • boost: This is the standard query boost value used to modify the query weight. This can be very useful if you want to give more relevance to the terms that have been matched to increase the final document score; see the sketch after this list.
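For example, here is a hedged sketch of a terms query with a boost (the index and color field are illustrative):

POST /myindex/_search
{
  "query": {
    "terms": {
      "color": ["red", "blue", "white"],
      "boost": 2.0
    }
  }
}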

There's more...

Because terms filtering is so powerful, to speed up searches, the terms can be fetched from another document at query time.

This is a very common scenario. For example, suppose a user document contains the list of groups associated with the user, and you want to filter the documents that can only be seen by certain groups. The pseudo-code should look as follows:

GET /my-index/document/_search
{
  "query": {
    "terms": {
      "can_see_groups": {
        "index": "my-index",
        "type": "user",
        "id": "1bw71LaxSzSp_zV6NB_YGg",
        "path": "groups"
      }
    }
  }
}

In the preceding example, the list of groups is fetched at runtime from a document (always identified by an index, a type, and an ID) and from the path (field) that contains the values to be used. The routing parameter is also supported.

Using a terms query with a huge number of terms can be very slow. To prevent this, there is a limit of 65536 terms. If needed, this value can be raised by setting the index setting index.max_terms_count.

This is a pattern similar to the following SQL example:

select * from xxx where can_see_group in (select groups from user where user_id='1bw71LaxSzSp_zV6NB_YGg')

Generally, NoSQL data stores don't support joins, so the data must be optimized for search using denormalization or other techniques.

Elasticsearch doesn't provide anything similar to SQL joins, but it offers similar alternatives, such as the following:

  • Child/parent queries via join field
  • Nested queries
  • Terms filtered with external document term fetching

See also

You can refer to the following points for further reference, all of which are related to this recipe:

  • The Executing a search recipe in Chapter 4, Exploring Search Capabilities
  • The Using a term query recipe in this chapter
  • The Using a boolean query recipe in Chapter 4, Exploring Search Capabilities
  • The Using the nested query, Using the has_child query and Using the has_parent query recipes in Chapter 6, Relationships and Geo Queries

Using a prefix query

The prefix query is used when only the initial part of a term is known. It allows the completion of truncated or partial terms.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a prefix query, we will perform the following steps:

  1. We execute a prefix query from the command line, as follows:
POST /mybooks/_search
{
  "query": {
    "prefix": {
      "uuid": "222"
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "uuid" : "22222",
          "position" : 2,
          "title" : "Bill Baloney",
          "description" : "Bill Testere nice guy",
          "date" : "2016-06-12",
          "price" : 5,
          "quantity" : 34
        }
      }
    ]
  }
}

How it works...

When executing a prefix query, Lucene has a special method to skip to the terms that start with a common prefix, so the execution of a prefix query is very fast.

The prefix query is generally used in scenarios where term completion is required, as follows:

  • Name completion
  • Code completion
  • On type completion

When designing tree structures in Elasticsearch, if the ID of an item encodes its hierarchical relationships (an approach known as a materialized path), applying filters can be sped up dramatically. The following example shows how to model fruit and vegetable categories using a materialized path on the ID:

ID      | Element
001     | Fruits
00102   | Apple
0010201 | Green apple
0010202 | Red apple
00103   | Melon
0010301 | White melon
002     | Vegetables

In the preceding example, we built IDs that contain the information about the tree structure, which allows us to create queries such as the following (a complete request sketch follows the list):

  • Filter by all the fruits, as follows:
"prefix": {"fruit_id": "001" }
  • Filter by all apple types, as follows:
"prefix": {"fruit_id": "001002" }
  • Filter by all the vegetables, as follows:
"prefix": {"fruit_id": "002" }

If you compare this with a standard SQL parent_id table on a very large dataset, the removal of joins and Lucene's fast search capabilities let you filter the results in milliseconds rather than seconds or minutes.

Structuring the data in the correct way can give an impressive performance boost!

There's more...

The prefix query is very handy when you search for text endings. For example, a user may need to match documents whose filename field ends with the png extension. Usually, users tend to execute a poorly performing regexp query similar to .*png. Regular expressions need to check every term of the field, so the computation time can be very long.

The best practice is to index the filename field with a reverse analyzer, which turns the suffix query into a prefix query!

To do this, perform the following steps:

  1. We define reverse_analyzer at the index level, putting this in the settings, as follows:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "reverse"
          ]
        }
      }
    }
  }
}

  2. When we define the filename field, we use reverse_analyzer for its subfield, as follows:
  "filename": {
    "type": "keyword",
    "fields": {
      "rev": {
        "type": "text",
        "analyzer": "reverse_analyzer"
      }
    }
  }

  3. Now we can search using a prefix query, using a similar query, as follows:
"query": {
    "prefix": {
      "filename.rev": ".jpg"
    }
  }

Using this approach, when you index a file named myTest.png, for example, the internal Elasticsearch data will be similar to the following:

filename:"myTest.jpg"
filename.rev:"gnp.tsetym"

Because the text analyzer is used both for indexing and for searching the prefix text, the .png value will automatically be reversed (into gnp.) when the query is executed.

Moving from a regular expression to a prefix query for ends-with matching can reduce your execution time from seconds to milliseconds!
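You can verify the reversal with the _analyze API; this sketch assumes that reverse_analyzer is defined in the settings of an index called myindex, as shown previously:

POST /myindex/_analyze
{
  "analyzer": "reverse_analyzer",
  "text": "myTest.png"
}

The returned token should be gnp.tsetym.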

See also

  • The Using a term query recipe, which is about full term search in Elasticsearch

Using a wildcard query

The wildcard query is used when a part of a term is known. It allows the completion of truncated or partial terms. Wildcards are very famous because they are often used in system shells for file commands (that is, ls *.jpg).

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a wildcard query, we will perform the following steps:

  1. We will execute a wildcard query from the command line, as follows:
POST /mybooks/_search
{
  "query": {
    "wildcard": {
      "uuid": "22?2*"
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "uuid" : "22222",
          "position" : 2,
          "title" : "Bill Baloney",
          "description" : "Bill Testere nice guy",
          "date" : "2016-06-12",
          "price" : 5,
          "quantity" : 34
        }
      }
    ]
  }
}

How it works...

A wildcard query is very similar to a regular expression, but it has only two special characters:

  • *: This means match zero or more characters
  • ?: This means match one character

During query execution, all the terms of the searched field are matched against the wildcard pattern. Therefore, the performance of a wildcard query depends on the cardinality of your terms.

To improve performance, it's suggested not to execute wildcard queries that start with * or ?.
To speed up searches, it's good practice to have some starting characters, so that Lucene's skipTo method can be used to reduce the number of processed terms.
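When the score does not matter, a wildcard query can also be wrapped in a boolean filter, as we did with the term query earlier; here is a minimal sketch:

POST /mybooks/_search
{
  "query": {
    "bool": {
      "filter": {
        "wildcard": {
          "uuid": "22?2*"
        }
      }
    }
  }
}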

See also

You can refer to the following points for further reference, all of which are related to this recipe:

  • The Using a regexp query recipe for more complex rules than wildcard ones
  • The Using a prefix query recipe for creating a query with terms that start with a prefix

Using a regexp query

In the previous recipes, we have seen different term queries (term, prefix, and wildcard); another powerful term-level query is the regexp (regular expression) one.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a regexp query, we will perform the following steps:

  1. We can execute a regexp term query from the command line, as follows:
POST /mybooks/_search
{
  "query": {
    "regexp": {
      "description": {
        "value": "j.*",
        "flags": "INTERSECTION|COMPLEMENT|EMPTY"
      }
    }
  }
}
  2. The query result will be as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "uuid" : "11111",
          "position" : 1,
          "title" : "Joe Tester",
          "description" : "Joe Testere nice guy",
          "date" : "2015-10-22",
          "price" : 4.3,
          "quantity" : 50
        }
      }
    ]
  }
}

The score for a matched regexp result is always 1.0.

How it works...

The regexp query executes a regular expression against all the terms of the documents. Internally, Lucene compiles the regular expression into an automaton to improve performance. Even so, this query is generally not fast, as its performance depends on the regular expression used.

To speed up regexp queries, a good approach is to use regular expressions that do not start with a wildcard. The parameters that are used to control this process are as follows:

  • boost (default 1.0): This includes the values used for boosting the score for this query.
  • flags: This is a list of one or more flags pipe | delimited. The available flags are:
    • ALL: This enables all the optional regexp syntax
    • ANYSTRING: This enables any string (@)
    • AUTOMATON: This enables named automata (<identifier>)
    • COMPLEMENT: This enables complement (~)
    • EMPTY: This enables empty language (#)
    • INTERSECTION: This enables intersection (&)
    • INTERVAL: This enables numerical intervals (<n-m>)
    • NONE: This enables no optional regexp syntax
To avoid poor performance in a search, don't execute regexes starting with .*. Instead, use a prefix query on a string processed with a reverse analyzer.
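For example, here is a hedged sketch that disables all the optional regexp syntax using the NONE flag, which can reduce parsing overhead:

POST /mybooks/_search
{
  "query": {
    "regexp": {
      "description": {
        "value": "jo.*",
        "flags": "NONE"
      }
    }
  }
}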

See also

Using span queries

The big difference between standard databases (SQL, and many NoSQL ones, such as MongoDB, Riak, or CouchDB) and Elasticsearch is the number of facilities for expressing text queries. The span query family is a group of queries that control a sequence of text tokens using their positions: standard queries do not care about the positional presence of text tokens.

Span queries allow the defining of several kinds of queries:

  • The exact phrase query
  • The exact fragment query (that is, take off and give up)
  • Partial exact phrase with a slop (other tokens between the searched terms, that is, the man with slop 2 can also match the strong man, the old wise man, and so on).

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute span queries, we will perform the following steps:

  1. The main element in span queries is span_term, whose usage is similar to the term of the standard term query. It is possible to aggregate more than one span_term to formulate a span query.
  2. The span_first query defines a query in which the span_term must match in the first token, or near it. The following code is an example of this:
POST /mybooks/_search
{
  "query": {
    "span_first": {
      "match": {
        "span_term": {
          "description": "joe"
        }
      },
      "end": 5
    }
  }
}
  3. The span_or query is used to define multivalues in a span query. This is very handy for simple synonym search, as shown in the following example:
POST /mybooks/_search
{
  "query": {
    "span_or": {
      "clauses": [
        {
          "span_term": {
            "description": "nice"
          }
        },
        {
          "span_term": {
            "description": "cool"
          }
        },
        {
          "span_term": {
            "description": "wonderful"
          }
        }
      ]
    }
  }
}

The list of clauses is the core of the span_or query, as it contains the span terms that should match.

  4. Similar to span_or, there is a span_multi query, which wraps multi-term queries such as prefix, wildcard, and so on. Consider the following code, for example:
POST /mybooks/_search
{
  "query": {
    "span_multi": {
      "match": {
        "prefix": {
          "description": {
            "value": "jo"
          }
        }
      }
    }
  }
}
  5. Queries can be used to create the span_near query, which allows you to control the token sequence of the query, as follows:
POST /mybooks/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "description": "nice"
          }
        },
        {
          "span_term": {
            "description": "joe"
          }
        },
        {
          "span_term": {
            "description": "guy"
          }
        }
      ],
      "slop": 3,
      "in_order": false
    }
  }
}
  6. For complex queries, skipping matching given positional tokens is very important. This can be achieved with the span_not query, as shown in the following example:
POST /mybooks/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_term": {
          "description": "nice"
        }
      },
      "exclude": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "description": "not"
              }
            },
            {
              "span_term": {
                "description": "nice"
              }
            }
          ],
          "slop": 1,
          "in_order": true
        }
      }
    }
  }
}

The include section contains the spans that must be matched, and the exclude section contains the spans that must not be matched. The query matches documents with the term nice, but not the phrase not nice. This is very useful for excluding negated phrases!

  7. For searching with a span query that is surrounded by other terms, we can use the span_containing query, as follows:
POST /mybooks/_search
{
  "query": {
    "span_containing": {
      "little": {
        "span_term": {
          "description": "nice"
        }
      },
      "big": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "description": "not"
              }
            },
            {
              "span_term": {
                "description": "guy"
              }
            }
          ],
          "slop": 5,
          "in_order": true
        }
      }
    }
  }
}

The little section contains the span that must be matched, and the big section contains the span that encloses the little match. In the preceding example, the matched expression is similar to not * nice * guy.

  8. For searching with a span query that is enclosed by other span terms, we can use the span_within query, as follows:
POST /mybooks/_search
{
  "query": {
    "span_within": {
      "little": {
        "span_term": {
          "description": "nice"
        }
      },
      "big": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "description": "not"
              }
            },
            {
              "span_term": {
                "description": "guy"
              }
            }
          ],
          "slop": 5,
          "in_order": true
        }
      }
    }
  }
}

The little section contains the span that must be matched, and the big section contains the span that encloses the little match.

How it works...

The span queries available in Elasticsearch are provided by Lucene. The basic span query, span_term, works exactly like a term query. The goal of this span query is to match an exact term (field plus text). Span terms can be combined to formulate other kinds of span queries.

The main use of span queries is proximity search, that is, finding terms that are close to each other.

Using span_term in a span_first query means matching a term that must be in the first position, or near it. If the end parameter (an integer) is defined, the first tokens are matched up to the passed value.

One of the most powerful span queries is span_or, which allows defining multiple terms in the same position. It covers several scenarios, such as the following:

  • Multinames
  • Synonyms
  • Several verbal forms

The span_or query has no span_and counterpart; such a query would make no sense, because span queries are positional queries.

If the number of terms to pass to a span_or is very large, you can reduce it using a span_multi query with a prefix or wildcard. For example, this approach allows matching all the terms play, playing, plays, player, players, and so on, using a prefix query with play, as sketched below.
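A sketch of this approach follows; note that the book's sample data contains no play terms, so this particular query is purely illustrative:

POST /mybooks/_search
{
  "query": {
    "span_multi": {
      "match": {
        "prefix": {
          "description": {
            "value": "play"
          }
        }
      }
    }
  }
}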

Otherwise, the most powerful span query is span_near, which allows defining a list of span queries (clauses) that must match in order or not. The parameters that can be passed to this span query are as follows:

  • in_order: This defines that the term matched in the clauses must be executed in order. If you define two span near queries with two span terms to match joe and black, and in_order is true, you will not be able to match black joe text (default true).
  • slop: This defines the distance between terms that must be matched from the clauses (default 0).
If you set the values of  slop to 0 and in_order to true, you are creating an exact phrase match query that we will see in the next recipe.

The span_near query plus slop can be used to create a phrase match that contains some unknown terms. For example, consider matching an expression such as the house. If you need to execute an exact match, you need to write a query similar to the following example:

{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "description": "the"
          }
        },
        {
          "span_term": {
            "description": "house"
          }
        }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}

Now, if you have an adjective between the the article and house (that is, the wonderful house, the big house, and so on), the previous query will never match them. To achieve this goal, you need to set the slop to 1, as shown in the sketch below.
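A sketch of the slop 1 variant is as follows:

{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "description": "the"
          }
        },
        {
          "span_term": {
            "description": "house"
          }
        }
      ],
      "slop": 1,
      "in_order": true
    }
  }
}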

Generally, slop is set to a value of 1, 2, or 3: high values (> 10) make no sense.

See also

The Using a match query recipe is a simplified way of creating simple span queries.

Using a match query

Elasticsearch provides a helper to build complex span queries that depend on simple, preconfigured settings. This helper is called the match query.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a match query, we will perform the following steps:

  1. The standard usage of a match query simply requires the field name and the query text. Consider the following example:
POST /mybooks/_search
{
  "query": {
    "match": {
      "description": {
        "query": "nice guy",
        "operator": "and"
      }
    }
  }
}
  2. If you need to execute the same query as a phrase query, the match type changes to match_phrase, as shown in the following example:
POST /mybooks/_search
{
  "query": {
    "match_phrase": {
      "description": "nice guy"
    }
  }
}
  3. An extension of the previous query, used in text completion or search-as-you-type functionality, is match_phrase_prefix, as follows:
POST /mybooks/_search
{
  "query": {
    "match_phrase_prefix": {
      "description": "nice gu"
    }
  }
}
  4. A common requirement is the possibility to search for several fields with the same query. The multi_match parameter provides this capability, as shown in the following example:
POST /mybooks/_search
{
  "query": {
    "multi_match": {
      "fields": [
        "description",
        "name"
      ],
      "query": "Bill",
      "operator": "and"
    }
  }
}

How it works...

The match query aggregates several frequently used query types, covering the standard query scenarios.

The standard match query creates a boolean query that can be controlled by these parameters:

  • operator: This defines how the terms are combined and processed. If it's set to OR, all the terms are converted into a boolean query with all the terms in should clauses. If it's set to AND, the terms build a list of must clauses (default OR).
  • analyzer: This allows overriding the default analyzer of the field (default based on mapping, or set in the searcher).
  • fuzziness: This allows defining fuzzy terms; see the sketch after this list. Related to this parameter, prefix_length and max_expansions are available.
  • zero_terms_query (none/all): This allows you to define what happens when a tokenizer filter removes all the terms from the query: whether to return nothing or all the documents. This is the case when you build an English query searching for the or a, which could match all the documents (default none).
  • cutoff_frequency: This allows the handling of dynamic stopwords (very common terms in text) at runtime. During query execution, terms over the cutoff_frequency are considered stopwords. This approach is very useful, as it allows converting a general query into a domain-specific query, because the terms to skip depend on text statistics. The correct value must be defined empirically.
  • auto_generate_synonyms_phrase_query (default true): This defines whether the match query should use multi-term synonym expansion with the synonym_graph token filter (for more references, look at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html).
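As referenced in the fuzziness item above, here is a hedged sketch (the misspelled query text is illustrative):

POST /mybooks/_search
{
  "query": {
    "match": {
      "description": {
        "query": "nyce guy",
        "fuzziness": "AUTO"
      }
    }
  }
}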

The boolean query created from the match query is very handy, but it suffers from some common problems related to boolean queries, such as term positions. If term positions matter, you need to use another family of match queries, the phrase one.

The match_phrase type of match query builds long span queries from the query text. The parameters that can be used to improve the quality of phrase queries are the analyzer, for text processing, and the slop, which controls the distance between terms (see the Using span queries recipe).

If the last term is partially complete, and you want to provide your users with query-as-you-type functionality, the phrase type can be set to match_phrase_prefix. This type builds a span near query in which the last clause is a span prefix term. This feature is often used for typeahead widgets, such as the one shown in the following screenshot:

(Screenshot: a typeahead suggestion widget)

The match query is a very useful query type, or, as I defined it earlier, a helper for building several common queries internally.

The multi_match parameter is similar to the match query, and allows you to define multiple fields to search on. Several helpers can be used to define these fields, such as the following:

  • Wildcards field definition: Using wildcards is a simple way to define multiple fields in one shot. For example, if you have fields for languages such as name_en, name_es, and name_it, you can define the search field as name_* to automatically search all the name fields.
  • Boosting some fields: Not all the fields have the same importance. You can boost your fields using the ^ operator. For example, if you have title and content fields, and title is more important than content, you can define the fields in this way:

"fields": ["title^3", "content"]

See also

You can check out the following points, related to this recipe, for further reference:

  • The Using span queries recipe to build more complex text queries
  • The Using a prefix query recipe for simple initial typeahead

Using a query string query

In the previous recipes, we have seen several types of queries that use text to match results. The query string query is a special type of query that allows us to define complex queries by mixing field rules.

It uses the Lucene query parser to parse the text into complex queries.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a query_string query, we will perform the following steps:

  1. We want to search for the text nice guy, but with a condition of discarding the term not and displaying a price less than 5. The query will be as follows:
POST /mybooks/_search
{
  "query": {
    "query_string": {
      "query": """"nice guy" -description:not price:{ * TO 5 } """,
      "fields": [
        "description^5"
      ],
      "default_operator": "and"
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.3786995,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3786995,
        "_source" : {
          "uuid" : "11111",
          "position" : 1,
          "title" : "Joe Tester",
          "description" : "Joe Testere nice guy",
          "date" : "2015-10-22",
          "price" : 4.3,
          "quantity" : 50
        }
      }
    ]
  }
}

How it works...

The query_string query is one of the most powerful types of queries. The only required field is query, which contains the query that must be parsed with the Lucene query parser. For further details, see http://lucene.apache.org/core/7_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.

The Lucene query parser is able to analyze complex query syntax and convert it into many of the query types that we have seen in the previous recipes.

The optional parameters that can be passed to the query string query are as follows:

  • default_field: This defines the default field to be used to the query. It can also be set at an index level defining the index property index.query.default_field (default _all).
  • fields: This defines a list of fields to be used. It replaces the default_field. The fields parameter also allows us to use wildcards as values (that is, city.*).
  • default_operator: This is the default operator to be used for text in the query parameter (the default OR; the available values are AND and OR).
  • analyzer: This is the analyzer that must be used for a query string.
  • allow_leading_wildcard: Here, the * and ? wildcards are allowed as first characters. Using similar wildcards gives performance penalties (default true).
  • lowercase_expanded_terms: This controls whether all expansion terms (generated by fuzzy, range, wildcard, and prefix) must be lowercase (default true).
  • enable_position_increments: This enables the position increment in queries. For every query token, the positional value is incremented by 1 (default true).
  • fuzzy_max_expansions: This controls the number of terms to be used in fuzzy term expansion (default 50).
  • fuzziness: This sets the fuzziness value for fuzzy queries (default AUTO).
  • fuzzy_prefix_length: This sets the prefix length for fuzzy queries (default 0).
  • phrase_slop: This sets the default slop (number of optional terms that can be present in the middle of the given terms) for phrases. If it sets to zero, the query is an exact phrase match (default 0).
  • boost: This defines the boost value of the query (default 1.0).
  • analyze_wildcard: This enables the processing of wildcard terms in the query (default false).
  • auto_generate_phrase_queries: This enables the auto-generation of phrase queries from the query string (default false).
  • minimum_should_match: This controls how many should clauses should be verified to match the result. The value could be an integer value (that is, 3) or a percentage (that is, 40%) or a combination of both (default 1).
  • lenient: If it's set to true, the parser will ignore all format-based failures (such as text to number of date conversion) (default false).
  • locale: This is the locale used for string conversion (default ROOT).

There's more...

The query parser is a very powerful tool that supports a wide range of complex queries. The most common cases are as follows:

  • field:text: This is used to match a field that contains some text. It's mapped on a term query.
  • field:(term1 OR term2): This is used to match some terms in OR. It's mapped on a terms query.
  • field:"text": This is used to match the exact text. It's mapped on a match query.
  • _exists_:field: This is used to match documents that have a field. It's mapped on an exists filter.
  • _missing_:field: This is used to match documents that don't have a field. It's mapped on a missing filter.
  • field:[start TO end]: This is used to match a range from the start value to the end value. The start and end values could be terms, numbers, or a valid date-time value. The start and end values are included in the range; if you want to exclude a range, you must replace the [] delimiters with {}.
  • field:/regex/: This is used to match a regular expression.

The query parser also supports text modifiers, which are used to manipulate the text functionalities. The most used ones are as follows (a combined sketch appears after this list):

  • Fuzziness using the form text~. The default fuzziness value is 2, which allows a Damerau-Levenshtein edit-distance algorithm (http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) of 2.
  • Wildcards with ? which replace a single character, or * to replace zero or more characters. (That is, b?ll or bi* to match bill.)
  • Proximity search "term1 term2"~3, allows matching phrase terms with a defined slop. (That is, "my umbrella"~3 matches "my green umbrella", "my new umbrella", and so on.)
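Here is a combined sketch of these modifiers using the sample data's description and price fields (the exact terms are illustrative):

POST /mybooks/_search
{
  "query": {
    "query_string": {
      "query": "description:bil~ AND price:[4 TO 6]"
    }
  }
}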

See also

Using a simple query string query

Usually, programmers have control over building complex queries using the boolean query and the other query types. Therefore, Elasticsearch provides two kinds of queries that give users the ability to create string queries containing several operators.

These kinds of queries are very common in advanced search engine usage, such as in Google, which allows us to use the + and - operators on terms.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a simple query string query, we will perform the following steps:

  1. We want to search for the text nice guy, but exclude the term not. The query will be as follows:
POST /mybooks/_search
{
  "query": {
    "simple_query_string": {
      "query": """"nice guy" -not""",
      "fields": [
        "description^5",
        "_all"
      ],
      "default_operator": "and"
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  ...truncated...
  "hits" : {
    "total" : 2,
    "max_score" : 2.3786995,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3786995,
        "_source" : {
          "uuid" : "11111",
          "position" : 1,
          "title" : "Joe Tester",
          "description" : "Joe Testere nice guy",
          "date" : "2015-10-22",
          "price" : 4.3,
          "quantity" : 50
        }
      },
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.3786995,
        "_source" : {
          "uuid" : "22222",
          "position" : 2,
          "title" : "Bill Baloney",
          "description" : "Bill Testere nice guy",
          "date" : "2016-06-12",
          "price" : 5,
          "quantity" : 34
        }
      }
    ]
  }
}

How it works...

The simple query string query takes the query text, tokenizes it, and builds a boolean query by applying the rules provided in the query text.

It is a good tool to offer to end users for expressing simple advanced queries. Its parser is quite complex, so it is able to extract exact-match fragments to be interpreted as span queries.

The advantage of the simple query string query is that the parser always gives you a valid query.

With the previous query type, the query string query, if the user provides badly formed input, the query raises an error. With the simple query string, badly formed input is "fixed" and executed without raising errors.

See also

Using the range query

All the previous queries work on defined or partially defined values, but it's very common in real-world applications to work on ranges of values. The most common standard scenarios are:

  • Filtering by numeric value range (that is, price, size, and age)
  • Filtering by date (that is, events of 03/07/12 can be a range query from 03/07/12 00:00:00 to 03/07/12 24:59:59)
  • Filtering by term range (that is, from A to D).

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a range query, we will perform the following steps:

  1. Consider the sample data of the previous examples, which contains an integer field position. Using it to execute a query filtering positions from 3 (included) to 4 (excluded), we will have the following output:
POST /mybooks/_search
{
  "query": {
    "range": {
      "position": {
        "from": 3,
        "to": 4,
        "include_lower": true,
        "include_upper": false
      }
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
 "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "uuid" : "33333",
          "position" : 3,
          "title" : "Bill Klingon",
          "description" : "Bill is not\n nice guy",
          "date" : "2017-09-21",
          "price" : 6,
          "quantity" : 33
        }
      }
    ]
  }
}

How it works...

Range queries are used because scoring their results can cover several interesting scenarios, such as the following:

  • Items with high availability in stocks should be presented first
  • New items should be boosted
  • Most bought items should be boosted

Range queries are very handy for numeric values, as the previous example shows. The parameters accepted by the range query are:

  • from: This is the starting value for the range (optional)
  • to: This is the ending value for the range (optional)
  • include_lower: This includes the starting value in the range (optional, default true)
  • include_upper: This includes the ending value in the range (optional, default true)

In range queries, other helper parameters are also available to simplify searches, as follows (an equivalent rewrite of the earlier query appears after this list):

  • gt (greater than): This has the same functionality as setting the from parameter and include_lower to false
  • gte (greater than or equal to): This has the same functionality as setting the from parameter and include_lower to true
  • lt (less than): This has the same functionality as setting the to parameter and include_upper to false
  • lte (less than or equal to): This has the same functionality as setting the to parameter and include_upper to true
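Using these helpers, the earlier query can be rewritten equivalently, as follows:

POST /mybooks/_search
{
  "query": {
    "range": {
      "position": {
        "gte": 3,
        "lt": 4
      }
    }
  }
}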

There's more...

In Elasticsearch, this kind of query covers several SQL range queries, such as <, <=, >, and >= on numeric values. Because date or time fields are internally managed as numeric fields in Elasticsearch, it is possible to use range queries or filters with date values. If the field is a date field, every value in the range query is automatically converted into a numeric value. For example, if you need to filter the documents of a given year, the range fragment will be similar to the following:

  "range": {
    "timestamp": {
      "from": "2014-01-01",
      "to": "2015-01-01",
      "include_lower": true,
      "include_upper": false
    }
  }

For date fields, it is also possible to specify a time_zone value to be used to correctly compute the matching.

If you are using a date value, you can use date math (https://www.elastic.co/guide/en/elasticsearch/reference/master/common-options.html#date-math) to round the values.

The common terms query

When a user searches some text with a query, not all the terms that the user uses are of the same importance. The more common terms are generally removed at query execution time to reduce the noise they produce. These terms are called stopwords, and they are generally articles, conjunctions, and common language words (that is, the, a, so, and, and so on).

The list of stopwords depends on the language, and is independent of your documents. Lucene provides the common terms query as a way to compute the stopword list dynamically at query time.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a common terms query, we will perform the following steps:

  1. We want to search for a nice guy, so we will use the following code:
POST /mybooks/_search
{
  "query": {
    "common": {
      "description": {
        "query": "nice guy",
        "cutoff_frequency": 0.001
      }
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  ...truncated...
  "hits" : {
    "total" : 3,
    "max_score" : 0.2757399,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2757399,
        "_source" :...truncated...,
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.2757399,
        "_source" :...truncated...,,
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.25124985,
        "_source" :...truncated...,
      }
    ]
  }
}

How it works...

Lucene, the core engine of Elasticsearch, provides a lot of statistics on the terms in your index; these statistics are required by the algorithms that compute the different score types.

These statistics are used at query time and, in general, a query divides the query terms into two categories:

  • Low frequency terms: These are the less common terms in your index. They are generally the most important ones for your current query. For the preceding query, the terms could be ["nice", "guy"].
  • High frequency terms: They are the most common ones and mainly defined as stop words. For the preceding query, the term could be ["a"].

The preceding query, based on the term statistics, is converted internally by Elasticsearch into a query similar to the following:

{
  "query": {
    "bool": {
      "must": [ # low frequency terms
        {
          "term": {
            "description": "nice"
          }
        },
        {
          "term": {
            "description": "guy"
          }
        }
      ],
      "should": [ # high frequency terms
        {
          "term": {
            "description": "a"
          }
        }
      ]
    }
  }
}

To control the common terms query, the following options can be used (a combined sketch follows the list):

  • cutoff_frequency: This value defines the cutoff frequency that allows us to partition the low and high frequency term lists. Its best value depends on your data; some empirical tests are needed to evaluate the correct value
  • minimum_should_match. This can be defined in two ways:
    • As a single value. This defines the minimum terms that must be matched for low frequency terms, that is, "minimum_should_match" : 2
    • As an object containing the low and high values, that is as follows:
  "minimum_should_match": {
    "low_freq": 1,
    "high_freq": 2
  }
Note that the statistics for the terms depend on the data in your Lucene index, so they are computed at shard level in Elasticsearch.
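Here is a hedged sketch that combines both options with the earlier query:

POST /mybooks/_search
{
  "query": {
    "common": {
      "description": {
        "query": "nice guy",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq": 1,
          "high_freq": 2
        }
      }
    }
  }
}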

See also

You can check out the following points, related to this recipe, for further reference:

  • The Using a term query recipe for a simple term match.
  • The Using a boolean query recipe in Chapter 4, Exploring Search Capabilities

Using an IDs query

The IDs query allows matching documents by their IDs, distributing the query across all the searched shards.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute an IDs query or filter, we will perform the following steps:

  1. The IDs query for fetching IDs "1", "2", "3" of type test-type is in the following form:
POST /mybooks/_search
{
  "query": {
    "ids": {
      "type": "test-type",
      "values": [
        "1",
        "2",
        "3"
      ]
    }
  }
}
  2. The result returned by Elasticsearch, if everything is alright, should be as follows:
{
  ...truncated...
  "hits" : {
    "total" : 3,
    "max_score" : 0.2757399,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2757399,
        "_source" :...truncated...,
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.2757399,
        "_source" :...truncated...,,
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.25124985,
        "_source" :...truncated...,
      }
    ]
  }
}

In the results, the order of the requested IDs is not respected. Therefore, if you query over several types, you need to use the document metadata (_index, _type, _id) to better manage your results.

How it works...

Querying by ID is a very fast operation, because IDs are often cached in memory for fast lookup.

The parameters used in this query are:

  • ids: This includes a list of IDs that must be matched (required)
  • type: This is a string, or a list of strings, which defines the types in which we need to search. If not defined, they are taken from the URL of the call (optional).
Internally, Elasticsearch stores the ID of a document in a special field called _id. The _id value is unique within an index.

Generally, the standard way to use the IDs query is to select documents; this query allows fetching documents without knowing the shard that contains them. Documents are stored in shards that are chosen based on a modulo operation computed on the document ID. If a parent ID or a routing value is defined, they are used to select the shard; in this case, the only way to fetch a document knowing its ID is to use the IDs query.

If you need to fetch multiple IDs, and there are no routing changes (due to routing parameters at index time), it is better not to use this kind of query, and to use the get or multi-get API calls to fetch documents instead, because they are faster and work in real time; a multi-get sketch follows.
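For comparison, here is a minimal multi-get sketch that fetches the same documents in real time:

GET /mybooks/_mget
{
  "ids": ["1", "2", "3"]
}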

See also

You can check out the following points, related to this recipe, for further reference:

  • The Getting a document recipe in Chapter 3, Basic Operations.
  • The Speeding up GET operations (Multi GET) recipe in Chapter 3, Basic Operations.

Using the function score query

This kind of query is one of the most powerful ones available, because it allows extensive customization of the scoring algorithm. The function score query allows us to define a function that controls the score of the documents that are returned by a query.

Generally, these functions are CPU-intensive, and executing them on a large dataset requires a lot of memory, but computing them on a small subset can significantly improve the search quality.

The common scenarios for this query are:

  • Creating a custom score function (with decay function, for example)
  • Creating a custom boost factor, for example, based on another field (that is, boosting a document by distance from a point)
  • Creating a custom filter score function, for example, based on scripting Elasticsearch capabilities
  • Ordering the documents randomly.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute a function score query, we will perform the following steps:

  1. We can execute a function_score query from the command line, as follows:
POST /mybooks/_search
{
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "query": "bill"
        }
      },
      "functions": [
        {
          "linear": {
            "position": {
              "origin": "0",
              "scale": "20"
            }
          }
        }
      ],
      "score_mode": "multiply"
    }
  }
}

We execute a query searching for bill, scoring the results with a linear function on the position field.

  2. The result should be as follows:
{
  "took" : 32,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.46101078,
    "hits" : [
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.46101078,
        "_source" : ...truncated...
      },
      {
        "_index" : "mybooks",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.43475336,
        "_source" : ...truncated...
      }
    ]
  }
}

How it works...

The function score query is probably the most complex query type to master, due to the natural complexity of the mathematical algorithms involved in the scoring.

The generic full form of the function score query is as follows:

  "function_score": {
    "(query|filter)": {},
    "boost": "boost for the whole query",
    "functions": [
      {
        "filter": {},
        "FUNCTION": {}
      },
      {
        "FUNCTION": {}
      }
    ],
    "max_boost": number,
    "boost_mode": "(multiply|replace|...)",
    "score_mode": "(multiply|max|...)",
    "script_score": {},
    "random_score": {
      "seed ": number
    }
  }

The parameters that are used are as follows:

  • query or filter: This is the query used to match the required documents (optional, default a match all query).
  • boost: This is the boost to apply to the whole query (default 1.0).
  • functions: This is a list of functions used to score the queries. In a simple case, use only one function. In the function object, a filter can be provided to apply the function only to a subset of documents, because the filter is applied first.
  • max_boost: This sets the maximum allowed value for the boost score (default java FLT_MAX).
  • boost_mode: This parameter defines how the function score is combined with the query score (default "multiply"). The possible values are:
    • multiply (default): Here, the query score and function score are multiplied
    • replace: Here, only the function score is used; the query score is ignored
    • sum: Here, the query score and function score are added
    • avg: Here, the average between query score and function score is taken
    • max: This is the maximum of query score and function score
    • min: This is the minimum of query score and function score
  • score_mode (default multiply): This parameter defines how the resulting function scores (when multiple functions are defined) are combined. The possible values are:
    • multiply: The scores are multiplied
    • sum: The scores are summed
    • avg: The scores are averaged
    • first: The first function that has a matching filter is applied
    • max: The maximum score is used
    • min: The minimum score is used
  • script_score: This allows you to define a script score function to be used to compute the score (optional). (Elasticsearch scripting will be discussed in Chapter 8, Scripting in Elasticsearch.) This parameter is very useful in implementing simple script algorithms. The original score value is in the _score function scope. This allows the defining of similar algorithms, as follows:
  • "script_score": {
        "script": {
          "params": {
            "param1": 2,
            "param2": 3.1
          },
          "source": "_score * doc['my_numeric_field'].value /pow(param1, param2)"
        }
      }
    In Elasticsearch 7.x, script_score can also be used as a script score query, which is an experimental feature (https://www.elastic.co/guide/en/elasticsearch/reference/7.0/query-dsl-script-score-query.html).
  • random_score: This allows us to score the documents randomly. It is very useful for retrieving records in random order (optional); see the sketch after this list.
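As referenced in the random_score item, here is a minimal sketch; note that in Elasticsearch 7.x a reproducible seed also requires a field, such as _seq_no:

POST /mybooks/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "random_score": {
        "seed": 42,
        "field": "_seq_no"
      }
    }
  }
}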

Elasticsearch provides native support for the most common score decay distribution algorithms, such as the following:

  • Linear: This is used to linearly distribute the scores based on a distance from a value
  • Exponential (exp): This is used for an exponential decay function
  • Gaussian (gauss): This is used for the Gaussian decay function
Choosing the correct function distribution depends on the context and data distribution.

See also

You can refer to the following points for further reference, all of which are related to this recipe:

Using the exists query

One of the main characteristics of Elasticsearch is its schema-less indexing capability. Records in Elasticsearch can have missing values. Due to this schema-less nature, two kinds of queries are required:

  • Exists field: This is used to check if a field exists in a document.
  • Missing field: This is used to check if a field is missing in a document.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, as it provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch04/populate_kibana.txt commands, which are available in the online code.

How to do it...

To execute existing and missing filters, we will perform the following steps:

  1. To search all the test-type documents that have a field called description, the query will be as follows:
POST /mybooks/_search
{
  "query": {
    "exists": {
      "field": "description"
    }
  }
}

  2. To search for all the test-type documents that do not have a field called description, since there is no missing query, we can obtain the same result by using a boolean must_not query; the query will be as follows:
POST /mybooks/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "description"
        }
      }
    }
  }
}

How it works...

The exists and missing filters take only the field parameter, which contains the name of the field to be checked. When you use simple fields, there are no pitfalls; however, if you use a single embedded object or a list of them, you need to use a subobject field, due to how Elasticsearch and Lucene work.

The following example helps you understand how Elasticsearch internally maps JSON objects to Lucene documents. If you try to index the following JSON document:

{
  "name": "Paul",
  "address": {
    "city": "Sydney",
    "street": "Opera House Road",
    "number": "44"
  }
}

Elasticsearch will internally index it as follows:

name:paul
address.city:Sydney
address.street:Opera House Road
address.number:44

As we can see, there is no indexed field called address, so an exists filter on address fails. To match documents with an address, you must search on a subfield (that is, address.city).
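For instance, here is a hedged sketch against a hypothetical people index containing the preceding document:

POST /people/_search
{
  "query": {
    "exists": {
      "field": "address.city"
    }
  }
}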