Managing Mapping

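Mapping is a very important concept in Elasticsearch, because it defines how the search engine should process a document and its fields.

A search engine performs two main operations: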

Indexing: This is the action to receive a document and to process it and store it in an index
Searching: This is the action to retrieve the data from the index

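These two parts are strictly connected; an error in the indexing step leads to unwanted or missing search results.

Elasticsearch has explicit mapping at the index level. When indexing, if a mapping is not provided, a default one is created, and the structure is guessed from the data fields that the document is composed of. This new mapping is then automatically propagated to all the cluster nodes.

The default type mapping has sensible default values, but when you want to change their behavior or customize several other aspects of indexing (storing, ignoring, completion, and so on), you need to provide a new mapping definition.

In this chapter, we'll look at all the possible mapping field types that a document mapping is composed of.

In this chapter, we will cover the following recipes: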

Using explicit mapping creation
Mapping base types
Mapping arrays
Mapping an object
Mapping a document
Using dynamic templates in document mapping
Managing nested objects
Managing a child document with a join field
Adding a field with multiple mappings
Mapping a GeoPoint field
Mapping a GeoShape field
Mapping an IP field
Mapping an alias field
Mapping a Percolator field
Mapping feature and feature vector fields
Adding metadata to a mapping
Specifying different analyzers
Mapping a completion field

Using explicit mapping creation

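If we consider the index as a database in the SQL world, the mapping is similar to the table definition.

Elasticsearch is able to understand the structure of the document that you are indexing (reflection) and create the mapping definition automatically (explicit mapping creation).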

Getting ready

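To execute the code in this recipe, you need an up-and-running Elasticsearch installation, as described in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To better understand the examples and code in this recipe, basic knowledge of JSON is required.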

How to do it...

You can explicitly create a mapping by adding a new document in Elasticsearch. For this, we will perform the following steps:

Create an index like so:

PUT test

The answer will be as follows:

{
 "acknowledged" : true,
 "shards_acknowledged" : true,
 "index" : "test"
 }

Put a document in the index, as shown in the following code:

PUT test/_doc/1
{"name":"Paul", "age":35}

The answer will be as follows:

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Get the mapping with the following code:

GET test/_mapping

The result mapping that's autocreated by Elasticsearch should be as follows:

{
  "test" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

To delete the index, you can call the following:

DELETE test

The answer will be as follows:

{
  "acknowledged" : true
}

How it works...

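The first command line creates an index where we will configure the type/mapping and insert the documents.

The second command inserts a document in the index (we will come back to this in Chapter 3, Basic Operations, and to record indexing in the Indexing a document recipe in the same chapter).

During the document index phase, Elasticsearch internally checks whether the _doc type exists; otherwise, it creates one dynamically.

Elasticsearch reads all the default properties for the field of the mapping and starts to process them as follows: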

If the field is already present in the mapping and the value of the field is valid (it matches the correct type), Elasticsearch does not need to change the current mappings.
If the field is already present in the mapping but the value of the field is of a different type, it tries to upgrade the field type (that is, from integer to long). If the types are not compatible, it throws an exception and the index process fails.
If the field is not present, it tries to auto detect the type of field. It updates the mappings with a new field mapping.
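The auto-detection step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's implementation: the function names `infer_field_mapping` and `infer_mapping` are hypothetical, and date detection is reduced to a single assumed format (`yyyy-MM-dd`).

```python
from datetime import datetime


def infer_field_mapping(value):
    """Guess a field mapping from a JSON value (simplified sketch)."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        try:
            # Assumption: only one date format is tried in this sketch.
            datetime.strptime(value, "%Y-%m-%d")
            return {"type": "date"}
        except ValueError:
            # Same shape as the "name" field returned by GET test/_mapping.
            return {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    if isinstance(value, dict):
        return {"properties": infer_mapping(value)}
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for element in value:
            if element is not None:
                return infer_field_mapping(element)
    return None  # null or empty array: nothing can be guessed yet


def infer_mapping(document):
    """Build the "properties" section for a whole document."""
    properties = {}
    for field, value in document.items():
        field_mapping = infer_field_mapping(value)
        if field_mapping is not None:
            properties[field] = field_mapping
    return properties


mapping = infer_mapping({"name": "Paul", "age": 35})
```

Applied to the document indexed in this recipe, the sketch yields a long field for age and a text field with a keyword subfield for name, which is the same shape as the autocreated mapping shown earlier.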

There's more...

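In Elasticsearch, the separation of document types is logical, not physical. The Elasticsearch core engine manages this transparently. Physically, all the document types go in the same Lucene index, so there is no full separation between them. The concept of types is purely logical and is enforced by Elasticsearch. The user is not bothered by this internal management, but in some cases, with a huge number of records, this has an impact on the performance of reading and writing records, because all the records are stored in the same index files.

Every document has a unique identifier, called the UID for the index, which is stored in the special _uid field of the document. It is automatically calculated by adding the type of the document to the _id value. In our example, the _uid will be _doc#1.

The _id can be provided at index time, or it can be assigned automatically by Elasticsearch if it is missing.

When a mapping type is created or changed, Elasticsearch automatically propagates the mapping changes to all the nodes in the cluster so that all the shards are aligned to process that particular type.

Each index can contain only one type; in previous versions of Elasticsearch, the type name could vary. Because types are deprecated in 7.x, the best practice is to call the type _doc.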

Mapping base types

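Using explicit mapping creation makes it possible to start ingesting data quickly with a schema-less approach, without being concerned about field types. However, to achieve better results and performance in indexing, you need to define a mapping manually.

Fine-tuning the mapping brings some advantages, such as the following: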

Reducing the index size on the disk (disabling functionalities for custom fields)
Indexing only interesting fields (general speed up)
Precooking data for fast search or real-time analytics (such as facets)
Correctly defining whether a field must be analyzed in multiple tokens or considered as a single token

Elasticsearch allows you to use base fields with a wide range of configurations.

Getting ready

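You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To execute this recipe's examples, you will need to create an index named test, where you can put mappings, as explained in the Using explicit mapping creation recipe.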

How to do it...

Let's use a semi real-world example of a shop order for our eBay-like shop:

First, we define an order:

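Name	Type	Description
`id`	`identifier`	Order identifier
`date`	`date(time)`	Date of the order
`customer_id`	`id reference`	Customer ID reference
`name`	`string`	Name of the item
`quantity`	`integer`	How many items?
`price`	`double`	The price of the item
`vat`	`double`	VAT for the item
`sent`	`boolean`	The order was sent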

Our order record must be converted into an Elasticsearch mapping definition as follows:

PUT test/_mapping
{
    "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword"},
      "sent" : {"type" : "boolean"},
      "name" : {"type" : "keyword"},
      "quantity" : {"type" : "integer"},
      "price" : {"type" : "double"},
      "vat" : {"type" : "double", "index":"false"}
    }
}

Now, the mapping is ready to be put in the index. We will see how to do this in Chapter 3, Basic Operations.

How it works...

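The field type must be mapped to one of the Elasticsearch base types, and options about how the field must be indexed need to be added.

The following table is a reference for the mapping types:

Type	ES type	Description
`String`, `VarChar`	`keyword`	This is a text field that is not tokenized: `CODE001`
`String`, `VarChar`, `Text`	`text`	This is a text field to be tokenized: a nice text
`Integer`	`integer`	This is an integer (32 bit): 1, 2, 3, or 4
`long`	`long`	This is a long value (64 bit)
`float`	`float`	This is a floating-point number (32 bit): 1.2 or 4.5
`double`	`double`	This is a floating-point number (64 bit)
`boolean`	`boolean`	This is a Boolean value: true or false
`date`/`datetime`	`date`	This is a date or datetime value: `2013-12-25`, `2013-12-25T22:21:20`
`bytes`/`binary`	`binary`	This includes some bytes that are used for binary data, such as a file or a stream of bytes

Depending on the data type, it is possible to give explicit directives to Elasticsearch when it processes the field, for better management. The most commonly used options are as follows: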

store (default false): This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space, but reduces computation if you need to extract it from a document (that is, in scripting and aggregations). The possible values for this option are false and true.

The stored fields are faster than others in aggregations.

index: This defines whether or not the field should be indexed. The possible values for this parameter are true and false. Fields that are not indexed are not searchable (default true).
null_value: This defines a default value if the field is null.
boost: This is used to change the importance of a field (default 1.0).

Boost works on a term level only, so it's mainly used in term, terms, and match queries.

search_analyzer: This defines an analyzer to be used during the search. If not defined, the analyzer of the parent object is used (default null).
analyzer: This sets the default analyzer to be used (default null).
include_in_all: This marks the current field to be indexed in the special _all field (a field that contains the concatenated text of all fields) (default true).
norms: This controls the Lucene norms. This parameter is used to better score queries. If the field is used only for filtering, it's best practice to disable it to reduce resource usage (default true for analyzed fields and false for not_analyzed ones).
copy_to: This allows you to copy the content of a field to another one to achieve functionalities, similar to the _all field.
ignore_above: This allows you to skip the indexing string that's bigger than its value. This is useful for processing fields for exact filtering, aggregations, and sorting. It also prevents a single term token from becoming too big and prevents errors due to the Lucene term byte-length limit of 32766 (default 2147483647).

There's more...

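In previous versions of Elasticsearch, the standard mapping for a string was string. In version 5.x, the string mapping was deprecated and migrated to the keyword and text mappings.

In Elasticsearch version 6.x, as shown in the Using explicit mapping creation recipe, the explicit inferred type for a string is a multifield mapping: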

The default processing is text. This mapping allows textual queries (that is, term, match, and span queries). In the example provided in the Using explicit mapping creation recipe, this was name.
The keyword subfield is used for keyword mapping. This field can be used for exact term matching and for aggregation and sorting. In the example provided in the Using explicit mapping creation recipe, the referred field was name.keyword.

Another important parameter, available only for text mapping, is term_vector (the vector of terms that compose a string; refer to the Lucene documentation at http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/index/Terms.html for more details).

term_vector can accept the following values:

no: This is the default value; the term vector is skipped
yes: This stores the term vector
with_offsets: This stores the term vector with token offsets (start and end position in a block of characters)
with_positions: This stores the position of the token in the term vector
with_positions_offsets: This stores all the term vector data

Term vectors allow fast highlighting, but consume disk space due to storing of additional text information. It's a best practice to only activate in fields that require highlighting, such as title or document content.

Mapping arrays

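Arrays or multi-value fields are very common in data models (such as multiple phone numbers, addresses, names, aliases, and so on), but they are not natively supported in traditional SQL solutions.

In SQL, multi-value fields require the creation of accessory tables that must be joined to gather all the values, leading to poor performance when the cardinality of the records is huge.

Elasticsearch works natively in JSON and provides support for multi-value fields transparently.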

Getting ready

You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

How to do it...

Every field is automatically managed as an array. For example, to store tags for a document, the mapping will be as follows:

{
    "properties" : {
      "name" : {"type" : "keyword"},
      "tag" : {"type" : "keyword", "store" : true},
      ...
    }
}

This mapping is valid for indexing both documents. The following is the code for document1:

{"name": "document1", "tag": "awesome"}

The following is the code for document2:

{"name": "document2", "tag": ["cool", "awesome", "amazing"] }

How it works...

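Elasticsearch transparently manages the array: there is no difference between declaring a single value or a multi-value, due to its Lucene core nature.

Multi-values for a field are managed in Lucene by adding them to a document with the same field name. For people with a SQL background, this behavior may seem quite strange, but it is a key point in the NoSQL world, as it reduces the need for join queries and for creating different tables to manage multi-values. An array of embedded objects has the same behavior as simple fields.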
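As a rough illustration of why single values and arrays need no separate declaration, the following Python sketch flattens a document into (field, value) pairs, one pair per value, which mirrors the idea of adding several values under the same field name. The helper name `to_field_values` is hypothetical and the sketch ignores analysis and nested objects.

```python
def to_field_values(document):
    """Flatten a document into (field, value) pairs, one pair per value."""
    pairs = []
    for field, value in document.items():
        # A single value is treated as an array with one element.
        values = value if isinstance(value, list) else [value]
        for single_value in values:
            pairs.append((field, single_value))
    return pairs


doc1 = to_field_values({"name": "document1", "tag": "awesome"})
doc2 = to_field_values({"name": "document2", "tag": ["cool", "awesome", "amazing"]})
```

Both documents end up with a ("tag", "awesome") pair, so a query on tag does not need to know whether the value arrived alone or inside an array.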

Mapping an object

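The object is the base structure (analogous to a record in SQL). Elasticsearch extends the traditional use of objects, thus allowing for recursive embedded objects.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. Again, I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.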

How to do it...

We can rewrite the mapping code from the previous recipe's example using an array of items:

PUT test/_mapping
{
    "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword", "store" : true},
      "sent" : {"type" : "boolean"},
      "item" : {
        "type" : "object",
        "properties" : {
          "name" : {"type" : "text"},
          "quantity" : {"type" : "integer"},
          "price" : {"type" : "double"},
          "vat" : {"type" : "double"}
          }
        }
     }
}

How it works...

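Elasticsearch speaks native JSON, so every complex JSON structure can be mapped into it.

When Elasticsearch parses an object type, it tries to extract the fields and processes them according to the defined mapping. If no mapping is defined, it learns the structure of the object using reflection.

The most important attributes of an object are as follows: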

properties: This is a collection of fields or objects (we can consider them as columns in the SQL world).
enabled: This establishes whether or not the object should be processed. If it's set to false, the data contained in the object is not indexed and it cannot be searched (default true).
dynamic: This allows Elasticsearch to add new field names to the object using a reflection on the values of the inserted data. If it's set to false, when you try to index an object containing a new field, the new field is silently ignored (it is not indexed). If it's set to strict, when a new field is present in the object, an error is raised and the index process is skipped. The dynamic parameter allows you to be safe about changes in the document structure (default true).
include_in_all: This adds the object values to the special _all field (used to aggregate the text of all document fields) (default true).

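The most used attribute is properties, which allows you to map the fields of the object to Elasticsearch fields.

Disabling the indexing part of a document reduces the index size; however, the data cannot be searched. In other words, you end up with a smaller file on disk, but there is a cost in functionality.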

Mapping a document

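The document is also referred to as the root object. It has special parameters that control its behavior, which are mainly used internally for special processing, such as routing or the time to live of documents.

In this recipe, we'll take a look at these special fields and learn how to use them.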

Getting ready

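You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

We can extend the preceding order example by adding some special fields, for example: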

PUT test/_mapping
{
    "_source": {
      "store": true
    },
    "_routing": {
      "required": true
    },
    "_index": {
      "enabled": true
    },
    "properties": {
 ... truncated ....
    }
}

How it works...

Every special field has its own parameters and value options, such as the following:

_id: This allows you to index only the ID part of the document. All the ID queries will speed up using the ID value (default not indexed and not stored).
_index: This controls whether or not the index must be stored as part of the document. It can be enabled by setting the "enabled": true parameter (enabled=false default).
_source: This controls the storage of the document source. Storing the source is very useful, but it has a storage overhead; if you do not need it, you can turn it off (enabled=true default).
_routing: This defines the shard that will store the document. It supports additional parameters, such as required (true/false). This is used to force the presence of the routing value, raising an exception if not provided.

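Controlling how a document is indexed and processed is very important, and it allows you to resolve issues related to complex data types.

Every special field has parameters for setting a particular configuration, and some of their behaviors may change between different releases of Elasticsearch.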

Using dynamic templates in document mapping

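In the Using explicit mapping creation recipe, we saw how Elasticsearch is able to guess the field type using reflection. In this recipe, we'll see how we can help it improve its guessing capabilities via dynamic templates.

The dynamic template feature is very useful. For example, it may be useful in situations where you need to create several indices with similar types, because it allows you to move the need to define mappings from coded initial routines to automatic index-document creation. A typical usage is to define types for Logstash log indices.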

Getting ready

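You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

How to do it...

We can extend the previous mapping by adding document-related settings, as follows: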

PUT test/_mapping
{
    "dynamic_date_formats":["yyyy-MM-dd", "dd-MM-yyyy"],
    "date_detection":true,
    "numeric_detection":true,
    "dynamic_templates":[
      {"template1":{
        "match":"*",
        "match_mapping_type":"long",
        "mapping":{"type":"{dynamic_type}", "store":true}
      }}
    ],
    "properties" : {...}
}

How it works...

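The root object (document) controls the behavior of its fields and of all its children object fields. In the document mapping, we can define the following: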

date_detection: This enables the extraction of a date from a string (true default).
dynamic_date_formats: This is a list of valid date formats. This is used if date_detection is active.
numeric_detection: This enables you to convert strings into numbers, if possible (false default).
dynamic_templates: This is a list of templates that's used to change the explicit mapping inference. If one of these templates is matched, the rules defined in it are used to build the final mapping.

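A dynamic template is composed of two parts: the matcher and the mapping.

To match a field and activate the template, several types of matchers are available, such as the following: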

match: This allows you to define a match on the field name. The expression is a standard GLOB pattern (http://en.wikipedia.org/wiki/Glob_(programming)).
unmatch: This allows you to define the expression to be used to exclude matches (optional).
match_mapping_type: This controls the types of the matched fields. For example, string, integer, and so on (optional).
path_match: This allows you to match the dynamic template against the full dot notation of the field, for example, obj1.*.value (optional).
path_unmatch: This will do the opposite of path_match, excluding the matched fields (optional).
match_pattern: This allows you to switch the matchers to regex (regular expression); otherwise, the glob pattern match is used (optional).

The dynamic template mapping part is a standard one, but with the ability to use special placeholders, such as the following:

{name}: This will be replaced with the actual dynamic field name
{dynamic_type}: This will be replaced with the type of the matched field

The order of the dynamic templates is very important; only the first one that is matched is executed. It is good practice to put the templates with stricter rules first, and then the others.
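The matching rules and the first-match-wins ordering described above can be sketched as follows. This is a simplified model, not Elasticsearch's code: the function names `pick_template` and `render_mapping` and the template name `stored_longs` are hypothetical, only glob matching is covered (no `path_match` or regex `match_pattern`), and `fnmatch` is used as a stand-in for the glob implementation.

```python
from fnmatch import fnmatchcase


def pick_template(templates, field_name, detected_type):
    """Return (name, rule) of the first template that matches, or None."""
    for entry in templates:
        name, rule = next(iter(entry.items()))
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        if rule.get("match_mapping_type", detected_type) != detected_type:
            continue
        return name, rule  # first match wins; later templates are ignored
    return None


def render_mapping(rule, field_name, detected_type):
    """Substitute the {name} and {dynamic_type} placeholders."""
    rendered = {}
    for key, value in rule["mapping"].items():
        if isinstance(value, str):
            value = value.replace("{name}", field_name)
            value = value.replace("{dynamic_type}", detected_type)
        rendered[key] = value
    return rendered


templates = [
    # Stricter rule first: only fields detected as long.
    {"stored_longs": {"match": "*", "match_mapping_type": "long",
                      "mapping": {"type": "{dynamic_type}", "store": True}}},
    # Generic fallback for every other new field.
    {"store_generic": {"match": "*", "mapping": {"store": True}}},
]

name, rule = pick_template(templates, "age", "long")
age_mapping = render_mapping(rule, "age", "long")
```

If the two templates were swapped, the generic one would match every field first and the stricter one would never be used, which is why ordering matters.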

There's more...

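Dynamic templates are very handy when you need to set a mapping configuration for all the fields. This can be done by adding a dynamic template similar to this one: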

"dynamic_templates" : [
  {
    "store_generic" : {
     "match" : "*",
      "mapping" : {
        "store" : "true"
      }
    }
  }
]

In this example, all the new fields that are added through explicit mapping creation will be stored.

Managing nested objects

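There is a special type of embedded object: the nested one. This resolves a problem related to the Lucene indexing architecture, in which all the fields of embedded objects are viewed as a single object. During a search in Lucene, it is not possible to distinguish between values of different embedded objects in the same multi-valued array.

If we consider the previous order example, it is not possible to distinguish an item name and its quantity with the same query, as Lucene puts them in the same Lucene document object. We need to index them in different documents and then join them. This entire process is managed by nested objects and nested queries.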
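The problem described above can be made concrete with a small Python model. It is a sketch under simplifying assumptions: `flatten_items`, `object_match`, and `nested_match` are hypothetical helpers, and the second item (cap) is invented for the illustration. With the object type, the values of all items are merged into shared multi-value fields, so a query can match a name from one item and a quantity from another; evaluating each item separately, as nested objects do, avoids this.

```python
def flatten_items(order):
    """Object mapping: values of all items share multi-value fields."""
    flat = {}
    for item in order["item"]:
        for key, value in item.items():
            flat.setdefault("item." + key, []).append(value)
    return flat


def object_match(order, name, quantity):
    """Match name and quantity against the flattened fields."""
    flat = flatten_items(order)
    return (name in flat.get("item.name", [])
            and quantity in flat.get("item.quantity", []))


def nested_match(order, name, quantity):
    """Match name and quantity inside the same item only."""
    return any(item.get("name") == name and item.get("quantity") == quantity
               for item in order["item"])


order = {
    "id": "1",
    "item": [
        {"name": "tshirt", "quantity": 10},
        {"name": "cap", "quantity": 2},
    ],
}

false_positive = object_match(order, "tshirt", 2)
nested_result = nested_match(order, "tshirt", 2)
```

Here the flattened check reports a match for a tshirt with quantity 2 even though no such item exists, while the per-item check does not.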

Getting ready

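You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

A nested object is defined as a standard object with the nested type.

From the example in the Mapping an object recipe, we can change the type from object to nested, as follows: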

PUT test/_mapping
{
    "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword"},
      "sent" : {"type" : "boolean"},
      "item" : {"type" : "nested",
        "properties" : {
            "name" : {"type" : "keyword"},
            "quantity" : {"type" : "long"},
            "price" : {"type" : "double"},
            "vat" : {"type" : "double"}
          }
      }
    }
}

How it works...

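When a document is indexed, if an embedded object is marked as nested, it is extracted from the original document, indexed in a new external document, and saved in a special index position near the parent document.

In the preceding example, we reused the mapping of the Mapping an object recipe, but we changed the type of the item from object to nested. No other action is required to convert an embedded object into a nested one.

Nested objects are special Lucene documents that are saved in the same block of data as their parent; this approach allows for fast joining with the parent document.

Nested objects are not searchable with standard queries, only with nested ones. They are not shown in standard query results.

The lives of nested objects are related to their parents: deleting or updating a parent automatically deletes or updates all the nested children. Changing the parent means that Elasticsearch will do the following: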

Mark old documents as deleted
Mark all nested documents as deleted
Index the new document version
Index all nested documents

There's more...

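Sometimes, you need to propagate information about the nested objects to their parent or root objects. This is mainly to build simpler queries about the parents (such as terms queries without using nested ones). To achieve this, there are two special properties of nested objects that must be used: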

include_in_parent: This makes it possible to automatically add the nested fields to the immediate parent
include_in_root: This adds the nested object fields to the root object

These settings add data redundancy, but they reduce the complexity of some queries, thus improving performance.

Managing a child document with a join field

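In the previous recipe, we saw how it is possible to manage relationships between objects with the nested object type. The disadvantage of nested objects is their dependence on their parents. If you need to change a value of a nested object, you need to reindex the parent (this brings a potential performance overhead if the nested objects change too quickly). To solve this problem, Elasticsearch allows you to define child documents.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.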

How to do it...

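In the following example, we have two related objects: an Order and an Item.

Their UML representation is as follows:

[Figure: UML diagram of the Order and Item relationship]

The final mapping should be a merge of the field definitions of both Order and Item, plus a special field (join_field, in this example) that holds the parent/child relationship.

The mapping will be as follows: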

PUT test/_mapping
{
  "properties": {
    "join_field": {
      "type": "join",
      "relations": {
        "order": "item"
      }
    },
    "id": {
      "type": "keyword"
    },
    "date": {
      "type": "date"
    },
    "customer_id": {
      "type": "keyword"
    },
    "sent": {
      "type": "boolean"
    },
    "name": {
      "type": "text"
    },
    "quantity": {
      "type": "integer"
    },
    "vat": {
      "type": "double"
    }
  }
}

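The preceding mapping is very similar to the one in the previous recipes.

If we want to store the joined records, we need to save the parent first and then the children, like so: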

PUT test/_doc/1?refresh
 {
   "id": "1",
   "date": "2018-11-16T20:07:45Z",
   "customer_id": "100",
   "sent": true,
   "join_field": "order"
 }

PUT test/_doc/c1?routing=1&refresh
 {
   "name": "tshirt",
   "quantity": 10,
   "price": 4.3,
   "vat": 8.5,
   "join_field": {
     "name": "item",
     "parent": "1"
   }
 }

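The child item requires special management because we need to add routing with the parent ID. Furthermore, in the join_field object, we need to specify the relation name of the child (item) and the ID of its parent.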

How it works...

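The mapping of multiple item relationships in the same index needs to be computed as the sum of all the other mapping fields.

The relationship between objects must be defined in the join_field.
There must be only a single join_field per mapping; if you need to provide a lot of relationships, you can provide them in the relations object.

The child document must be indexed in the same shard as the parent; so, when it is indexed, an extra parameter must be passed, which is routing (we'll see how to do this in the Indexing a document recipe in the next chapter).

A child document doesn't require reindexing the parent document when we want to change its values. Consequently, it's fast to index, reindex (update), and delete.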
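The reason the routing parameter is needed can be illustrated with a small sketch of hash-based shard selection. This is a simplified model with loud assumptions: `shard_for` and `target_shard` are hypothetical names, the index is assumed to have five shards, and `zlib.crc32` is only a stand-in for the hash function that Elasticsearch actually uses.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Pick a shard from a routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


def target_shard(doc_id, number_of_shards, routing=None):
    """Without an explicit routing value, the document ID is used."""
    key = routing if routing is not None else doc_id
    return shard_for(key, number_of_shards)


NUMBER_OF_SHARDS = 5  # assumed value for the illustration

parent_shard = target_shard("1", NUMBER_OF_SHARDS)
# The child c1 is routed with the parent ID, as in PUT test/_doc/c1?routing=1
child_shard = target_shard("c1", NUMBER_OF_SHARDS, routing="1")
```

Because the child is hashed on the parent ID rather than on its own ID, it always lands in the shard that holds its parent, whatever the hash function is.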

There's more...

In Elasticsearch, we have different ways to manage relationships between objects, as follows:

Embedding with type=object: This is implicitly managed by Elasticsearch and it considers the embedding as part of the main document. It's fast, but you need to reindex the main document to change a value of the embedded object.
Nesting with type=nested: This allows for more accurate search and filtering of the parent using nested queries on children. Everything works for the embedded object except for query (you must use a nested query to search for them).
External children documents: Here, the children are the external document, with a join_field property to bind them to the parent. They must be indexed in the same shard as the parent. The join with the parent is a bit slower than the nested one, because the nested objects are in the same data block of the parent in Lucene index and they are loaded with the parent, otherwise, the child document requires more read operations.

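The choice of how to model the relationship between objects depends on your application scenario.

There is also another approach, but on big data documents it brings poor performance: decoupling the join relation. You do the join query in two steps: first, you collect the IDs of the children/other documents, and then you search for them in a field of their parent.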

Adding a field with multiple mappings

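Often, a field must be processed with several core types or in different ways. For example, a string field must be processed as tokenized for search and not-tokenized for sorting. To do this, we need to define a fields multifield special property.

The fields property is a very powerful feature of mappings because it allows you to use the same field in different ways.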

Getting ready

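You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

To define a multifield property, we need to define a dictionary containing the fields subfield. The subfield with the same name as the parent field is the default one.

If we consider the item from our order example, we can index the name in the following way: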

{
  "name": {
    "type": "keyword",
    "fields": {
      "name": {"type": "keyword"},
      "tk": {"type": "text"},
      "code": {"type": "text", "analyzer": "code_analyzer"}
    }
  }
}

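If we already have a mapping stored in Elasticsearch and we want to migrate the fields in a multifield property, it is enough to save a new mapping with a different type, and Elasticsearch provides the merge automatically. New subfields in the fields property can be added without problems at any moment, but the new subfields will only be available for searching and aggregating on newly indexed documents.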

When you add a new subfield to already indexed data, you need to reindex your record to ensure you have it correctly indexed for all your records.

How it works...

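During indexing, when Elasticsearch processes a fields property of the multifield type, it reprocesses the same field for every subfield defined in the mapping.

To access the subfields of a multifield, we build a new path from the base field plus the subfield name. If we consider the preceding example, we have the following:

name: This points to the default multifield subfield (the keyword one)
name.tk: This points to the standard analyzed (tokenized) text field
name.code: This points to a field analyzed with a code extractor analyzer

As you may have noticed in the preceding example, we changed the analyzer to introduce a code extractor analyzer that allows you to extract the item code from a string.

Using the multifield, if we index a string such as Good Item to buy - ABC1234, we'll have the following:

name = Good Item to buy - ABC1234 (useful for sorting)
name.tk = ["good", "item", "to", "buy", "abc1234"] (useful for searching)
name.code = ["ABC1234"] (useful for searching and faceting)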

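In the case of the code analyzer, if the code is not found in the string, no tokens are generated. This makes it possible to develop solutions that carry out information retrieval tasks at index time and use them at search time.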
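To make the three views of the same value tangible, here is a minimal Python sketch that reproduces the example for Good Item to buy - ABC1234. It is an approximation only: `analyze_multifield` is a hypothetical helper, the tokenizer is a plain regular expression rather than the standard analyzer, and the item-code pattern (three uppercase letters followed by four digits) is an assumption, since the text does not define code_analyzer.

```python
import re

# Assumed item-code format: three uppercase letters and four digits.
CODE_PATTERN = re.compile(r"\b[A-Z]{3}\d{4}\b")


def analyze_multifield(value):
    """Return the terms produced for name, name.tk, and name.code."""
    return {
        # keyword: the whole string is kept as a single term
        "name": value,
        # text: split on non-alphanumeric characters and lowercase
        "name.tk": [t.lower() for t in re.findall(r"[A-Za-z0-9]+", value)],
        # code extractor: only the item codes, or nothing at all
        "name.code": CODE_PATTERN.findall(value),
    }


fields = analyze_multifield("Good Item to buy - ABC1234")
no_code = analyze_multifield("Good Item to buy")
```

When the string contains no code, the name.code entry is an empty list, which corresponds to the case where no tokens are generated.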

There's more...

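The fields property is very useful in data processing, because it allows you to define several ways to process field data.

For example, if we are working on document content (such as articles, Word documents, and so on), we can define subfields with analyzers that extract names, places, dates/times, geolocation, and so on.

The subfields of a multifield are standard core type fields: we can perform every process we want on them, such as search, filter, aggregation, and scripting.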

Mapping a GeoPoint field

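Elasticsearch natively supports the use of geolocation types: special types that allow you to localize your document in geographic coordinates (latitude and longitude) around the world.

There are two main types used in the geographic world: the point and the shape. In this recipe, we'll look at GeoPoint, the base element of geolocation.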

Getting ready

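You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

The type of the field must be set to geo_point to define a GeoPoint.

We can extend the order example by adding a new field that stores the location of a customer. This will be the result: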

PUT test/_mapping
{
  "properties": {
    "id": {"type": "keyword"},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "customer_ip": {"type": "ip"},
    "customer_location": {"type": "geo_point"},
    "sent": {"type": "boolean"}
    }
}

How it works...

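When Elasticsearch indexes a document with a GeoPoint field (lat, lon), it processes the latitude and longitude coordinates and creates special accessory field data to provide faster query capabilities on these coordinates. This is because a special data structure is created to manage latitude and longitude internally.

Depending on the properties, given the latitude and longitude, it is possible to compute a geohash value (http://en.wikipedia.org/wiki/Geohash); the index process also optimizes these values for special computations, such as distance, ranges, and shape matching.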
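For readers unfamiliar with geohashes, the following Python sketch shows the standard public algorithm: longitude and latitude ranges are halved alternately, each decision contributes one bit, and every five bits become one base-32 character. It illustrates the geohash encoding itself, not how Elasticsearch stores geo points internally; the function name `geohash_encode` is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=12):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


prefix = geohash_encode(45.61752, 9.08363, precision=3)
```

Points that are close to each other usually share a long common prefix, which is what makes the encoding useful for proximity and range computations.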

GeoPoint has special parameters that allow you to store additional geographic data:

lat_lon: This allows you to store the latitude and longitude as the .lat and .lon fields. Storing these values improves the performance in many memory algorithms used in distance and in shape calculus (false default).

It makes sense to set lat_lon to true so that the values are stored if the field has a single point value. This speeds up searches and reduces memory usage during computation.

geohash: This allows you to store the computed geohash value (false default).
geohash_precision: This defines the precision to be used in geohash calculus (12 default). For example, given a geo point value [45.61752, 9.08363], it will store the following:
- customer_location = 45.61752, 9.08363
- customer_location.lat = 45.61752
- customer_location.lon = 9.08363
- customer_location.geohash = u0n7w8qmrfj

There's more...

GeoPoint is a special type and can accept several formats as input:

lat and lon as properties, as shown here:

{
  "customer_location": {
    "lat": 45.61752,
    "lon": 9.08363
  }
}

lat and lon as a string, as follows:

"customer_location": "45.61752,9.08363",

geohash string, as shown here:

"customer_location": "u0n7w8qmrfj",

As a GeoJSON array (note in it lat and lon are reversed), shown in the following code snippet:

"customer_location": [9.08363, 45.61752]

Mapping a GeoShape field

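An extension of the concept of a point is the shape. Elasticsearch provides a type that facilitates the management of arbitrary polygons: the GeoShape.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

To be able to use advanced shape management, Elasticsearch requires two JAR libraries in its classpath (usually the lib directory), as follows: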

Spatial4J (v0.3)
JTS (v1.13)

How to do it...

To map a geo_shape type, a user must explicitly provide some parameters:

tree: This is the name of the PrefixTree implementation—geohash for GeohashPrefixTree and quadtree for QuadPrefixTree (geohash default).
precision: This is used instead of tree_levels to provide a more human value to be used in the tree level. The precision number can be followed by the unit, that is, 10 m, 10 km, 10 miles, and so on.
tree_levels: This is the maximum number of layers to be used in the prefix tree.
distance_error_pct: This sets the maximum error allowed in a prefix tree (default 0.025, that is, 2.5%; maximum 0.5).

The customer_location mapping, which we saw in the previous recipe, will be as follows when using geo_shape:

"customer_location": {
  "type": "geo_shape",
  "tree": "quadtree",
  "precision": "1m"
},

How it works...

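When a shape is indexed or searched internally, a path tree is created and used.

A path tree is a list of terms that contains geographic information and is computed to improve performance in evaluating geometric calculus.

The path tree also depends on the shape type: point, linestring, polygon, multipoint, and multipolygon.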

Mapping an IP field

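Elasticsearch is used in a lot of systems to collect and search logs, such as Kibana (https://www.elastic.co/products/kibana) and Logstash (https://www.elastic.co/products/logstash). To improve searching when using IP addresses, Elasticsearch provides the IPv4 and IPv6 types, which can be used to store IP addresses in an optimized way.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

How to do it...

You need to define the type of the field that contains an IP address as ip.

Using the preceding order example, we can extend it by adding the customer IP, as follows: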

"customer_ip": {
  "type": "ip"
}

The IP must be in the standard dot notation form, as follows:

"customer_ip":"19.18.200.201"

How it works...

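When Elasticsearch processes a document, if a field is an IP one, it tries to convert its value into a numerical form and generates tokens for fast value searching.

The IP has special properties: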

index: This defines whether the field must be indexed. If not, false must be used (true default).
doc_values: This defines whether the field values should be stored in a column-stride fashion to speed up sorting and aggregations (true default).

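The other properties (store, boost, null_value, and include_in_all) work as in the other base types.

The advantages of using IP fields over strings are faster speed in every range and filter, and lower resource usage (disk and memory).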
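The benefit of a numerical form can be sketched with Python's standard ipaddress module. This is an analogy rather than Elasticsearch's internal encoding: the helper names `ip_to_number` and `in_range` are hypothetical, and the /16 network is an example value chosen for the illustration.

```python
import ipaddress


def ip_to_number(address):
    """Convert a dotted IPv4 (or IPv6) address into an integer."""
    return int(ipaddress.ip_address(address))


def in_range(address, cidr):
    """Check whether an address falls inside a CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


number = ip_to_number("19.18.200.201")
inside = in_range("19.18.200.201", "19.18.0.0/16")
outside = in_range("19.19.0.1", "19.18.0.0/16")
```

Once addresses are numbers, a range or subnet filter becomes a numeric comparison instead of a string match, which is the intuition behind the speed and size advantages mentioned above.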

Mapping an alias field

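It is very common to have a lot of different types in several indices. Because Elasticsearch makes it possible to search in many indices, you should filter on common fields at the same time.

In the real world, these fields are not always called the same way in all the mappings (generally because they are derived from different entities); it is very common to have a mix of add_date, timestamp, @timestamp, and date_add fields that all refer to the same date concept.

The alias field allows you to define an alias name that is resolved at query time, to simplify the call of all the fields that have the same meaning.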

Getting ready

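You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

If we take the order example that we saw in the previous recipes, we can add an alias named cost for the price value in the item subfield.

This process can be achieved by performing the following steps: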

To add this alias, we need to have a mapping that's similar to the following:

PUT test/_mapping
{
  "properties": {
    "id": {"type": "keyword"},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "sent": {"type": "boolean"},
    "item": {
      "type": "object",
      "properties": {
        "name": {"type": "keyword"},
        "quantity": {"type": "long"},
        "cost": {
          "type": "alias",
          "path": "item.price"
        },
        "price": {"type": "double"},
        "vat": {"type": "double"}
      }
    }
  }
}

We can now index a record as follows:

PUT test/_doc/1?refresh
{
  "id": "1",
  "date": "2018-11-16T20:07:45Z",
  "customer_id": "100",
  "sent": true,
  "item": [
    {
      "name": "tshirt",
      "quantity": 10,
      "price": 4.3,
      "vat": 8.5
    }
  ]
}

We can search it using the cost alias like so:

GET test/_search
{
  "query": {
    "term": {
      "item.cost": 4.3
    }
  }
}

The result will be the saved document.

How it works...

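The alias is a convenient way to use the same name for your search fields without changing the data structure of your fields. An alias field does not require changes to the document structure, thus allowing more flexibility in your data models.

The alias is resolved during the search index expansion of the query, and there is no performance penalty due to its usage.

If you try to index a document with a value in an alias field, an exception is thrown.

The path of the alias field must contain the full resolution of the target field, which must be a concrete field and must be known when the alias is defined.

In the case of an alias in a nested object, it must be in the same nested scope as the target.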

Mapping a Percolator field

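Percolator is a special type of field that makes it possible to store an Elasticsearch query inside the field and use it in a percolator query.

The Percolator can be used to detect all the queries that match a document.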

Getting ready

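You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

To map a percolator field, use the following steps: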

We want to create a Percolator that matches some text in a body field. We will define mapping in a similar way:

PUT test-percolator
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "body": {
        "type": "text"
      }
    }
  }
}

Now, we can store a document with a query percolator inside it, as follows:

PUT test-percolator/_doc/1?refresh
{
  "query": {
    "match": {
      "body": "quick brown fox"
    }
  }
}

We can now execute a search on it, as shown in the following code:

GET test-percolator/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "body": "fox jumps over the lazy dog"
      }
    }
  }
}

The stored document will be returned in the hits, as follows:

{
  ... truncated...
    "hits" : [
      {
        "_index" : "test-percolator",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "query" : {
            "match" : {
              "body" : "quick brown fox"
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [0]
        }
      }
    ]
  }
}

How it works...

The Percolator field stores an Elasticsearch query inside it.

Because all the percolators are cached and are always active for performance, all the fields required in the query must be defined in the mapping of the document.

Because all the queries in all the percolator documents are executed against every document, the queries stored inside the percolators must be optimized so that they execute quickly inside the percolator query.
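The reverse-search idea behind the percolator can be modeled in a few lines: the queries are stored, and an incoming document is tested against each of them. The sketch below is a deliberately naive model, assuming that a match query succeeds when any of its terms occurs in the document field; the names `tokens`, `match_query`, and `percolate` and the second stored query are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a text and split it into a set of terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_query(query_text, document_text):
    """Simplified match query: true if any query term is in the text."""
    return bool(tokens(query_text) & tokens(document_text))


def percolate(stored_queries, document):
    """Return the IDs of all the stored queries that match the document."""
    hits = []
    for query_id, query in stored_queries.items():
        field, text = next(iter(query["match"].items()))
        if match_query(text, document.get(field, "")):
            hits.append(query_id)
    return hits


stored = {
    "1": {"match": {"body": "quick brown fox"}},
    "2": {"match": {"body": "purple elephant"}},  # invented second query
}
hits = percolate(stored, {"body": "fox jumps over the lazy dog"})
```

Every stored query is evaluated for each percolated document in this model, which shows why slow stored queries directly slow down the percolator query.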

Mapping feature and feature vector fields

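It's common to score a document dynamically, depending on the context. For example, you may want to score documents that are more specific to a category; the typical scenario is to boost (increase the score of low-scoring) documents based on a value such as page rank, hits, or categories.

Elasticsearch 7.x provides two new ways to boost your scores based on values: one is the feature field and the other is its extension to a vector of values (in released 7.x versions, these types are named rank_feature and rank_features).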

Getting ready

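You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute these commands, any HTTP client can be used, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

We want to use the feature type to implement a common PageRank scenario where the documents are scored based on the same characteristics. To do so, we will perform the following steps:

To be able to score based on a pagerank value and an inverse url length, we will use a similar mapping: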

PUT test-feature
{
  "mappings": {
    "properties": {
      "pagerank": {
        "type": "feature"
      },
      "url_length": {
        "type": "feature",
        "positive_score_impact": false
      }
    }
  }
}

Now, we can store a document as shown here:

PUT test-feature/_doc/1
{
  "pagerank": 5,
  "url_length": 20
}

Now, we can execute a feature query on the pagerank value to return our record with a similar query, like so:

GET test-feature/_search
{
  "query": {
    "feature": {
      "field": "pagerank"
    }
  }
}

The evolution of the previous feature functionality is to define a vector of values using the feature_vector type; usually, it can be used to score by topics, categories, or similar discerning facets. We can implement this functionality using the following steps:

The following code defines the mapping for the categories field:

PUT test-features
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "feature_vector"
      }
    }
  }
}

We can now store some documents in the index by using the following commands:

PUT test-features/_doc/1
{
  "categories": {
    "sport": 14.2,
    "economic": 24.3
  }
}

PUT test-features/_doc/2
{
  "categories": {
    "sport": 19.2,
    "economic": 23.1
  }
}

Now, we can search based on saved feature values, as shown here:

GET test-features/_search
{
  "query": {
    "feature": {
      "field": "categories.sport"
    }
  }
}

How it works...

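feature and feature vector are special types of fields that are used for storing values that are mainly used to score the results.

The values stored in these fields can only be queried using the feature query. They cannot be used in standard queries and aggregations.

Each numeric value in a feature or feature_vector field must be a single positive value (multi-values are not allowed).

In the case of feature_vector, the value must be a hash, composed of a string key and a positive numeric value.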
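The constraints above can be checked on the client before indexing. The following Python sketch is only an illustration of those rules; the function name validate_feature_vector is hypothetical, and Elasticsearch performs its own validation at index time.

```python
from numbers import Real


def validate_feature_vector(value):
    """Return True if value is a hash of string keys to positive numbers."""
    if not isinstance(value, dict):
        return False
    for name, weight in value.items():
        if not isinstance(name, str):
            return False
        # bool is a subclass of int, so reject it explicitly; lists are
        # rejected because multi-values are not allowed.
        if isinstance(weight, bool) or not isinstance(weight, Real):
            return False
        if weight <= 0:
            return False
    return True


print(validate_feature_vector({"sport": 14.2, "economic": 24.3}))  # True
print(validate_feature_vector({"sport": [14.2, 19.2]}))            # False
```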

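There is a flag that changes the behavior of the scoring: positive_score_impact. This value is true by default, but if you want the value of the feature to decrease the score, you can set the parameter to false. In the pagerank example, the length of the url reduces the score of the document because the longer the url is, the less relevant it becomes.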
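To see how a stored value can raise or lower a score, the following Python sketch models the contribution with a saturation curve, value / (value + pivot). This is an illustrative model, not Elasticsearch's internal code: the pivot of 8, the function names, and the use of the reciprocal for fields with positive_score_impact set to false are all assumptions chosen to show the direction of the effect.

```python
def saturation(value, pivot):
    """Map a positive value into (0, 1); larger values score higher."""
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    return value / (value + pivot)


def feature_score(value, pivot=8.0, positive_score_impact=True):
    """Toy score contribution of one feature value (assumed model)."""
    if not positive_score_impact:
        # Assumption: invert the value so that larger inputs score lower.
        value = 1.0 / value
    return saturation(value, pivot)


# A higher pagerank yields a higher contribution.
print(feature_score(5) < feature_score(10))  # True
# A longer URL yields a lower contribution when the impact is negative.
print(feature_score(20, positive_score_impact=False)
      > feature_score(40, positive_score_impact=False))  # True
```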

Adding metadata to a mapping

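Sometimes, when we are working with our mapping, we need to store some additional data to be used for display purposes, ORM facilities, permissions, or simply to track them in the mapping.

Elasticsearch allows you to store every kind of JSON data you want in the mapping with the special _meta field.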

Getting ready

You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

How to do it...

The _meta mapping field can be populated with any data we want. Consider the following example:

{
  "_meta": {
    "attr1": ["value1", "value2"],
    "attr2": {
      "attr3": "value3"
    }
  }
}
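The _meta object sits at the top level of the mappings definition, next to properties. The following Python sketch assembles such a request body and reads the metadata back from a mapping response; the helper names and the sample response are hypothetical and only mirror the shape returned by a GET on an index's _mapping endpoint.

```python
import json


def mappings_with_meta(properties, meta):
    """Return an index-creation body whose mappings carry a _meta object."""
    json.dumps(meta)  # raises TypeError if meta is not valid JSON data
    return {"mappings": {"_meta": meta, "properties": properties}}


def read_meta(mapping_response, index):
    """Extract _meta from a mapping response (empty dict if absent)."""
    return mapping_response[index]["mappings"].get("_meta", {})


meta = {"attr1": ["value1", "value2"], "attr2": {"attr3": "value3"}}
body = mappings_with_meta({"name": {"type": "keyword"}}, meta)

# Simulated response for an index called "test" (no cluster is contacted).
response = {"test": body}
print(read_meta(response, "test")["attr2"]["attr3"])  # value3
```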

How it works...

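When Elasticsearch processes a new mapping and finds a _meta field, it stores it in the global mapping status and propagates the information to all the cluster nodes.

The _meta field is only used for storage purposes; it is not indexed or searchable.

It can be used for the following reasons: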

Storing type metadata
Storing object relational mapping (ORM) related information
Storing type permission information
Storing extra type information (for example, the icon filename used to display the type)
Storing template parts for rendering web interfaces

Specifying different analyzers

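In the previous recipes, we saw how to map different fields and objects in Elasticsearch, and we described how easy it is to change the standard analyzer with the analyzer and search_analyzer properties.

In this recipe, we will look at several analyzers and learn how to use them to improve indexing and searching quality.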

Getting ready

You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

How to do it...

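Every core type field allows you to specify a custom analyzer for indexing and for searching as field parameters.

For example, if we want the name field to use the standard analyzer for indexing and the simple analyzer for searching, the mapping will be as follows: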

{
  "name": {
    "type": "text",
    "analyzer": "standard",
    "search_analyzer": "simple"
  }
}

How it works...

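The concept of an analyzer comes from Lucene (the core of Elasticsearch). An analyzer is a Lucene element that is composed of a tokenizer, which splits a text into tokens, and zero or more token filters. These filters carry out token manipulation such as lowercasing, normalization, removing stopwords, stemming, and so on.

During the indexing phase, when Elasticsearch processes a field that must be indexed, an analyzer is chosen by first checking whether one is defined on the field itself, then in the document, and finally in the index.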

Choosing the correct analyzer is essential to getting good results during the query phase.

Elasticsearch provides several analyzers in its standard installation. In the following table, the most common ones are described:

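Name	Description
`standard`	This divides the text using a standard tokenizer: it normalizes tokens, lowercases tokens, and removes unwanted tokens
`simple`	This splits text at non-letter characters and converts the tokens to lowercase
`whitespace`	This divides text at whitespace separators
`stop`	This processes the text like the simple analyzer and then removes stopwords, which can be customized
`keyword`	This considers all the text as a single token
`pattern`	This divides text using a regular expression
`snowball`	This works as a standard analyzer, plus stemming at the end of the processing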
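The difference between some of these analyzers is easiest to see on a sample string. The following Python sketch is a rough emulation of the whitespace, simple, and keyword behaviors described in the table, written with regular expressions; it is not the Lucene implementation, and the function names are made up for illustration.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split at every non-letter character and lowercase the tokens."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def keyword_analyzer(text):
    """Treat the whole input as one token."""
    return [text]


sample = "The QUICK brown-fox, 2 times"
print(whitespace_analyzer(sample))  # ['The', 'QUICK', 'brown-fox,', '2', 'times']
print(simple_analyzer(sample))      # ['the', 'quick', 'brown', 'fox', 'times']
print(keyword_analyzer(sample))     # ['The QUICK brown-fox, 2 times']
```

A query for brown would match the tokens produced by the simple emulation but not the brown-fox, token kept by the whitespace one, which is why the analyzer choice affects search results.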

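For special language purposes, Elasticsearch supports a set of analyzers aimed at analyzing text in a specific language, such as Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Italian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.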

Mapping a completion field

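To be able to provide search functionalities for our users, one of the most common requirements is to provide text suggestions for our queries.

Elasticsearch provides a helper for achieving this functionality using a special type of mapping called completion.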

Getting ready

You need an up-and-running Elasticsearch installation, as we described in Chapter 1, Getting Started.

How to do it...

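The definition of a completion field is similar to that of the previous core type fields. For example, to provide suggestions for a name with an alias, we can write a mapping similar to the following: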

{
  "name": {"type": "text", "copy_to": ["suggest"]},
  "alias": {"type": "text", "copy_to": ["suggest"]},
  "suggest": {
    "type": "completion",
    "analyzer": "simple",
    "search_analyzer": "simple"
  }
}

In this example, we have defined two string fields, name and alias, and a suggest completer for them.

How it works...

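There are several ways to provide suggestions in Elasticsearch. You can get simple term suggestions, or use queries with wildcards or prefixes, but the completion field is much faster and more powerful because of the natively optimized structure it uses.

Internally, Elasticsearch builds a finite state transducer (FST) structure to suggest terms. (The topic is described in detail on the following Wikipedia page: http://en.wikipedia.org/wiki/Finite_state_transducer.)

The most important properties that can be configured to use the completion field are as follows: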

analyzer: This defines the analyzer to be used for indexing within this document. The default is simple, which keeps stopwords such as at, the, of, and so in the suggested terms (simple default).
search_analyzer: This defines the analyzer to be used for searching (simple default).
preserve_separators: This controls how tokens are processed. If it is disabled, the spaces are removed from the suggestion; this makes it possible to match fightc as fight club (true default).
max_input_length: This property limits the number of characters of the input string that are used, which reduces the number of suggested terms. Suggesting on very long text makes little sense, since no one writes a long string of text and expects a suggestion for it (50 default).
payloads: In releases before 5.0, this allowed you to store payloads (additional item values to be returned) (false default). For example, if you are searching for a book, it is useful to return not only the book title but also its ISBN. From Elasticsearch 5.0 onward, this property is no longer accepted; the suggester returns the _source of the matching document instead, so extra values such as the ISBN can be stored as regular fields.

The following example indexes a document with its suggestion inputs:

PUT test/_doc/1
{
  "name": "Elasticsearch Cookbook",
  "isbn": "1782166629",
  "suggest": {
    "input": ["ES", "Elasticsearch", "Elastic Search", "ElasticSearch Cookbook"],
    "weight": 34
  }
}

In the preceding example, we can see some of the functionalities that are available during indexing for the completion field, which are as follows:

input: This manages a list of provided values that are usable for suggesting. If you are able to enrich your data, this can improve the quality of your suggester.
weight: This is a weight boost to be used to score the suggester (optional).
output and payload: Releases before 5.0 also accepted an optional output string (a text representation shown to the user as the result) and a payload (extra data to be returned). Both were removed in 5.0, where the suggestion text is the matched input and extra data is read from the returned _source.
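The roles of input, weight, and preserve_separators can be illustrated with a toy in-memory suggester. The following Python sketch uses a plain list and a linear prefix scan rather than an FST, so it shows the behavior only, not the data structure; the class and method names are hypothetical.

```python
class ToySuggester:
    """Prefix suggester keyed on lowercased inputs (simplified model)."""

    def __init__(self, preserve_separators=True):
        self.preserve_separators = preserve_separators
        self.entries = []  # (normalized input, weight, document id)

    def _normalize(self, text):
        text = text.lower()
        if not self.preserve_separators:
            # With separators dropped, "fightc" can match "fight club".
            text = text.replace(" ", "")
        return text

    def add(self, doc_id, inputs, weight=1):
        for text in inputs:
            self.entries.append((self._normalize(text), weight, doc_id))

    def suggest(self, prefix, size=5):
        """Return matching document ids, highest weight first."""
        prefix = self._normalize(prefix)
        best = {}
        for text, weight, doc_id in self.entries:
            if text.startswith(prefix):
                best[doc_id] = max(weight, best.get(doc_id, weight))
        ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
        return [doc_id for doc_id, _ in ranked[:size]]


suggester = ToySuggester(preserve_separators=False)
suggester.add("1", ["ES", "Elasticsearch", "Elastic Search"], weight=34)
suggester.add("2", ["Elastic Stack"], weight=10)
suggester.add("3", ["Fight Club"], weight=5)

print(suggester.suggest("elastic"))  # ['1', '2']
print(suggester.suggest("fightc"))   # ['3']
```

Supplying several inputs for one document, as the first add call does, is what lets short forms such as ES lead to the same suggestion as the full title.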

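At the beginning of this recipe, I used a shortcut to populate the completion field from several fields with the copy_to field property. The copy_to property simply copies the content of one field into one or more other fields.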
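The effect of copy_to can be pictured as follows. This Python sketch is a simplified model of the behavior described above, not how Elasticsearch implements it; the function name apply_copy_to and the sample document are hypothetical. The copied values are only collected for the target field, and the original document is left unchanged.

```python
def apply_copy_to(document, mapping):
    """Return the values each target field would receive via copy_to."""
    copied = {}
    for field, definition in mapping.items():
        if field not in document:
            continue
        targets = definition.get("copy_to", [])
        if isinstance(targets, str):  # copy_to accepts one name or a list
            targets = [targets]
        for target in targets:
            copied.setdefault(target, []).append(document[field])
    return copied


mapping = {
    "name": {"type": "text", "copy_to": ["suggest"]},
    "alias": {"type": "text", "copy_to": ["suggest"]},
    "suggest": {"type": "completion"},
}
document = {"name": "Elasticsearch Cookbook", "alias": "ES Cookbook"}

print(apply_copy_to(document, mapping))
# {'suggest': ['Elasticsearch Cookbook', 'ES Cookbook']}
```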
