Reading notes on "Elasticsearch 7.0 Cookbook, Fourth Edition": Scripting in Elasticsearch

Scripting in Elasticsearch

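Elasticsearch provides a powerful way to extend its capabilities with custom scripts, which can be written in several programming languages. The most common ones are Painless, Expression, and Mustache. In this chapter, we will explore how to create custom scoring algorithms, return specially processed fields, sort with custom logic, and perform complex update operations on records. Scripting in Elasticsearch can be seen as an advanced stored-procedure system for the NoSQL world, so every advanced Elasticsearch user should learn how to master it.

Elasticsearch natively provides scripting in Java (that is, Java code compiled into a JAR), Painless, Expression, and Mustache; many other interesting languages are available as plugins, such as Kotlin and Velocity. In older Elasticsearch versions, before 5.0, the official scripting language was Groovy. For better sandboxing and performance, the official language is now Painless, which Elasticsearch provides by default.

In this chapter, we will cover the following recipes: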

  • Painless scripting
  • Installing additional script plugins
  • Managing scripts
  • Sorting data using scripts
  • Computing return fields with scripting
  • Filtering a search using scripting
  • Using scripting in aggregations
  • Updating a document using scripts
  • Reindexing with a script

Painless scripting

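Painless is a simple, secure scripting language that is available in Elasticsearch by default. It was designed by the Elasticsearch team specifically for use with Elasticsearch, and it can safely be used for inline and stored scripts. Its syntax is similar to Groovy.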

Getting ready

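You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch07/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding the following code: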

script.painless.regex.enabled: true

The index used in this recipe is index-agg.

How to do it...

We will use Painless scripting to compute the score through the following steps:

  1. We can execute a search with a scripting function in Kibana using the following code:
POST /index-agg/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "doc['price'].value * 1.2"
        }
      }
    }
  }
}
  2. If everything works correctly, the result will be as follows:
{
  ...truncated ...
  "hits" : {
    "total" : 1000,
    "max_score" : 119.97963,
    "hits" : [
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "857",
        "_score" : 119.97963,
        "_source" : {
          ... truncated ...
          "price" : 99.98302508488757,
         ... truncated ...
      },
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 119.90164,
        "_source" : {
          ...truncated ...
          "price" : 99.91804048691392,
... truncated ...
    ]
  }
}

As you can see, in this case, using the script has the same effect as sorting by price.

How it works...

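Painless is a scripting language developed for Elasticsearch for fast data processing and security (it is sandboxed to prevent malicious code injection). Its syntax is based on Groovy, and it is provided by default in every installation. Painless was marked as experimental by the Elasticsearch team because some features may change in the future; nevertheless, it is the preferred language for scripting.

Elasticsearch processes a scripting language in two steps: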

  1. The script code is compiled in an object in order to be used in a script call; if the scripting code is invalid, then an exception is raised.
  2. For each element, the script is called and the result is collected; if the script fails on some elements, then the search or computation may fail.
Using scripting is a powerful Elasticsearch functionality, but it costs a lot in terms of memory and CPU cycles. The best practice, if possible, is to optimize the indexing of data for searching or aggregating and to avoid using scripting completely.
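The two steps above can be pictured with a rough Python analogy. This is only a sketch of the compile-once, run-per-element idea and not how Elasticsearch is implemented; the run_script helper and the sample documents are hypothetical.

```python
def run_script(source, docs, params):
    """Compile a script once, then evaluate it for every document."""
    # Step 1: compile once. An invalid script raises here,
    # before any document is touched.
    code = compile(source, "<script>", "eval")
    results = []
    # Step 2: call the compiled object for each element and collect the result.
    # A failure on a single element (for example, a missing field) aborts the run.
    for doc in docs:
        results.append(eval(code, {"__builtins__": {}}, {"doc": doc, "params": params}))
    return results


# Hypothetical documents, used only for illustration.
docs = [{"price": 10.0}, {"price": 20.0}]
scores = run_script("doc['price'] * params['factor']", docs, {"factor": 1.5})
```

An invalid source such as "doc[" fails in the first step, while a script that reads a field missing from some document fails in the second one.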

The way a script is defined in Elasticsearch is always the same; the script is contained in a script object with the following format:

"script": {
    "lang":   "...",
    "source" | "id": "...",
    "params": { ... }
  }

It accepts several parameters, as follows:

  • source/id: These are the references for the script, which can be defined as follows:
    • source: If the string with your script code is provided with the call.
    • id: If the script is stored in the cluster, then it refers to the id parameter, which is used to save the script in the cluster.
  • params (an optional JSON object): This defines the parameters to be passed, which are, in the context of scripting, available using the params variable.
  • lang (the default is painless): This defines the scripting language that is to be used.
For complex scripts that contain the special character " in the text, I suggest using Kibana and triple quotes (""") to delimit the script text (similar to the special """ strings of Python and Scala). In this way, you can improve the readability of your script code.
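As a minimal sketch of the script object described above, the following Python helper assembles it as a dictionary. The build_script function and its argument names are hypothetical; only the source, id, params, and lang keys come from the format shown earlier.

```python
import json


def build_script(source=None, script_id=None, params=None, lang=None):
    """Return a dict shaped like the script object (hypothetical helper)."""
    # Exactly one of source / id references the script.
    if (source is None) == (script_id is None):
        raise ValueError("provide exactly one of source or script_id")
    script = {}
    if source is not None:
        script["source"] = source
        # lang is only emitted for inline scripts in this sketch;
        # when omitted, the text above says the default is painless.
        if lang is not None:
            script["lang"] = lang
    else:
        script["id"] = script_id
    if params:
        script["params"] = params
    return script


inline = build_script(
    source="doc['price'].value * params.factor",
    params={"factor": 1.2},
    lang="painless",
)
stored = build_script(script_id="my_script", params={"factor": 1.2})
print(json.dumps(inline, indent=2))
```

Keeping the values in params, rather than pasting them into the source string, means the script text stays the same from one call to the next.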

There's more...

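If the script is not too complex, Painless is the preferred choice; otherwise, a native plugin provides a better environment for implementing complex logic and data management.

To access document properties in Painless scripts, use the same approach as in other scripting languages: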

  • doc._score: This stores the document score; it's generally available in searching, sorting, and aggregations.
  • doc._source: This allows access to the source of the document. Use it wisely because it requires the entire source to be fetched and it's very CPU-and-memory-intensive.
  • _fields['field_name'].value: This allows you to load the value from the stored field (in the mapping, the field has the store: true parameter).
  • doc['field_name']: This extracts the document field value from the doc values of the field. In Elasticsearch, the doc values are automatically stored for every field that is not of the text type.
  • doc['field_name'].value: This extracts the value of the field_name field from the document. If the value is an array, or if you want to extract the value as an array, then you can use doc['field_name'].values.
  • doc['field_name'].empty: This returns true if the field_name field has no value in the document.
  • doc['field_name'].multivalue: This returns true if the field_name field contains multiple values.
For performance, the fastest access method for a field value is through the doc value, then the stored field, and, finally, from the source.

If the field contains a GeoPoint value, additional methods are available, such as the following:

  • doc['field_name'].lat: This returns the latitude of a GeoPoint value. If you need the value as an array, then you can use doc['field_name'].lats.
  • doc['field_name'].lon: This returns the longitude of a GeoPoint value. If you need the value as an array, then you can use doc['field_name'].lons.
  • doc['field_name'].distance(lat,lon): This returns the plane distance in miles from a latitude/longitude point.
  • doc['field_name'].arcDistance(lat,lon): This returns the arc distance in miles, which is given as a latitude/longitude point.
  • doc['field_name'].geohashDistance(geohash): This returns the distance in miles, which is given as a geohash value.

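By using these helper methods, it is possible to create advanced scripts that boost a document by distance, which is very handy when developing geospatially centered applications.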
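To make the idea of boosting by distance concrete, here is a small Python sketch, independent of Elasticsearch. It uses the standard haversine formula for the arc distance; the Earth radius constant, the kilometer unit, and the distance_boost decay rule are assumptions of this example, and the unit returned by the Elasticsearch helpers should be checked against the documentation for your version.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def arc_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_boost(score, distance_km, scale_km=10.0):
    """Hypothetical decay rule: halve the score every scale_km kilometers."""
    return score * 0.5 ** (distance_km / scale_km)


one_degree = arc_distance_km(0.0, 0.0, 0.0, 1.0)  # one degree of longitude on the equator
boosted = distance_boost(8.0, 10.0)
```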

See also

You can refer to the following URLs for further reference, which are related to this recipe:

Installing additional script plugins

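Elasticsearch provides native scripting (that is, Java code compiled into a JAR) and Painless, but many other interesting languages are also available, such as Kotlin.

At the time of writing this book, no language plugins are available as part of the official Elasticsearch plugins. Usually, plugin authors take up to a week or a month after a major release to update their plugins to the new version.

As mentioned previously, the official language is now Painless, which Elasticsearch provides by default for better sandboxing and performance.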

Getting ready

You will need an up-and-running Elasticsearch installation, similar to the one that we described in Chapter 1, Getting Started.

How to do it...

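To install additional language support for Elasticsearch (Kotlin, in this example), we will perform the following steps:

  1. From the command line, simply call the following command:
bin/elasticsearch-plugin install lang-kotlin
  2. It will print output similar to the following (note that the permission list and the final line of this sample refer to the lang-javascript plugin):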
-> Downloading lang-kotlin from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission createClassLoader
* org.elasticsearch.script.ClassPermission <<STANDARD>>
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.ContextFactory
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.Callable
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.NativeFunction
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.Script
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.ScriptRuntime
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.Undefined
* org.elasticsearch.script.ClassPermission
org.mozilla.javascript.optimizer.OptRuntime
See http://docs.oracle.com/javase/8/docs/technotes/
guides/security/permissions.html
for descriptions of what these permissions allow and the
associated risks.

Continue with installation? [y/N]y
-> Installed lang-javascript

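If the installation is successful, the output will end with Installed; otherwise, an error is returned. See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html for a description of what these permissions allow and their associated risks.

  3. Restart your Elasticsearch server in order to check that the scripting plugins are loaded: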
[...][INFO ][o.e.n.Node ] [] initializing ...
[...][INFO ][o.e.e.NodeEnvironment ] [R2Gp0ny] using [1]
data paths, mounts [[/ (/dev/disk1)]], net usable_space
[82.4gb], net total_space [930.7gb], spins? [unknown], types
[hfs]
[...][INFO ][o.e.e.NodeEnvironment ] [R2Gp0ny] heap size
[1.9gb], compressed ordinary object pointers [true]
[...][INFO ][o.e.n.Node ] [R2Gp0ny] node name
[R2Gp0ny] derived from node ID; set [node.name] to override
[...][INFO ][o.e.n.Node ] [R2Gp0ny]
version[5.0.0-beta1], pid[58291], build[7eb6260/2016-09-
20T23:10:37.942Z], OS[Mac OS X/10.12/x86_64], JVM[Oracle
Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_101/25.101-
b13]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[aggs-matrix-stats]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[ingest-common]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[lang-expression]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[lang-groovy]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[lang-mustache]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[lang-painless]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[percolator]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[reindex]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[transport-netty3]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded module
[transport-netty4]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded plugin
[lang-javascript]
[...][INFO ][o.e.p.PluginsService ] [R2Gp0ny] loaded plugin
[lang-python]
[...][INFO ][o.e.n.Node ] [R2Gp0ny] initialized
[...][INFO ][o.e.n.Node ] [R2Gp0ny] starting ...

How it works...

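Language plugins extend the number of languages that can be used for scripting. During installation, they require special permissions to access classes and methods that are forbidden by the Elasticsearch security layer, such as access to the ClassLoader or class permissions.

During Elasticsearch startup, an internal Elasticsearch service called PluginsService loads all the installed language plugins.

Installing or upgrading a plugin requires a node restart.

From version 7.x, all plugins have the same version as Elasticsearch.

The Elasticsearch community provides many common scripting languages (a full list is available on the plugin page of the Elasticsearch site at http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html), while others are available in GitHub repositories (a simple search on GitHub will let you find them).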

There's more...

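The plugin manager that we used to install plugins also provides the following commands:

  • list: This command is used to list all the installed plugins.

For example, you can execute the following command:

bin/elasticsearch-plugin list

The preceding command prints the names of the installed plugins.

  • remove: This command is used to remove an installed plugin.

For example, you can execute the following command:

bin/elasticsearch-plugin remove lang-kotlin

The result of the preceding command is as follows: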

-> Removing lang-kotlin...

Managing scripts

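Depending on how you use scripts, there are several ways to customize Elasticsearch to use your script extensions.

In this recipe, we will demonstrate how to provide scripts to Elasticsearch using files, indices, or inline.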

Getting ready

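You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch08/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding the following code: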

script.painless.regex.enabled: true

The index used in this recipe is index-agg.

How to do it...

To manage scripting, we will perform the following steps:

  1. Dynamic scripting (except Painless) is disabled by default for security reasons. We need to activate it in order to use dynamic scripting languages such as JavaScript and Python. To do this, we need to enable the scripting flags in the Elasticsearch configuration file (config/elasticsearch.yml), and then restart the cluster:
script.inline: true
script.stored: true
  2. If the dynamic script is enabled (as done in the first step), Elasticsearch lets us store the scripts in a special part of the cluster state: "_scripts". In order to put my_script in the cluster state, execute the following code:
POST /_scripts/my_script
{
  "script": {
    "source": """doc['price'].value * params.factor""",
    "lang":"painless"
  }
}
  3. The script can be used by simply referencing it in the script/id field:
POST /index-agg/_search?size=2
{
  "sort": {
    "_script": {
      "script": {
        "id": "my_script",
        "params": {
          "factor": 1.2
        }
      },
      "type": "number",
      "order": "desc"
    }
  },
  "_source": {
    "includes": [
      "price"
    ]
  }
}

How it works...

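Elasticsearch allows scripts to be loaded in different ways, and each approach has its pros and cons. Scripts can also be stored in the special _scripts section of the cluster state. The REST endpoints are as follows:

GET http://<server>/_scripts/<id> (to retrieve a script)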
PUT http://<server>/_scripts/<id> (to store a script)
DELETE http://<server>/_scripts/<id> (to delete a script)

A stored script can be referenced in the code using "script": {"id": "id_of_the_script"}.
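The stored-script life cycle can be sketched in Python as a set of request descriptions. This is only an illustration under stated assumptions: the script_requests helper is hypothetical, nothing is sent over the network, and the body mirrors the one used to store my_script in the steps above.

```python
import json


def script_requests(script_id, source, lang="painless"):
    """Describe the HTTP calls for one stored script (hypothetical helper)."""
    path = f"/_scripts/{script_id}"
    body = {"script": {"lang": lang, "source": source}}
    return {
        "store": ("PUT", path, json.dumps(body)),
        "retrieve": ("GET", path, None),
        "delete": ("DELETE", path, None),
    }


reqs = script_requests("my_script", "doc['price'].value * params.factor")

# A search then refers to the stored script by id and supplies the params.
search_body = {
    "sort": {
        "_script": {
            "script": {"id": "my_script", "params": {"factor": 1.2}},
            "type": "number",
            "order": "desc",
        }
    }
}
```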

The following sections of this book will use inline scripting because it is easier to use during the development and testing phases.

Generally, a good workflow is to do the development using inline dynamic scripting on request, because it's faster to prototype. Once the script is ready and no more changes are required, it should be stored in the index, so that it will be simpler to call and manage in all the cluster nodes. In production, the best practice is to disable dynamic scripting and to store the script on a disk (by dumping the indexed script to the disk) in order to improve security.

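When storing files on a disk, pay attention to the file extension; the following table summarizes the status of the plugins:

Language     Provided as        File extension   Status
Painless     Built-in/module    painless         Default
Expression   Built-in/module    expression       Deprecated
Mustache     Built-in/module    mustache         Default

Other scripting parameters that can be set in config/elasticsearch.yml are as follows: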

  • script.max_compilations_per_minute (the default is 25): This default scripting engine has a global limit for how many compilations can be done per minute. You can change this to a higher value, for example, script.max_compilations_per_minute: 1000.
  • script.cache.max_size (the default is 100): This defines how many scripts are cached; it depends on context, but, in general, it's better to increase this value.
  • script.max_size_in_bytes (the default is 65535): This defines the maximum text size for a script; for large scripts, the best practice is to develop native plugins.
  • script.cache.expire (the default is disabled): This defines a time-based expiration for the cached scripts.

There's more...

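In the preceding examples, we activated Elasticsearch scripting for all the engines, but Elasticsearch provides fine-grained settings to control them.

In Elasticsearch, scripting can be used in the following different contexts:

Context   Description
aggs      Aggregations
search    The search API, percolator API, and suggest API
update    The update API
plugin    The special scripts under the generic plugin category

Here, scripting is enabled for all the contexts by default.

You can disable all scripting by setting script.allowed_contexts: none in elasticsearch.yml.

To activate scripting only for update and search, you can use script.allowed_contexts: search, update.

For more fine-grained control, scripting can be restricted by script type using the script.allowed_types entry in elasticsearch.yml.

To enable only inline scripting, we can use the following setting: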

script.allowed_types: inline 

See also

Sorting data using scripts

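Elasticsearch provides scripting support for sorting. In real-world applications, there is often a need to modify the default sorting using an algorithm that depends on the context and on some external variables. Some common scenarios are as follows: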

  • Sorting places near a point
  • Sorting by most read articles
  • Sorting items by custom user logic
  • Sorting items by revenue
Because the computing of scores on a large dataset is very CPU-intensive, if you use scripting, then it's better to execute it on a small dataset using standard score queries for detecting the top documents, and then execute a rescoring on the top subset.

Getting ready

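You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch07/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding script.painless.regex.enabled: true.

The index used in this recipe is index-agg.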

How to do it...

To sort using scripting, we will perform the following steps:

  1. If we want to order our documents by the square root of the price field multiplied by a factor parameter (that is, sales tax), then the search will be as follows:
POST /index-agg/_search?size=3
{
  "sort": {
    "_script": {
      "script": {
        "source": """
Math.sqrt(doc["price"].value *
params.factor)
""",
        "params": {
          "factor": 1.2
        }
      },
      "type": "number",
      "order": "desc"
    }
  },
  "_source": {
    "includes": [
      "price"
    ]
  }
}

Here, we have used a sort script; in real-world applications, the documents to be sorted should not have a high cardinality.

  2. If everything's correct, then the result that is returned by Elasticsearch will be as follows:
{
  ... truncated ...
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "857",
        "_score" : null,
        "_source" : {
          "price" : 99.98302508488757
        },
        "sort" : [
          10.953521329536066
        ]
      },
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : null,
        "_source" : {
          "price" : 99.91804048691392
        },
        "sort" : [
          10.949960954152345
        ]
      },
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "762",
        "_score" : null,
        "_source" : {
          "price" : 99.86804119988182
        },
        "sort" : [
          10.947221126414913
        ]
      }
    ]
  }
}

How it works...

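The sort parameter, which we discussed in Chapter 4, Exploring Search Capabilities, can be extended with the help of scripting.

The sort scripting allows you to define several parameters, such as the following: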

  • order (default "asc") ("asc" or "desc"): This determines whether the order must be ascending or descending
  • type: This defines the type in order to convert the value
  • script: This contains the script object that is to be executed

Extending the sort parameter with scripting allows you to use a broader approach for scoring your hits.

Elasticsearch scripting permits the use of any code that you want to use; for instance, you can create custom complex algorithms for scoring your documents.
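What the sort script does can be mimicked in a few lines of Python. This is a sketch rather than the engine's implementation: script_sort is a hypothetical helper, and the three prices are taken from the _source values in the response above, so the computed values should land close to (not necessarily bit-for-bit equal to) the sort values that Elasticsearch returned.

```python
import math

# Prices copied from the _source of the three hits shown above.
hits = [
    {"_id": "762", "price": 99.86804119988182},
    {"_id": "857", "price": 99.98302508488757},
    {"_id": "136", "price": 99.91804048691392},
]


def script_sort(docs, script, params, order="asc"):
    """Compute one sort value per document, then order the documents by it."""
    keyed = [(script(doc, params), doc) for doc in docs]
    keyed.sort(key=lambda pair: pair[0], reverse=(order == "desc"))
    return [{"_id": doc["_id"], "sort": [value]} for value, doc in keyed]


ranked = script_sort(
    hits,
    lambda doc, params: math.sqrt(doc["price"] * params["factor"]),
    {"factor": 1.2},
    order="desc",
)
```

Because the square root and the multiplication by a positive factor both preserve order, the ranking is the same as a plain descending sort on price; the script only changes the value reported in the sort array.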

There's more...

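Painless and Groovy provide a lot of built-in functions (mainly taken from the Java Math class) that can be used in scripts, such as the following:

Function                Description
time()                  This is the current time in milliseconds
sin(a)                  This returns the trigonometric sine of an angle
cos(a)                  This returns the trigonometric cosine of an angle
tan(a)                  This returns the trigonometric tangent of an angle
asin(a)                 This returns the arc sine of a value
acos(a)                 This returns the arc cosine of a value
atan(a)                 This returns the arc tangent of a value
toRadians(angdeg)       This converts an angle measured in degrees to an approximately equivalent angle measured in radians
toDegrees(angrad)       This converts an angle measured in radians to an approximately equivalent angle measured in degrees
exp(a)                  This returns Euler's number raised to the power of a value
log(a)                  This returns the natural logarithm (base e) of a value
log10(a)                This returns the base 10 logarithm of a value
sqrt(a)                 This returns the correctly rounded positive square root of a value
cbrt(a)                 This returns the cube root of a double value
IEEEremainder(f1, f2)   This computes the remainder operation on two arguments, as prescribed by the IEEE 754 standard
ceil(a)                 This returns the smallest (closest to negative infinity) value that is greater than or equal to the argument and is equal to a mathematical integer
floor(a)                This returns the largest (closest to positive infinity) value that is less than or equal to the argument and is equal to a mathematical integer
rint(a)                 This returns the value that is closest in value to the argument and is equal to a mathematical integer
atan2(y, x)             This returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta)
pow(a, b)               This returns the value of the first argument raised to the power of the second argument
round(a)                This returns the closest integer to the argument
random()                This returns a random double value
abs(a)                  This returns the absolute value of a value
max(a, b)               This returns the greater of two values
min(a, b)               This returns the smaller of two values
ulp(d)                  This returns the size of the unit in the last place of the argument
signum(d)               This returns the signum function of the argument
sinh(x)                 This returns the hyperbolic sine of a value
cosh(x)                 This returns the hyperbolic cosine of a value
tanh(x)                 This returns the hyperbolic tangent of a value
hypot(x, y)             This returns sqrt(x^2 + y^2) without intermediate overflow or underflow

If you want to retrieve records in a random order, you can use a script with a random method, as shown in the following code:

POST /index-agg/_search?size=2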
{
  "sort": {
    "_script": {
      "script": {
        "source": "Math.random()"
      },
      "type": "number",
      "order": "asc"
    }
  }
}

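In this example, for every hit, the new sort value is computed by executing the Math.random() scripting function.

Computing return fields with scripting

Elasticsearch allows us to define complex expressions that can be used to return a new calculated field value.

These special fields are called script_fields, and they can be expressed with a script in every available Elasticsearch scripting language.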

Getting ready

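You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch07/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding script.painless.regex.enabled: true.

The index used in this recipe is index-agg.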

How to do it...

To compute return fields with scripting, we will perform the following steps:

  1. Return the following script fields:
  • "my_calc_field": This concatenates the texts of the "name" and "description" fields
  • "my_calc_field2": This multiplies the "price" value by the "discount" parameter

  2. From the command line, we will execute the following code:
POST /index-agg/_search?size=2
{
  "script_fields": {
    "my_calc_field": {
      "script": {
        "source": """params._source.name + " -- " + params._source.description"""
      }
    },
    "my_calc_field2": {
      "script": {
        "source": """doc["price"].value * params.discount""",
        "params": {
          "discount": 0.8
        }
      }
    }
  }
}
  3. If everything is all right, then the result that is returned by Elasticsearch will be as follows:
{
  ... truncated ...
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "my_calc_field" : [
            "Valkyrie -- ducimus nobis harum doloribus voluptatibus libero nisi omnis officiis exercitationem amet odio odit dolor perspiciatis minima quae voluptas dignissimos facere ullam tempore temporibus laboriosam ad doloremque blanditiis numquam placeat accusantium at maxime consectetur esse earum velit officia dolorum corporis nemo consequatur perferendis cupiditate eum illum facilis sunt saepe"
          ],
          "my_calc_field2" : [
            15.696847534179689
          ]
        }
      },
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "my_calc_field" : [
            "Omega Red -- quod provident sequi rem placeat deleniti exercitationem veritatis quasi accusantium accusamus autem repudiandae"
          ],
          "my_calc_field2" : [
            56.201733398437504
          ]
        }
      }
    ]
  }
}

How it works...

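Script fields are similar to executing a SQL function on a field during a select. In Elasticsearch, after the search phase has been executed and the hits to be returned have been calculated, if some fields (standard or script) are defined, they are computed and returned.

A script field, which can be defined in any supported language, is processed by passing a value to the source of the document and, if some other parameters are defined in the script (such as the discount factor), they are passed to the script function.

The script function is a code snippet, so it can contain everything that the language allows to be written; however, it must evaluate to a value (or a list of values).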
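The behavior described above can be sketched in Python as follows. This is an illustration only: script_fields is a hypothetical helper, the sample hit is shortened and uses a made-up price, and both scripts read from the source dictionary, whereas in the recipe the second script reads the price from the doc values.

```python
def script_fields(hits, fields):
    """Compute the script fields for hits that have already been selected."""
    out = []
    for hit in hits:
        # Each script result is wrapped in a list, as in the "fields"
        # section of the response shown above.
        computed = {
            name: [script(hit["_source"], params)]
            for name, (script, params) in fields.items()
        }
        out.append({"_id": hit["_id"], "fields": computed})
    return out


# Hypothetical, shortened document used only for illustration.
hits = [
    {"_id": "1", "_source": {"name": "Valkyrie", "description": "ducimus nobis", "price": 20.0}},
]

fields = {
    "my_calc_field": (lambda src, params: src["name"] + " -- " + src["description"], {}),
    "my_calc_field2": (lambda src, params: src["price"] * params["discount"], {"discount": 0.5}),
}

result = script_fields(hits, fields)
```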

See also

You can refer to the following recipes for further reference:

  • The Installing additional script plugins recipe in this chapter to install additional languages for scripting
  • The Sorting data using scripts recipe in this chapter for a reference to extra built-in functions for Painless scripts

Filtering a search using scripting

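In Chapter 4, Exploring Search Capabilities, we explored many filters. Elasticsearch scripting allows the extension of a traditional filter with custom scripts.

Using scripting to create a custom filter is a convenient way to write scripting rules that are not provided by Lucene or Elasticsearch, and to implement business logic that is not available in a DSL query.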

Getting ready

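You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch07/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding script.painless.regex.enabled: true.

The index used in this recipe is index-agg.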

How to do it...

To filter a search using scripting, we will perform the following steps:

  1. We'll write a search using a filter that filters out a document with a price value that is less than a parameter value:
POST /index-agg/_search?pretty&size=3
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": """doc['price'].value > params.param1""",
            "params": {
              "param1": 80
            }
          }
        }
      }
    }
  },
  "_source": {
    "includes": [
      "name",
      "price"
    ]
  }
}

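In this example, all the documents in which the price value is greater than param1 are considered qualified to be returned.

This script filter is used for demonstration purposes; in real-world applications, it can be replaced with a range query, which is much faster.

  2. If everything is correct, the result that is returned by Elasticsearch will be as follows: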
{
  ... truncated ...
  "hits" : {
    "total" : 190,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 0.0,
        "_source" : {
          "price" : 86.65705393127125,
          "name" : "Bishop"
        }
      },
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "14",
        "_score" : 0.0,
        "_source" : {
          "price" : 84.9516714617024,
          "name" : "Crusader"
        }
      },
      {
        "_index" : "index-agg",
        "_type" : "_doc",
        "_id" : "15",
        "_score" : 0.0,
        "_source" : {
          "price" : 98.22030937628774,
          "name" : "Stacy, George"
        }
      }
    ]
  }
}

How it works...

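A script filter is a language script that returns a Boolean value (true or false). For every hit, the script is evaluated and, if it returns true, the hit passes the filter. This type of scripting can only be used as a Lucene filter and not as a query, because it does not affect the search score (the hits above all have a _score of 0.0).

The script code can be any code in your preferred supported scripting language that returns a Boolean value.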
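As a small illustration of the filtering logic, the following Python sketch applies a Boolean script to each document and, for comparison, shows a range query body that expresses the same condition. The script_filter helper and the third document are hypothetical; the range query is the standard alternative that this recipe recommends for real applications.

```python
def script_filter(docs, predicate, params):
    """Keep only the documents for which the script returns True."""
    return [doc for doc in docs if predicate(doc, params)]


docs = [
    {"name": "Bishop", "price": 86.65705393127125},
    {"name": "Crusader", "price": 84.9516714617024},
    {"name": "cheap item", "price": 12.5},  # hypothetical document
]

kept = script_filter(
    docs,
    lambda doc, params: doc["price"] > params["param1"],
    {"param1": 80},
)

# The same condition written as a range filter, which avoids running a script.
range_query = {
    "query": {"bool": {"filter": {"range": {"price": {"gt": 80}}}}}
}
```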

See also

You can refer to the following recipes for further reference:

  • The Installing additional script plugins recipe in this chapter to install additional languages for scripting
  • The Sorting data using scripts recipe for a reference to extra built-in functions that are available for Painless scripts

Using scripting in aggregations

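Scripting can be used in aggregations to extend their analytics capabilities, either to change the values used in metric aggregations or to define new rules for creating buckets.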

Getting ready

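You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch07/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding script.painless.regex.enabled: true.

The index used in this recipe is index-agg.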

How to do it...

To use a scripting language in aggregations, we will perform the following steps:

  1. Write a metric aggregation that selects the field using script:
POST /index-agg/_search?size=0
{
  "aggs": {
    "my_value": {
      "sum": {
        "script": {
          "source": """doc["price"].value * doc["price"].value"""
        }
      }
    }
  }
}
  2. If everything is correct, then the result that is returned by Elasticsearch will be as follows:
{
  ... truncated ...
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_value" : {
      "value" : 3363069.561000406
    }
  }
}
  3. Then, write a metric aggregation that applies a value script to the price field:
POST /index-agg/_search?size=0
{
  "aggs": {
    "my_value": {
      "sum": {
        "field": "price",
        "script": {
          "source": "_value * _value"
        }
      }
    }
  }
}
  4. If everything is correct, then the result that is returned by Elasticsearch will be as follows:
{
  ... truncated ...
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_value" : {
      "value" : 3363069.561000406
    }
  }
}
  5. Again, write a term bucket aggregation that changes the terms using script:
POST /index-agg/_search?size=0
{
  "aggs": {
    "my_value": {
      "terms": {
        "field": "tag",
        "size": 5,
        "script": {
          "source": """
if(params.replace.containsKey(_value.toUpperCase())) {
  params.replace[_value.toUpperCase()] 
} else {
  _value.toUpperCase() 
}
""",
          "params": {
            "replace": {
              "LABORUM": "Result1",
              "MAIORES": "Result2",
              "FACILIS": "Result3"
            }
          }
        }
      }
    }
  }
}
  6. If everything is correct, then the result that is returned by Elasticsearch will be as follows:
{
  ... truncated ...
  "aggregations" : {
    "my_value" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 2755,
      "buckets" : [
        {
          "key" : "Result1",
          "doc_count" : 31
        },
        {
          "key" : "Result2",
          "doc_count" : 25
        },
        {
          "key" : "Result3",
          "doc_count" : 25
        },
        {
          "key" : "IPSAM",
          "doc_count" : 24
        },
        {
          "key" : "SIT",
          "doc_count" : 24
        }
      ]
    }
  }
}

How it works...

Elasticsearch provides two kinds of aggregation, as follows:

  • Metrics that compute some values
  • Buckets that aggregate documents in a bucket

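In both cases, you can use script or value script (if you define the field to be used in the aggregation). The object accepted in an aggregation is the standard script object; the value returned by the script is used for the aggregation.

If a field is defined in the aggregation, you can use the value script aggregation. In this case, in the context of the script, there is a special _value variable available that contains the value of the field.

Using scripting in aggregations is a very powerful feature; however, using it on large-cardinality aggregations can be very CPU-intensive and can slow down query times.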
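The difference between a script and a value script can be sketched in Python. This is a simplified model under stated assumptions: the three documents are hypothetical, the helper names are made up, and the terms helper counts each rewritten key once per document.

```python
from collections import Counter

# Hypothetical documents used only for illustration.
docs = [
    {"price": 2.0, "tag": ["laborum", "sit"]},
    {"price": 3.0, "tag": ["maiores"]},
    {"price": 4.0, "tag": ["sit"]},
]


def sum_script(docs, script):
    """Metric with a script: the script receives the whole document."""
    return sum(script(doc) for doc in docs)


def sum_value_script(docs, field, script):
    """Metric with a value script: the field selects the value seen as _value."""
    return sum(script(doc[field]) for doc in docs)


def terms_value_script(docs, field, script, size):
    """Terms bucket with a value script that rewrites every term."""
    counts = Counter()
    for doc in docs:
        # Count each rewritten key at most once per document.
        for key in {script(value) for value in doc[field]}:
            counts[key] += 1
    return counts.most_common(size)


by_script = sum_script(docs, lambda doc: doc["price"] * doc["price"])
by_value_script = sum_value_script(docs, "price", lambda _value: _value * _value)

replace = {"LABORUM": "Result1", "MAIORES": "Result2"}
buckets = terms_value_script(
    docs, "tag", lambda _value: replace.get(_value.upper(), _value.upper()), 5
)
```

Both metric variants produce the same sum, which matches the identical values returned by the two sum aggregations in this recipe.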

Updating a document using scripts

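Elasticsearch allows you to update a document in place. Updating a document using scripting reduces network traffic (otherwise, you need to fetch the document, change the field or fields, and then send them back) and improves performance when you need to process a large number of documents.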

Getting ready

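You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 1, Getting Started.

To execute the commands, any HTTP client can be used, such as curl (https://curl.haxx.se/) or Postman (https://www.getpostman.com/). You can also use the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To correctly execute the following commands, you will need an index populated with the ch07/populate_aggregation.txt commands; these are available in the online code.

To be able to use regular expressions in Painless scripting, you need to activate them in elasticsearch.yml by adding the following code: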

script.painless.regex.enabled: true

The index used in this recipe is index-agg.

How to do it...

To update a document using scripting, we will perform the following steps:

  1. Write an update action that increases the age value in the source of the document by the sum parameter:
POST /index-agg/_doc/10/_update
{
  "script": {
    "source": "ctx._source.age = ctx._source.age + params.sum",
    "params": {
      "sum": 2
    }
  }
}
  2. If everything is correct, then the result that is returned by Elasticsearch will be as follows:
{
  "_index" : "index-agg",
  "_type" : "_doc",
  "_id" : "10",
  "_version" : 3,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2002,
  "_primary_term" : 3
}
  3. If we now retrieve the document, we will have the following code:
GET /index-agg/_doc/10
{
  "_index" : "index-agg",
  "_type" : "_doc",
  "_id" : "10",
  "_version" : 3,
  "found" : true,
  "_source" : {
    ...truncated...
    "age" : 102, ...truncated...
}

As you can see from the preceding result, the version number has increased by 1.

How it works...

The REST HTTP method that is used to update a document is POST. The URL contains only the index name, the type, the document ID, and the action:

http://<server>/<index_name>/_doc/<document_id>/_update

The update operation is composed of three different steps, as follows:

  1. The Get API call: This operation is very fast; it works on real-time data (there is no need to refresh) and retrieves the record
  2. The script execution: The script is executed in the document and, if required, it is updated
  3. Saving the document: The document, if needed, is saved

The script execution follows this workflow:

  • The script is compiled and the result is cached to improve re-execution. The compilation depends on the scripting language; that is, it detects errors in the script such as typographical errors, syntax errors, and language-related errors. The compilation step can also be CPU-bound, so that Elasticsearch caches the compilation results for further execution.
  • The document is executed in the script context; the document data is available in the ctx variable in the script.

The update script can set several parameters in the ctx variable; the most important parameters are as follows:

  • ctx._source: This contains the source of the document.
  • ctx._timestamp: If it's defined, this value is set to the document timestamp.
  • ctx.op: This defines the main operation type to be executed. There are several available values, such as the following:
    • index: This is the default value; the record is reindexed with the update values.
    • delete: The document is deleted and not updated (that is, this can be used for updating a document or removing it if it exceeds a quota).
    • none: The document is skipped without reindexing the document. 
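If you need to execute a large number of update operations, it is better to perform them in bulk to improve your application's performance.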

There's more...

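In the following example, we will execute an update that adds new tags and labels values to an object; however, we will mark the document for indexing only if the tags or labels values are changed: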

POST /index-agg/_doc/10/_update
{
  "script": {
    "source": """
    ctx.op = "none";
    if(ctx._source.containsKey("tags")){
      for(def item : params.new_tags){
        if(!ctx._source.tags.contains(item)){
          ctx._source.tags.add(item);
          ctx.op = "index";
        }
      }
    }else{ 
      ctx._source.tags=params.new_tags; 
      ctx.op = "index" 
    }
    
    if(ctx._source.containsKey("labels")){
      for(def item : params.new_labels){
        if(!ctx._source.labels.contains(item)){
          ctx._source.labels.add(item);
          ctx.op = "index"
        }
      }
    }else{
      ctx._source.labels=params.new_labels;
      ctx.op = "index"
    }
""",
    "params": {
      "new_tags": [
        "cool",
        "nice"
      ],
      "new_labels": [
        "red",
        "blue",
        "green"
      ]
    }
  }
}

The preceding script uses the following steps:

  1. It marks the operation to none to prevent indexing if, in the following steps, the original source is not changed.
  2. It checks whether the tags field is available in the source object.
  3. If the tags field is available in the source object, then it iterates all the values of the new_tags list. If the value is not available in the current tags list, then it adds it and updates the operation to index.
  4. If the tags field doesn't exist in the source object, then it simply adds it to the source and marks the operation to index.
  5. The steps from 2 to 4 are repeated for the labels value. The repetition is present in this example to show the Elasticsearch user how it is possible to update multiple values in a single update operation.

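You can consolidate different script operations into a single script. To do so, use the previously explained workflow of building the script, adding sections to the script, and changing ctx.op only if the record is changed.

This script can be quite complex, but it shows the powerful scripting capabilities of Elasticsearch.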
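The control flow of that update script can be mirrored in plain Python to see when a document would be reindexed. This is only a sketch of the logic and not Painless: the ctx variable is modeled as a dictionary, and merge_values and run_update are hypothetical helper names.

```python
def merge_values(ctx, field, new_values):
    """Add the missing values to a list field and flag the document as changed."""
    source = ctx["_source"]
    if field in source:
        for item in new_values:
            if item not in source[field]:
                source[field].append(item)
                ctx["op"] = "index"
    else:
        # The field does not exist yet: create it and mark the change.
        source[field] = list(new_values)
        ctx["op"] = "index"


def run_update(source, params):
    """Model one scripted update; op stays "none" when nothing changes."""
    ctx = {"_source": source, "op": "none"}
    merge_values(ctx, "tags", params["new_tags"])
    merge_values(ctx, "labels", params["new_labels"])
    return ctx


params = {"new_tags": ["cool", "nice"], "new_labels": ["red", "blue", "green"]}
first = run_update({"tags": ["cool"]}, params)
second = run_update(first["_source"], params)  # same update applied again
```

Running the same update a second time leaves op at none, which is the case where the document is skipped instead of being reindexed.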

Reindexing with a script

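Reindexing is a functionality introduced in Elasticsearch 5.x for automatically reindexing your data in a new index. This action is often performed for a variety of reasons, mainly for mapping changes that require a full reindex of your data.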

Getting ready

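You will need an up-and-running Elasticsearch installation, similar to the one that we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Managing Mapping.

To execute curl from the command line, you need to install curl for your operating system.

To correctly execute the following commands, you will need an index populated with the chapter_09/populate_for_scripting.sh script (available in the online code) and the JavaScript or Python language scripting plugins installed.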

How to do it...

To reindex with a script, we will perform the following steps:

  1. Create the destination index as this is not created by the reindex API:
PUT /reindex-scripting
{
  "mappings": {
    "test-type": {
      "properties": {
        "name": {
          "term_vector": "with_positions_offsets",
          "boost": 1,
          "store": true,
          "type": "text"
        },
        "title": {
          "term_vector": "with_positions_offsets",
          "boost": 1,
          "store": true,
          "type": "text"
        },
        "parsedtext": {
          "term_vector": "with_positions_offsets",
          "boost": 1,
          "store": true,
          "type": "text"
        },
        "tag": {
          "type": "keyword",
          "store": true
        },
        "processed": {
          "type": "boolean"
        },
        "date": {
          "type": "date",
          "store": true
        },
        "position": {
          "type": "geo_point",
          "store": true
        },
        "uuid": {
          "boost": 1,
          "store": true,
          "type": "keyword"
        }
      }
    }
  }
}
  2. Write a reindex action that adds a processed field (a Boolean field set to true); it should look as follows:
POST /_reindex
{
  "source": {
    "index": "index-agg"
  },
  "dest": {
    "index": "reindex-scripting"
  },
  "script": {
    "source": """
if(!ctx._source.containsKey("processed")){
  ctx._source.processed=true
}
"""
  }
}
  3. If everything is correct, then the result that is returned by Elasticsearch should be as follows:
{
  "took" : 386,
  "timed_out" : false,
  "total" : 1000,
  "updated" : 0,
  "created" : 1000,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
  4. Now if we retrieve the same documents, we will have the following code:
GET /reindex-scripting/_doc/10
{
  "_index" : "reindex-scripting",
  "_type" : "_doc",
  "_id" : "10",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "date" : "2012-06-21T16:46:01.689622",
    "processed" : true,
    ... truncated ...
  }
}

From the preceding result, we can see that the script is applied.

How it works...

The scripting in reindex offers a very powerful functionality because it allows the execution of many useful actions, such as the following:

  • Computing new fields
  • Removing fields from a document
  • Adding a new field with default values
  • Modifying the field values

The script works in the same way as for update, but during reindexing you can also change the following document metadata fields:

  • _id: This is the ID of the document
  • _type: This is the type of the document
  • _index: This is the destination index of the document
  • _version: This is the version of the document
  • _routing: This is the routing value to send the document in a specific shard
  • _parent: This is the parent of the document

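The possibility of changing these values provides a lot of options during reindexing; for example, splitting a type into two different indices, or partitioning an index into several indices by changing the _index value.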
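Partitioning an index by changing the _index value can be illustrated with a short Python model of the reindex loop. It is a sketch under stated assumptions: the reindex and split_by_price functions, the sample documents, the price threshold, and the reindex-scripting-expensive index name are all hypothetical.

```python
def reindex(docs, dest_index, script):
    """Copy documents to dest_index, letting a script change source and metadata."""
    result = {}
    for doc in docs:
        # Each document gets its own ctx, pre-filled with the destination index.
        ctx = {"_index": dest_index, "_id": doc["_id"], "_source": dict(doc["_source"])}
        script(ctx)
        result.setdefault(ctx["_index"], []).append(ctx)
    return result


def split_by_price(ctx):
    # Same idea as the recipe: add the processed flag when it is missing.
    if "processed" not in ctx["_source"]:
        ctx["_source"]["processed"] = True
    # Hypothetical routing rule: expensive items go to a second index.
    if ctx["_source"]["price"] >= 50:
        ctx["_index"] = "reindex-scripting-expensive"


# Hypothetical documents used only for illustration.
docs = [
    {"_id": "1", "_source": {"price": 19.6}},
    {"_id": "7", "_source": {"price": 86.6}},
]
indices = reindex(docs, "reindex-scripting", split_by_price)
```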