Elasticsearch has become a common component in big data architectures because it provides the following features:
- It lets you search massive amounts of data very quickly
- For common aggregation operations, it provides real-time analytics on big data
- It's easier to use an Elasticsearch aggregation than a Spark one
- If you need to move to a fast data solution, starting from the subset of documents returned by a query is faster than rescanning all your data
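The most common big data software for processing data is now Apache Spark (http://spark.apache.org/), which is considered the evolution of the now-obsolete Hadoop MapReduce, moving processing from disk to memory.
In this chapter, we will see how to integrate Elasticsearch with Spark, for both writing and reading data. Finally, we will see how to use Apache Pig to write data to Elasticsearch in a simple way.
In this chapter, we will cover the following recipes: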
- Installing Apache Spark
- Indexing data using Apache Spark
- Indexing data with meta using Apache Spark
- Reading data with Apache Spark
- Reading data using Spark SQL
- Indexing data with Apache Pig