vlambda Blog

[Components] A Frequently Asked Interview Question: Introduce Hadoop

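  This is the first article in a series on commonly used big data components. Hadoop holds a central place in the big data field, so the series starts with it. A question that comes up often in interviews is "What is Hadoop?" or "Please introduce Hadoop," and this article focuses on answering it. The question looks simple, but a poor or incomplete answer leaves a bad impression on the interviewer. The most authoritative source for a good answer is the official Hadoop website, whose description is the most accurate.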

1. What Is Hadoop

  Open the Apache Hadoop website, and the description appears at the very top:

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

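  So when answering "What is Hadoop?", be sure to mention its three defining traits: it is reliable, scalable, and built for distributed computing, which makes it well suited to processing large data sets. Its reliability comes from detecting and handling failures at the application layer, not from relying on hardware to provide high availability.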
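
  To make "handling failures at the application layer" concrete, here is a minimal sketch in plain Python. It is not Hadoop code: `run_with_retries`, `count_words_on`, `NodeFailure`, and the node names are all hypothetical, invented for this illustration. It only shows the general idea that software notices a failed machine and reruns the task elsewhere, instead of assuming the hardware never fails.

```python
from typing import Callable, Sequence, Tuple


class NodeFailure(Exception):
    """Raised by a (simulated) node that cannot finish its task."""


def run_with_retries(task: Callable[[str], int], nodes: Sequence[str]) -> Tuple[str, int]:
    """Try the task on each node in turn; return (node, result) from the first success."""
    errors = []
    for node in nodes:
        try:
            return node, task(node)
        except NodeFailure as exc:
            # The failure is detected and handled in software: record it and move on.
            errors.append(f"{node}: {exc}")
    raise RuntimeError("task failed on every node: " + "; ".join(errors))


def count_words_on(node: str) -> int:
    """Hypothetical task: pretend node-1 has a bad disk while the other nodes work."""
    if node == "node-1":
        raise NodeFailure("disk error")
    return len("reliable scalable distributed computing".split())


if __name__ == "__main__":
    node, result = run_with_retries(count_words_on, ["node-1", "node-2", "node-3"])
    print(node, result)  # node-2 4
```

  The caller never sees the failure of `node-1`; it just gets a result from the next healthy node. Hadoop's real mechanisms are far more involved, but this is the spirit of the official description.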

2. Modules

Modules

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

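  When introducing Hadoop, it is easy to remember only the frequently used HDFS and MapReduce. Take care not to leave out YARN, the job scheduling and resource management framework, or the shared Common module.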
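
  The "simple programming model" behind MapReduce can be sketched with the classic word count. The code below is a single-process illustration in plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical. In a real cluster the map and reduce tasks run in parallel on many machines and the framework performs the shuffle between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
    # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

  The user only writes the map and reduce functions; distribution, grouping, and retries are the framework's job, which is why the model is described as simple.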

3. The Hadoop Ecosystem

Related projects

Other Hadoop-related projects at Apache include:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually alongwith features to diagnose their performance characteristics in a user-friendly manner.

  • Avro™: A data serialization system.

  • Cassandra™: A scalable multi-master database with no single points of failure.

  • Chukwa™: A data collection system for managing large distributed systems.

  • HBase™: A scalable, distributed database that supports structured data storage for large tables.

  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

  • Mahout™: A Scalable machine learning and data mining library.

  • Ozone™: A scalable, redundant, and distributed object store for Hadoop.

  • Pig™: A high-level data-flow language and execution framework for parallel computation.

  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

  • Submarine: A unified AI platform which allows engineers and data scientists to run Machine Learning and Deep Learning workload in distributed cluster.

  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

  • ZooKeeper™: A high-performance coordination service for distributed applications.

  Together, the projects listed above make up today's Hadoop ecosystem. Later articles in this series will study them in turn.