
PySpark + Apache Hudi in Action

1. Setup

Hudi supports Spark 2.x. Install Spark, then launch pyspark as follows:

 
   
   
 
```shell
# pyspark
export PYSPARK_PYTHON=$(which python3)
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```

  • The spark-avro module must be specified explicitly via --packages

  • The spark-avro version must match the Spark version

  • Because this example depends on spark-avro_2.11, it uses the Scala 2.11 build of hudi-spark-bundle; if you use spark-avro_2.12, switch to hudi-spark-bundle_2.12 accordingly

Initialize a few variables up front:

 
   
   
 
```python
# pyspark
tableName = "hudi_trips_cow"
basePath = "file:///tmp/hudi_trips_cow"
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
```

DataGenerator can generate sample inserts and updates based on the trip schema.

2. Insert Data

Generate some new trip data, load it into a DataFrame, and write the DataFrame to the Hudi table.

 
   
   
 
```python
# pyspark
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

df.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
```

mode(Overwrite) overwrites and recreates the table. The example configures a record key (uuid in the schema), a partition field (region/country/city), and a precombine field (ts in the schema) to guarantee that trip records are unique within each partition.
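The precombine field decides which record wins when several records in a batch share the same key. A minimal, Spark-free sketch of that semantics (the toy records and values below are made up for illustration):

```python
# Toy model of Hudi's precombine semantics: among records sharing the
# same record key (uuid), keep the one with the largest precombine
# field (ts).
records = [
    {"uuid": "a", "ts": 1.0, "fare": 10.0},
    {"uuid": "a", "ts": 3.0, "fare": 25.0},
    {"uuid": "b", "ts": 2.0, "fare": 40.0},
]

latest = {}
for r in records:
    key = r["uuid"]
    if key not in latest or r["ts"] > latest[key]["ts"]:
        latest[key] = r

# Record "a" resolves to its ts=3.0 version; "b" is untouched.
assert latest["a"]["fare"] == 25.0
assert len(latest) == 2
```

This is why the two writes of the same uuid in the update section below do not produce duplicate rows.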

3. Query Data

Load the data into a DataFrame:

 
   
   
 
```python
# pyspark
tripsSnapshotDF = spark. \
    read. \
    format("hudi"). \
    load(basePath + "/*/*/*/*")

tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
```

This query provides a read-optimized (snapshot) view of the table. Because our partition path has the format region/country/city, we load the data from the base path (basePath) with load(basePath + "/*/*/*/*").

4. Update Data

As with inserting new data, use DataGenerator to generate updates, then write the DataFrame to the Hudi table.

 
   
   
 
```python
# pyspark
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(basePath)
```

Note that the save mode is now append. In general, always use append mode unless you are creating the table for the first time. Each write operation produces a new commit, identified by a timestamp.
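In this Hudi release, commit instants are named with a second-granularity timestamp, so commit times sort lexicographically in chronological order. A small illustration; the yyyyMMddHHmmss format string is the assumption here (later Hudi versions add millisecond precision):

```python
from datetime import datetime

# Hudi commit instants look like "20200117010349": yyyyMMddHHmmss.
fmt = "%Y%m%d%H%M%S"
instant = datetime(2020, 1, 17, 1, 3, 49).strftime(fmt)
assert instant == "20200117010349"

# Lexicographic order on these strings matches chronological order,
# which is what makes begin/end instant-time comparisons work.
later = datetime(2020, 1, 17, 1, 5, 10).strftime(fmt)
assert instant < later
```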

5. Incremental Query

Hudi supports incremental pulls: you can fetch all changes made after a given commit time. If no end time is specified, the pull includes everything up to the latest commit.

 
   
   
 
```python
# pyspark
# reload data
spark. \
    read. \
    format("hudi"). \
    load(basePath + "/*/*/*/*"). \
    createOrReplaceTempView("hudi_trips_snapshot")

commits = list(map(lambda row: row[0], spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").limit(50).collect()))
beginTime = commits[len(commits) - 2]  # commit time we are interested in

# incrementally query data
incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': beginTime,
}

tripsIncrementalDF = spark.read.format("hudi"). \
    options(**incremental_read_options). \
    load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
```

This queries all changes committed after the begin time. The incremental-pull capability makes it possible to build streaming pipelines on top of batch data.
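One way to turn this into a polling pipeline is to persist the last commit time you processed and feed it back as `hoodie.datasource.read.begin.instanttime` on the next run. A minimal sketch of that bookkeeping; the checkpoint file and helper names are made up, and the Hudi read itself stays exactly as shown above:

```python
import os
import tempfile

def load_begin_instant(path, default="000"):
    """Return the last processed commit time, or `default` on the first run."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip() or default
    return default

def save_begin_instant(path, commit_time):
    """Persist the newest commit time seen in this pull."""
    with open(path, "w") as f:
        f.write(commit_time)

ckpt = os.path.join(tempfile.mkdtemp(), "hudi_begin_instant")

# First run: no checkpoint yet, so pull from the beginning ("000").
assert load_begin_instant(ckpt) == "000"

# After a pull, remember the max _hoodie_commit_time we saw.
save_begin_instant(ckpt, "20200117010349")
assert load_begin_instant(ckpt) == "20200117010349"
```

Each polling cycle then reads with `begin.instanttime` set to the checkpointed value and saves the newest `_hoodie_commit_time` it observed.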

6. Point-in-Time Query

To query the table as of a specific point in time, set the end time to a specific commit time and the begin time to "000" (representing the earliest possible commit time).

 
   
   
 
```python
# pyspark
beginTime = "000"  # Represents all commits > this time.
endTime = commits[len(commits) - 2]

# query point in time data
point_in_time_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.end.instanttime': endTime,
    'hoodie.datasource.read.begin.instanttime': beginTime
}

tripsPointInTimeDF = spark.read.format("hudi"). \
    options(**point_in_time_read_options). \
    load(basePath)

tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
```
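Why `commits[len(commits) - 2]`? Taking the second-to-last instant as the end time makes the query reproduce the table as it looked before the most recent write. A quick illustration with made-up commit times:

```python
# Hypothetical sorted commit times, as pulled from _hoodie_commit_time.
commits = ["20200117010349", "20200117010510", "20200117010627"]

# Second-to-last commit: the query will exclude the newest write.
endTime = commits[len(commits) - 2]
assert endTime == "20200117010510"

# beginTime "000" plus this endTime bounds the incremental scan to
# everything committed up to that instant.
beginTime = "000"
assert beginTime < endTime
```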

7. Delete Data

Delete the passed-in set of HoodieKeys. Note: the delete operation only supports append save mode.

 
   
   
 
```python
# pyspark
# fetch total records count
spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
# fetch two records to be deleted
ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)

# issue deletes
hudi_delete_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
# column names must match the tuple order (uuid, partitionPath) above
df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 'partitionpath']).withColumn('ts', lit(0.0))
df.write.format("hudi"). \
    options(**hudi_delete_options). \
    mode("append"). \
    save(basePath)

# run the same read query as above
roAfterDeleteViewDF = spark. \
    read. \
    format("hudi"). \
    load(basePath + "/*/*/*/*")
roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
# fetch should return (total - 2) records
spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
```

8. Summary

This post showed how to insert, update, query, and delete data in a Hudi table with pyspark. If you work with pyspark and Hudi, give it a try!
