
Reading and Writing XML Files with Spark, and Things to Watch Out For

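A reader recently asked Langjian (浪尖) how to read and write XML files with Spark, especially nested ones. Spark itself, at least in the versions this post targets, has no built-in support for reading XML, but Databricks has open-sourced a jar (spark-xml) that supports both reading and writing XML files. Here is how to use it.
Langjian previously covered how to read tar files in the Knowledge Planet (知识星球) group, and the approach is much the same.
Adding the dependency
The minor version is now up to 0.9.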
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-xml_2.11</artifactId>
  <version>0.9.0</version>
</dependency>
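For projects built with sbt rather than Maven, the same coordinates could be declared roughly as follows. This is only a sketch: it assumes `scalaVersion` in the build is a 2.11.x release, so that `%%` resolves to the `spark-xml_2.11` artifact named above.

```scala
// build.sbt (sketch). %% appends the Scala binary version, giving spark-xml_2.11
// when scalaVersion is 2.11.x (an assumption about your build).
libraryDependencies += "com.databricks" %% "spark-xml" % "0.9.0"
```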

Sample XML file
Below is a sample XML file about books:
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>

An in-depth look at creating applications with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage, and query XML data in the database.

After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository, the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB application. It provides examples of how and where you can use Oracle XML DB.

The manual then describes ways you can store and retrieve XML data using Oracle XML DB, APIs for manipulating XMLType data, and ways you can view, generate, transform, and search on existing XML data. The remainder of the manual discusses how to use Oracle XML DB repository, including versioning and security, how to access and manipulate repository resources using protocols, SQL, PL/SQL, or Java, and how to manage your Oracle XML DB application using Oracle Enterprise Manager. It also introduces you to XML messaging and Oracle Streams Advanced Queuing XMLType support. </description>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
  </book>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
  </book>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
  </book>
  <book id="bk105">
    <author>Corets, Eva</author>
    <title>The Sundered Grail</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-09-10</publish_date>
    <description>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</description>
  </book>
  <book id="bk106">
    <author>Randall, Cynthia</author>
    <title>Lover Birds</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-09-02</publish_date>
    <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description>
  </book>
  <book id="bk107">
    <author>Thurman, Paula</author>
    <title>Splish Splash</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description>
  </book>
  <book id="bk108">
    <author>Knorr, Stefan</author>
    <title>Creepy Crawlies</title>
    <genre>Horror</genre>
    <price>4.95</price>
    <publish_date>2000-12-06</publish_date>
    <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description>
  </book>
  <book id="bk109">
    <author>Kress, Peter</author>
    <title>Paradox Lost</title>
    <genre>Science Fiction</genre>
    <price>6.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description>
  </book>
  <book id="bk110">
    <author>O'Brien, Tim</author>
    <title>Microsoft .NET: The Programming Bible</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-09</publish_date>
    <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description>
  </book>
  <book id="bk111">
    <author>O'Brien, Tim</author>
    <title>MSXML3: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-01</publish_date>
    <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description>
  </book>
  <book id="bk112">
    <author>Galos, Mike</author>
    <title>Visual Studio 7: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>49.95</price>
    <publish_date>2001-04-16</publish_date>
    <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description>
  </book>
</catalog>

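Reading XML
Langjian has covered how Spark SQL loads custom data sources before, remember? You name the data source in the format function. In one case, Spark loads a class called DefaultSource from the package path you specify; in the other, you use a short name such as csv or avro to identify the source.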
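The two cases can be pictured with a small, Spark-free model. The sketch below is not Spark's actual code, only a simplified illustration of the idea: a registered short name wins, otherwise the string is tried as a class name and then as a package containing `DefaultSource`. The object name, the `candidates` helper and the `example.CsvSource` registry entry are all hypothetical.

```scala
object ProviderLookupSketch {

  // Simplified, hypothetical model of how a format(...) string could be resolved.
  // `shortNames` maps a lower-case alias to the class that implements it.
  def candidates(provider: String, shortNames: Map[String, String]): Seq[String] =
    shortNames.get(provider.toLowerCase) match {
      // A registered alias such as "csv" maps straight to its implementation.
      case Some(className) => Seq(className)
      // Otherwise try the string as a class, then as a package holding DefaultSource.
      case None => Seq(provider, s"$provider.DefaultSource")
    }

  def main(args: Array[String]): Unit = {
    val registry = Map("csv" -> "example.CsvSource") // made-up entry for illustration
    println(candidates("csv", registry))
    println(candidates("com.databricks.spark.xml", registry))
  }
}
```

Under this model, "com.databricks.spark.xml" is not a registered alias, so the second candidate, `com.databricks.spark.xml.DefaultSource`, is the class that ends up being loaded. The full read example follows.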
package com.vivo.study.xml

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object ReadBooksXMLWithNestedArray {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    // No schema is supplied, so it is inferred from the data.
    val df = spark.sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("data/books_complex.xml")

    df.printSchema()
    df.show()

    df.foreach(row => {
      println("" + row.getAs[String]("author") + "," + row.getAs[String]("_id"))
      // In the inferred schema, index 4 is otherInfo and index 7 is stores.
      println(row.getStruct(4).getAs[String]("country"))
      println(row.getStruct(4).getClass)
      // Field 0 of the stores struct is the store array.
      val arr = row.getStruct(7).getList(0)
      for (i <- 0 to arr.size - 1) {
        val b = arr.get(i).asInstanceOf[GenericRowWithSchema]
        println("" + b.getAs[String]("name") + "," + b.getAs[String]("location"))
      }
    })
  }
}
The printed schema is as follows:
root
 |-- _id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- otherInfo: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- addressline1: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- pagesCount: long (nullable = true)
 |-- price: double (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- stores: struct (nullable = true)
 |    |-- store: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- location: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |-- title: string (nullable = true)
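A few things are worth noting from the example above:
  1. We never specified a schema, yet one was printed. That shows Spark SQL inferred the schema of the XML file on its own.
  2. It shows how to read deeply nested array data that carries a schema. Langjian has included that case in the example as well.
  3. rowTag is the tag of the XML element that is treated as a row. There is also a rootTag, which is the root tag of the XML file (it appears in the write example).
  4. The _id field comes from the XML itself (the id attribute on book) rather than from a child element, and the underscore prefix is added to tell the two apart. If you dislike the underscore, you can change it with the attributePrefix option. Since these attributes are not part of the user data, you can also drop them entirely if you do not care about them, using the excludeAttribute option.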
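As a rough illustration of the options in the list above, the sketch below reads the same file twice with different attribute settings. The object name, the `readBooks` helper and the `attr_` prefix are made up for this example; the option names are the ones mentioned in the list, and the file path is the one used in the read example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadBooksXMLOptions {

  val sourceName = "com.databricks.spark.xml"

  // Rename the attribute prefix; "attr_" is an arbitrary illustrative value.
  val withCustomPrefix: Map[String, String] =
    Map("rowTag" -> "book", "attributePrefix" -> "attr_")

  // Ignore XML attributes such as id altogether.
  val withoutAttributes: Map[String, String] =
    Map("rowTag" -> "book", "excludeAttribute" -> "true")

  // Hypothetical helper: apply a set of options and load the file.
  def readBooks(spark: SparkSession, path: String, options: Map[String, String]): DataFrame =
    spark.read.format(sourceName).options(options).load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("xml-options")
      .getOrCreate()
    // The id attribute should now surface with the attr_ prefix instead of _.
    readBooks(spark, "data/books_complex.xml", withCustomPrefix).printSchema()
    // The id attribute should not appear in this schema.
    readBooks(spark, "data/books_complex.xml", withoutAttributes).printSchema()
    spark.stop()
  }
}
```

Keeping the options in named maps is just one way to make the difference between the two reads easy to see; chained `.option(...)` calls work equally well.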
Writing XML
The API is simple, but you will rarely need to write XML. Parquet and ORC are the recommended output formats.
df2.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .save("src/main/resources/books_new.xml")
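Since Parquet and ORC are the recommended targets, a minimal sketch of writing a DataFrame in those formats might look like this. The object and method names are hypothetical, and the overwrite mode is an assumption made for the example rather than something the post prescribes.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object WriteBooksColumnar {

  // Overwrite is an assumption for this sketch; choose the save mode your job needs.
  def writeParquet(df: DataFrame, path: String): Unit =
    df.write.mode(SaveMode.Overwrite).parquet(path)

  def writeOrc(df: DataFrame, path: String): Unit =
    df.write.mode(SaveMode.Overwrite).orc(path)
}
```

Either method could be called on a DataFrame loaded from XML, for example `WriteBooksColumnar.writeParquet(df, "data/books_parquet")`, where the output path is again just an illustrative value.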
 
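Specifying the schema explicitly
Schema inference always costs performance. If you already know the schema, declaring it explicitly also makes the job easier to maintain and easier for someone else to take over.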
package com.vivo.study.xml

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

object ReadBooksXMLWithNestedArrayStruct {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("langjian")
      .getOrCreate()

    val customSchema = StructType(Array(
      StructField("_id", StringType, nullable = true),
      StructField("author", StringType, nullable = true),
      StructField("description", StringType, nullable = true),
      StructField("genre", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true),
      StructField("publish_date", StringType, nullable = true),
      StructField("title", StringType, nullable = true),
      StructField("otherInfo", StructType(Array(
        StructField("pagesCount", StringType, nullable = true),
        StructField("language", StringType, nullable = true),
        StructField("country", StringType, nullable = true),
        StructField("address", StructType(Array(
          StructField("addressline1", StringType, nullable = true),
          StructField("city", StringType, nullable = true),
          StructField("state", StringType, nullable = true)
        )))
      ))),
      StructField("stores", StructType(Array(
        StructField("store", ArrayType(
          StructType(Array(
            StructField("location", StringType, true),
            StructField("name", StringType, true)
          ))
        ))
      )))
    ))

    val df = spark.sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(customSchema)
      .load("data/books_complex.xml")

    df.printSchema()
    df.show()

    df.foreach(row => {
      println("" + row.getAs[String]("author") + "," + row.getAs[String]("_id"))
      // Nested struct looked up by name, then its field by name.
      println(row.getAs[GenericRowWithSchema]("otherInfo").getAs[String]("country"))
      // With this schema, index 7 is otherInfo and index 8 is stores.
      println(row.getStruct(7).getClass)
      val arr = row.getStruct(8).getList(0)
      for (i <- 0 to arr.size - 1) {
        val b = arr.get(i).asInstanceOf[GenericRowWithSchema]
        println("" + b.getAs[String]("name") + "," + b.getAs[String]("location"))
      }
    })
  }
}
A tip: look at the println calls that touch otherInfo here to see how the nested data structure is parsed.
The corresponding printed schema:
root
 |-- _id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- price: double (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- otherInfo: struct (nullable = true)
 |    |-- pagesCount: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- addressline1: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |-- stores: struct (nullable = true)
 |    |-- store: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- location: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
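Note that the positional lookups differ between the two examples: otherInfo and stores sit at indexes 4 and 7 in the inferred schema, but at 7 and 8 in the explicit one. One way to avoid depending on column order is to address nested fields by name through the DataFrame API. The sketch below is an alternative to the foreach loops above, not the approach used in this post; the object name, the `flattenStores` helper and the output column names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

object FlattenStores {

  // Produces one output row per (book, store) pair, using field names only.
  // Note: explode drops books whose store array is null or empty.
  def flattenStores(books: DataFrame): DataFrame =
    books
      .select(
        col("author"),
        col("otherInfo.country").as("country"),
        explode(col("stores.store")).as("store"))
      .select(
        col("author"),
        col("country"),
        col("store.name").as("store_name"),
        col("store.location").as("store_location"))
}
```

Calling `FlattenStores.flattenStores(df).show()` on either DataFrame from this post should list each store next to its book's author and country, regardless of how the columns are ordered. If books without stores must be kept, `explode_outer` (available from Spark 2.2) could be used in place of `explode`.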
The data for this small example is as follows:
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>

An in-depth look at creating applications with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage, and query XML data in the database.

After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository, the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB application. It provides examples of how and where you can use Oracle XML DB.

The manual then describes ways you can store and retrieve XML data using Oracle XML DB, APIs for manipulating XMLType data, and ways you can view, generate, transform, and search on existing XML data. The remainder of the manual discusses how to use Oracle XML DB repository, including versioning and security, how to access and manipulate repository resources using protocols, SQL, PL/SQL, or Java, and how to manage your Oracle XML DB application using Oracle Enterprise Manager. It also introduces you to XML messaging and Oracle Streams Advanced Queuing XMLType support. </description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
        <location>usa</location>
      </store>
      <store>
        <name>Target</name>
        <location>UK</location>
      </store>
    </stores>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
      <store>
        <name>Target</name>
      </store>
      <store>
        <name>Walmart</name>
      </store>
    </stores>
  </book>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk105">
    <author>Corets, Eva</author>
    <title>The Sundered Grail</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-09-10</publish_date>
    <description>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk106">
    <author>Randall, Cynthia</author>
    <title>Lover Birds</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-09-02</publish_date>
    <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk107">
    <author>Thurman, Paula</author>
    <title>Splish Splash</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk108">
    <author>Knorr, Stefan</author>
    <title>Creepy Crawlies</title>
    <genre>Horror</genre>
    <price>4.95</price>
    <publish_date>2000-12-06</publish_date>
    <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk109">
    <author>Kress, Peter</author>
    <title>Paradox Lost</title>
    <genre>Science Fiction</genre>
    <price>6.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk110">
    <author>O'Brien, Tim</author>
    <title>Microsoft .NET: The Programming Bible</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-09</publish_date>
    <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk111">
    <author>O'Brien, Tim</author>
    <title>MSXML3: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-01</publish_date>
    <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
  <book id="bk112">
    <author>Galos, Mike</author>
    <title>Visual Studio 7: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>49.95</price>
    <publish_date>2001-04-16</publish_date>
    <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description>
    <otherInfo>
      <pagesCount>100</pagesCount>
      <language>english</language>
      <country>India</country>
      <address>
        <addressline1>3417 south plaza dr</addressline1>
        <city>Costa mesa</city>
        <state>CA</state>
      </address>
    </otherInfo>
    <stores>
      <store>
        <name>Costco</name>
      </store>
    </stores>
  </book>
</catalog>

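Takeaways
The API for reading XML is nothing special in itself. But everyone knows the XML format reasonably well, so ask yourself: is a single XML file usually large? If not, how should all those small files be handled? And how does the XML data source deal with partitioning at the source? Next time, Langjian will walk you through the source code to find out.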