A library for reading and writing Avro data from Spark SQL.
## Requirements

This documentation is for version 3.2.0 of this library, which supports Spark 2.0+. For documentation on earlier versions of this library, see the links below.
This library has different versions for Spark 1.2, 1.3, 1.4 through 1.6, and 2.0+:
| Spark Version | Compatible version of Avro Data Source for Spark |
| ------------- | ------------------------------------------------ |
| 1.2           | 0.2.0                                            |
| 1.3           | 1.0.0                                            |
| 1.4-1.6       | 2.0.1                                            |
| 2.0+          | 3.2.0 (this version)                             |

## Linking
This library is cross-published for Scala 2.11, so 2.11 users should replace `2.10` with `2.11` in the commands listed below.
You can link against this library in your program at the following coordinates:
Using SBT:
```
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"
```

Using Maven:

```xml
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>3.2.0</version>
</dependency>
```

## With spark-shell or spark-submit
This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the `--packages` command line option. For example, to include it when starting the Spark shell:
```
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
```

Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath. The `--packages` argument can also be used with `bin/spark-submit`.
## Features

Avro Data Source for Spark supports reading and writing of Avro data from Spark SQL.
- Automatic schema conversion: It supports most conversions between Spark SQL and Avro records, making Avro a first-class citizen in Spark.
- Partitioning: This library allows developers to easily read and write partitioned data without any extra configuration. Just pass the columns you want to partition on, as you would for Parquet.
- Compression: You can specify the type of compression to use when writing Avro out to disk. The supported types are `uncompressed`, `snappy`, and `deflate`. You can also specify the deflate level.
- Specifying record names: You can specify the record name and namespace to use by passing a map of parameters with `recordName` and `recordNamespace`.
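The features above can be sketched together in one short Scala snippet. This is a minimal illustration under stated assumptions, not an excerpt from the library's documented examples: the input path `episodes.avro`, the output path `output`, the partition column `year`, and the record name/namespace values are placeholders, and the `spark.sql.avro.*` configuration keys are assumed to be this library's compression settings.

```scala
// Minimal sketch: read/write Avro with compression, partitioning, and
// custom record names. Paths, column names, and record name/namespace
// below are placeholders; adjust them for your data.
import org.apache.spark.sql.SparkSession
import com.databricks.spark.avro._

val spark = SparkSession.builder().master("local").getOrCreate()

// Compression: deflate with an explicit level (alternatively "snappy"
// or "uncompressed"); keys assumed from this library's conventions.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

// Automatic schema conversion: the Avro schema of the input is mapped
// to a Spark SQL schema on read.
val df = spark.read.avro("episodes.avro")

// Partitioning and record names: partition on a column and pass
// recordName/recordNamespace as writer options.
df.write
  .partitionBy("year")
  .options(Map("recordName" -> "Episode", "recordNamespace" -> "com.example"))
  .avro("output")
```

Running this requires a Spark runtime with the spark-avro package on the classpath (for example via the `--packages` flag shown above).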