Original poster: Lisrelchen

[Apache Spark] spark-avro: A library for reading and writing Avro data from Spark


Avro Data Source for Apache Spark

A library for reading and writing Avro data from Spark SQL.

Requirements

This documentation is for version 3.2.0 of this library, which supports Spark 2.0+. For documentation on earlier versions of this library, see the links below.

This library has different versions for Spark 1.2, 1.3, 1.4 through 1.6, and 2.0+:

Spark Version    Compatible version of Avro Data Source for Spark
1.2              0.2.0
1.3              1.0.0
1.4-1.6          2.0.1
2.0+             3.2.0 (this version)

Linking

This library is cross-published for Scala 2.11, so 2.11 users should replace 2.10 with 2.11 in the commands listed below.

You can link against this library in your program at the following coordinates:

Using SBT:

libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"

Using Maven:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>3.2.0</version>
</dependency>
With spark-shell or spark-submit

This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the --packages command line option. For example, to include it when starting the Spark shell:

$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0

Unlike using --jars, using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.
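
For instance, the same package coordinates can be passed when submitting a job; the script name my_avro_job.py below is only a placeholder:

$ bin/spark-submit --packages com.databricks:spark-avro_2.11:3.2.0 my_avro_job.py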

Features

Avro Data Source for Spark supports reading and writing of Avro data from Spark SQL.

  • Automatic schema conversion: It supports most conversions between Spark SQL and Avro records, making Avro a first-class citizen in Spark.
  • Partitioning: This library allows developers to easily read and write partitioned data without any extra configuration. Just pass the columns you want to partition on, just like you would for Parquet.
  • Compression: You can specify the type of compression to use when writing Avro out to disk. The supported types are uncompressed, snappy, and deflate. You can also specify the deflate level.
  • Specifying record names: You can specify the record name and namespace to use by passing a map of parameters with recordName and recordNamespace (see the Python sketch after this list).
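
A minimal Python sketch of these options, in the same style as the API examples later in this thread. The recordName and recordNamespace option names come from the feature list above; the compression configuration keys (spark.sql.avro.compression.codec, spark.sql.avro.deflate.level), the partition column, and the record name/namespace values chosen here are assumptions for illustration, not taken from the original post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Read the sample Avro file shipped with the library's tests
df = spark.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")

# Assumed configuration keys for the write-time compression codec and deflate level
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

# Partitioned write with an explicit Avro record name and namespace
(df.write
    .format("com.databricks.spark.avro")
    .partitionBy("doctor")                          # partition columns, just as with Parquet
    .option("recordName", "Episode")                # hypothetical record name
    .option("recordNamespace", "com.example.avro")  # hypothetical namespace
    .save("/tmp/output_partitioned"))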


Reply #1
Lisrelchen posted on 2017-4-18 03:59:35
Java API

import org.apache.spark.sql.*;
import org.apache.spark.sql.functions;

SparkSession spark = SparkSession.builder().master("local").getOrCreate();

// Creates a DataFrame from a specified file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro")
  .load("src/test/resources/episodes.avro");

// Saves the subset of the Avro records read in
df.filter(functions.expr("doctor > 5")).write()
  .format("com.databricks.spark.avro")
  .save("/tmp/output");


Reply #2
Lisrelchen posted on 2017-4-18 03:59:56
Python API

# Creates a DataFrame from a specified directory
df = spark.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")

# Saves the subset of the Avro records read in
subset = df.where("doctor > 5")
subset.write.format("com.databricks.spark.avro").save("/tmp/output")


Reply #3
MouJack007 posted on 2017-4-18 07:08:48
Thanks for sharing, OP!


Reply #4
MouJack007 posted on 2017-4-18 07:09:20
