This is the code repository for Mastering Apache Spark 2.x - Second Edition, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.
About the Book
Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and SQL. This book aims to take your limited knowledge of Spark to the next level by teaching you how to expand Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform.
Instructions and Navigation
All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.
val clientDsParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/client.parquet").as[Client]
val clientDsBigParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/client_big.parquet").as[Client]
val accountDsParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/account.parquet").as[Account]
val accountDsBigParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/account_big.parquet").as[Account]
val clientDsBigCsv = spark.read.csv("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/client_big.csv").as[Client]
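The typed reads above assume `Client` and `Account` case classes whose fields match the Parquet/CSV schemas. The actual field names and types depend on the book's data files; a hypothetical sketch of what these definitions might look like:

```scala
// Hypothetical case classes for the typed .as[Client] / .as[Account] calls
// above -- the real field names and types come from the book's data files.
case class Client(clientId: Long, name: String, income: Double)
case class Account(accountId: Long, clientId: Long, balance: Double)

// The .as[...] conversion needs the implicit Encoders from the session:
// import spark.implicits._
```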
// Rely on a type erasure hack to pass RDD[InternalRow] back as RDD[Row]
JDBCRDD.scanTable(
  sparkSession.sparkContext,
  schema,
  requiredColumns,
  filters,
  parts,
  jdbcOptions).asInstanceOf[RDD[Row]]
}
--
JDBCRDD.scala
"We'll skip the scanTable method for now since it just parameterizes and creates a new JDBCRDD object. The most interesting method in JDBCRDD is compute, which it inherits from the abstract RDD class. Through the compute method Apache Spark tells this RDD to leave lazy mode and materialize itself. We'll show you two important fragments of this method after we've had a look at the method signature."
"Here you can see that the return type is Iterator, which allows a lazy underlying data source to be read lazily as well. As we'll see shortly, this is the case for this particular implementation too."
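Since JDBCRDD extends RDD[InternalRow], its materialization entry point is the compute override. A paraphrased sketch of the signature (the real implementation in the Spark sources opens a JDBC connection and streams the result set):

```scala
// Paraphrased from the Spark sources -- JDBCRDD extends RDD[InternalRow],
// so compute must return an Iterator[InternalRow] for the given partition.
// The body shown here is a placeholder, not the actual implementation.
class JDBCRDDSketch /* extends RDD[InternalRow] */ {
  def compute(thePart: org.apache.spark.Partition,
              context: org.apache.spark.TaskContext)
      : Iterator[org.apache.spark.sql.catalyst.InternalRow] = {
    // open a JDBC connection, run the pruned/filtered query,
    // and wrap the ResultSet in a lazy iterator
    Iterator.empty
  }
}
```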
val sqlText = s"SELECT $columnList FROM ${options.table} $myWhereClause"
"Note that the SQL statement created and stored in the sqlText constant references two interesting variables: columnList and myWhereClause. Both are derived from the requiredColumns and filters arguments passed to the JDBCRelation class. Therefore this data source can be called a smart source, because the underlying storage technology - a SQL database in this case - can be told to return only the columns and rows that are actually requested. And as already mentioned, the data source also supports pushing lazy data access patterns down to the underlying database. Here you can see that the JDBC result set is wrapped into a typed InternalRow iterator, Iterator[InternalRow]. Since this matches the return type of the compute method, we are done here."
val rowsIterator = JdbcUtils.resultSetToSparkInternalRows(rs, schema, inputMetrics)
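To make the column-pruning and filter-pushdown idea concrete, here is a heavily simplified, self-contained sketch of how a SQL string could be assembled from requiredColumns and filters. The real logic lives in JDBCRDD and additionally handles identifier quoting, SQL dialects, and the full Filter class hierarchy; the filter representation below is a simplification for illustration:

```scala
// Simplified sketch of column pruning + filter pushdown, NOT the real
// JDBCRDD code: filters are reduced here to simple (column, value)
// equality pairs for illustration.
def buildQuery(table: String,
               requiredColumns: Array[String],
               filters: Array[(String, Any)]): String = {
  // If no columns are requested, select a constant so row counts still work
  val columnList =
    if (requiredColumns.isEmpty) "1" else requiredColumns.mkString(", ")
  val whereClause =
    if (filters.isEmpty) ""
    else " WHERE " + filters
      .map { case (col, value) => s"$col = '$value'" }
      .mkString(" AND ")
  s"SELECT $columnList FROM $table$whereClause"
}
```

For example, `buildQuery("clients", Array("name"), Array(("clientId", 1)))` would yield `SELECT name FROM clients WHERE clientId = '1'`, so the database only ships the requested column and rows.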
"Internally, Apache Spark uses the class org.apache.spark.sql.catalyst.catalog.SessionCatalog for managing temporary views as well as persistent tables. Temporary views are stored in the SparkSession object, whereas persistent tables are stored in an external metastore. The abstract base class org.apache.spark.sql.catalyst.catalog.ExternalCatalog is extended for various metastore providers. One implementation uses Apache Derby and another the Apache Hive metastore, but anyone could extend this class to make Apache Spark use yet another metastore as well."
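The two storage locations can be seen from the user-facing API. A hedged sketch, assuming `clientDs` is one of the Datasets read earlier and a running SparkSession named `spark`:

```scala
// Session-scoped: registered in the SessionCatalog inside this SparkSession,
// gone when the session ends.
clientDs.createOrReplaceTempView("clients")

// Persistent: written through the ExternalCatalog into the metastore
// (Derby-backed by default, Hive if enabled).
clientDs.write.saveAsTable("clients_persistent")

// Both entries show up in the catalog listing, with isTemporary flagged
// accordingly.
spark.catalog.listTables().show()
```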
----------
Unresolved plan
"An unresolved plan is basically the first tree created from either SQL statements or the relational API of DataFrames and Datasets. It is mainly composed of subtypes of LeafExpression, bound together by Expression objects, therefore forming a tree of TreeNode objects, since all of these are subtypes of TreeNode. Overall, this data structure is a logical plan and is therefore reflected as a LogicalPlan object. Note that LogicalPlan extends QueryPlan, and QueryPlan itself is again a TreeNode. In other words, a LogicalPlan is nothing else than a set of TreeNode objects."
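The plan stages can be inspected directly on a DataFrame through its QueryExecution. A sketch, assuming a temporary view named "clients" has been registered:

```scala
// Assumes a SparkSession `spark` and a registered view "clients".
val df = spark.sql("SELECT name FROM clients")

// The parsed, still-unresolved LogicalPlan (column and table references
// not yet bound against the catalog):
println(df.queryExecution.logical)

// The same tree after the analyzer has resolved it against the catalog:
println(df.queryExecution.analyzed)

// Or print all stages (parsed, analyzed, optimized, physical) at once:
df.explain(true)
```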
(Figures B05868_03_01.png and B05868_03_02.png: inheritance tree of TreeNode)