This is the code repository for Mastering Apache Spark 2.x - Second Edition, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.
About the Book
Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and SQL. This book aims to take your limited knowledge of Spark to the next level by teaching you how to expand Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform.
Instructions and Navigation
All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.
val clientDsParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/client.parquet").as[Client]
val clientDsBigParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/client_big.parquet").as[Client]
val accountDsParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/account.parquet").as[Account]
val accountDsBigParquet = spark.read.parquet("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/account_big.parquet").as[Account]
val clientDsBigCsv = spark.read.csv("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter3/client_big.csv").as[Client]
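The typed reads above assume `Client` and `Account` case classes whose fields match the Parquet/CSV schemas. The actual field names and types depend on the book's data files; a hypothetical sketch of what these definitions might look like:

```scala
// Hypothetical case classes for the typed .as[Client] / .as[Account] calls
// above -- the real field names and types come from the book's data files.
case class Client(clientId: Long, name: String, income: Double)
case class Account(accountId: Long, clientId: Long, balance: Double)

// The .as[...] conversion needs the implicit Encoders from the session:
// import spark.implicits._
```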
// Rely on a type erasure hack to pass RDD[InternalRow] back as RDD[Row]
JDBCRDD.scanTable(
  sparkSession.sparkContext,
  schema,
  requiredColumns,
  filters,
  parts,
  jdbcOptions).asInstanceOf[RDD[Row]]
}
--
JDBCRDD.scala
"We'll skip the scanTable method for now since it just parameterizes and creates a new JDBCRDD object. The most interesting method in JDBCRDD is compute, which it inherits from the abstract RDD class. Through the compute method Apache Spark tells this RDD to leave lazy mode and materialize itself. We'll show you two important fragments of this method after we've had a look at the method signature."
"Here you can see that the return type is Iterator, which allows a lazy underlying data source to be read lazily as well. As we'll see shortly, this is the case for this particular implementation too."
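Since JDBCRDD extends RDD[InternalRow], its materialization entry point is the compute override. A paraphrased sketch of the signature (the real implementation in the Spark sources opens a JDBC connection and streams the result set):

```scala
// Paraphrased from the Spark sources -- JDBCRDD extends RDD[InternalRow],
// so compute must return an Iterator[InternalRow] for the given partition.
// The body shown here is a placeholder, not the actual implementation.
class JDBCRDDSketch /* extends RDD[InternalRow] */ {
  def compute(thePart: org.apache.spark.Partition,
              context: org.apache.spark.TaskContext)
      : Iterator[org.apache.spark.sql.catalyst.InternalRow] = {
    // open a JDBC connection, run the pruned/filtered query,
    // and wrap the ResultSet in a lazy iterator
    Iterator.empty
  }
}
```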
val sqlText = s"SELECT $columnList FROM ${options.table} $myWhereClause"
"Note that the SQL statement created and stored in the sqlText constant references two interesting variables: columnList and myWhereClause. Both are derived from the requiredColumns and filters arguments passed to the JDBCRelation class. Therefore this data source can be called a smart source, because the underlying storage technology - a SQL database in this case - can be told to return only the columns and rows that are actually requested. And as already mentioned, the data source also supports pushing lazy data access patterns down to the underlying database. Here you can see that the JDBC result set is wrapped into a typed InternalRow iterator, Iterator[InternalRow]. Since this matches the return type of the compute method, we are done here."
val rowsIterator = JdbcUtils.resultSetToSparkInternalRows(rs, schema, inputMetrics)
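To make the column-pruning and filter-pushdown idea concrete, here is a heavily simplified, self-contained sketch of how a SQL string could be assembled from requiredColumns and filters. The real logic lives in JDBCRDD and additionally handles identifier quoting, SQL dialects, and the full Filter class hierarchy; the filter representation below is a simplification for illustration:

```scala
// Simplified sketch of column pruning + filter pushdown, NOT the real
// JDBCRDD code: filters are reduced here to simple (column, value)
// equality pairs for illustration.
def buildQuery(table: String,
               requiredColumns: Array[String],
               filters: Array[(String, Any)]): String = {
  // If no columns are requested, select a constant so row counts still work
  val columnList =
    if (requiredColumns.isEmpty) "1" else requiredColumns.mkString(", ")
  val whereClause =
    if (filters.isEmpty) ""
    else " WHERE " + filters
      .map { case (col, value) => s"$col = '$value'" }
      .mkString(" AND ")
  s"SELECT $columnList FROM $table$whereClause"
}
```

For example, `buildQuery("clients", Array("name"), Array(("clientId", 1)))` would yield `SELECT name FROM clients WHERE clientId = '1'`, so the database only ships the requested column and rows.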
"Internally, Apache Spark uses the class org.apache.spark.sql.catalyst.catalog.SessionCatalog for managing temporary views as well as persistent tables. Temporary views are stored in the SparkSession object, whereas persistent tables are stored in an external metastore. The abstract base class org.apache.spark.sql.catalyst.catalog.ExternalCatalog is extended for various metastore providers. One implementation uses Apache Derby and another the Apache Hive metastore, but anyone could extend this class to make Apache Spark use yet another metastore as well."
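The two storage locations can be seen from the user-facing API. A hedged sketch, assuming `clientDs` is one of the Datasets read earlier and a running SparkSession named `spark`:

```scala
// Session-scoped: registered in the SessionCatalog inside this SparkSession,
// gone when the session ends.
clientDs.createOrReplaceTempView("clients")

// Persistent: written through the ExternalCatalog into the metastore
// (Derby-backed by default, Hive if enabled).
clientDs.write.saveAsTable("clients_persistent")

// Both entries show up in the catalog listing, with isTemporary flagged
// accordingly.
spark.catalog.listTables().show()
```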
----------
Unresolved plan
"An unresolved plan is basically the first tree created from either SQL statements or the relational API of DataFrames and Datasets. It is mainly composed of subtypes of LeafExpression, bound together by Expression objects, therefore forming a tree of TreeNode objects, since all of these are subtypes of TreeNode. Overall, this data structure is a logical plan and is therefore reflected as a LogicalPlan object. Note that LogicalPlan extends QueryPlan, and QueryPlan itself is again a TreeNode. In other words, a LogicalPlan is nothing else than a set of TreeNode objects."
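The plan stages can be inspected directly on a DataFrame through its QueryExecution. A sketch, assuming a temporary view named "clients" has been registered:

```scala
// Assumes a SparkSession `spark` and a registered view "clients".
val df = spark.sql("SELECT name FROM clients")

// The parsed, still-unresolved LogicalPlan (column and table references
// not yet bound against the catalog):
println(df.queryExecution.logical)

// The same tree after the analyzer has resolved it against the catalog:
println(df.queryExecution.analyzed)

// Or print all stages (parsed, analyzed, optimized, physical) at once:
df.explain(true)
```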
(Figures B05868_03_01.png and B05868_03_02.png: inheritance tree of TreeNode)