We have been getting a lot of questions about the relationship between SparkContext, SQLContext, and HiveContext in Spark 1.x. It was really strange to have "HiveContext" as an entry point when people want to use the DataFrame API. In Spark 2.0, we are introducing SparkSession, a new entry point that subsumes SQLContext and HiveContext. For backward compatibility, the two are preserved. SparkSession has many features, and here we demonstrate some of the more important ones.
While this notebook is written in Scala, similar (actually almost identical) APIs exist in Python and Java.
To read the companion blog post, click here: https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
Creating a SparkSession
A SparkSession can be created using a builder pattern. The builder automatically reuses an existing SparkContext if one exists, and creates a new SparkContext if it does not. Configuration options set in the builder are automatically propagated to Spark and Hadoop during I/O.
// A SparkSession can be created using a builder pattern
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()
import org.apache.spark.sql.SparkSession
sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@46d6b87c
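Note that getOrCreate is idempotent: invoking the builder again hands back a session on the same underlying SparkContext rather than creating a second one. A minimal sketch (not from the original notebook) to illustrate:

// getOrCreate() reuses the SparkContext created above
val sameSession = SparkSession.builder.getOrCreate()
assert(sameSession.sparkContext eq sparkSession.sparkContext)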
In Databricks notebooks and the Spark REPL, a SparkSession has already been created automatically and assigned to the variable "spark".
spark
res9: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@46d6b87c
Unified entry point for reading data
SparkSession is the entry point for reading data, similar to the old SQLContext.read.
val jsonData = spark.read.json("/home/webinar/person.json")
jsonData: org.apache.spark.sql.DataFrame = [email: string, iq: bigint ... 1 more field]
display(jsonData)
email                 | iq  | name
--------------------- | --- | -------------
matei@databricks.com  | 180 | Matei Zaharia
rxin@databricks.com   | 80  | Reynold Xin
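The same reader works for other sources as well. A quick sketch, with hypothetical file paths:

// spark.read supports many formats out of the box; the paths below
// are placeholders, not files from the original notebook.
val csvData = spark.read
  .option("header", "true")
  .csv("/home/webinar/person.csv")
val parquetData = spark.read.parquet("/home/webinar/person.parquet")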
Running SQL queries
SparkSession can be used to execute SQL queries over data, getting the results back as a DataFrame (i.e. Dataset[Row]).
display(spark.sql("select * from person"))
email                 | iq  | name
--------------------- | --- | -------------
matei@databricks.com  | 180 | Matei Zaharia
rxin@databricks.com   | 80  | Reynold Xin
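The query above assumes a "person" table already exists in the metastore. If it did not, a DataFrame could be exposed to SQL as a temporary view first. A sketch using the jsonData DataFrame from above (the view name is arbitrary):

// Register the DataFrame as a temporary view so SQL can see it
jsonData.createOrReplaceTempView("person_view")
val smart = spark.sql("SELECT name, iq FROM person_view WHERE iq > 100")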
Working with config options
SparkSession can also be used to set runtime configuration options, which control optimizer behavior as well as I/O (i.e. Hadoop) behavior.
spark.conf.set("spark.some.config", "abcd")
res12: org.apache.spark.sql.RuntimeConfig = org.apache.spark.sql.RuntimeConfig@55d93752
spark.conf.get("spark.some.config")
res13: String = abcd
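The same API applies to Spark's built-in SQL options. For example, the number of partitions used for shuffles during query execution:

// spark.sql.shuffle.partitions is a built-in Spark SQL config
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.conf.get("spark.sql.shuffle.partitions")  // String = 8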
Config options that have been set can also be referenced in SQL through variable substitution.
%sql select "${spark.some.config}"
abcd
----
abcd
Working with metadata directly
SparkSession also includes a "catalog" property that provides methods for working with the metastore (i.e. the data catalog). These methods return Datasets, so you can use the same Dataset API to explore them.
// To get a list of tables in the current database
val tables = spark.catalog.listTables()
tables: org.apache.spark.sql.Dataset[org.apache.spark.sql.catalog.Table] = [name: string, database: string ... 3 more fields]
display(tables)
name   | database | description | tableType | isTemporary
------ | -------- | ----------- | --------- | -----------
person | default  | null        | MANAGED   | false
smart  | default  | null        | MANAGED   | false
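Other catalog methods follow the same pattern and also return Datasets. For example:

// List databases in the metastore, and columns of the person table
display(spark.catalog.listDatabases())
display(spark.catalog.listColumns("person"))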