SparkSession - a new entry point

janyiyi posted on 2016-9-19 14:24:54

We have been getting a lot of questions about the relationship between SparkContext, SQLContext, and HiveContext in Spark 1.x. It was really strange to have "HiveContext" as an entry point when people wanted to use the DataFrame API. In Spark 2.0, we are introducing SparkSession, a new entry point that subsumes SQLContext and HiveContext. For backward compatibility, the two older contexts are preserved. SparkSession has many features, and here we demonstrate some of the more important ones.
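
For reference, the legacy entry points remain reachable from the new one; a minimal sketch, assuming a SparkSession instance named spark (as created below):

// Both legacy entry points are still accessible for backward compatibility
val sc = spark.sparkContext        // the underlying SparkContext
val sqlContext = spark.sqlContext  // a SQLContext backed by this session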

While this notebook is written in Scala, similar (actually almost identical) APIs exist in Python and Java.

To read the companion blog post, click here: https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

Creating a SparkSession

A SparkSession can be created using a builder pattern. The builder automatically reuses an existing SparkContext if one exists, and creates a new one otherwise. Configuration options set on the builder are automatically propagated to Spark and Hadoop during I/O.

// A SparkSession can be created using a builder pattern
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

import org.apache.spark.sql.SparkSession
sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@46d6b87c


In Databricks notebooks and the Spark REPL, a SparkSession is created automatically and assigned to the variable "spark".

spark

res9: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@46d6b87c

Unified entry point for reading data

SparkSession is the entry point for reading data, similar to the old SQLContext.read.

val jsonData = spark.read.json("/home/webinar/person.json")

jsonData: org.apache.spark.sql.DataFrame = [email: string, iq: bigint ... 1 more field]
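
The same unified reader handles other built-in sources through the same interface; a brief sketch (the paths here are hypothetical):

// Parquet and CSV go through the same reader interface (hypothetical paths)
val parquetData = spark.read.parquet("/home/webinar/person.parquet")
val csvData = spark.read
  .option("header", "true")
  .csv("/home/webinar/person.csv")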

display(jsonData)

email                 iq   name
matei@databricks.com  180  Matei Zaharia
rxin@databricks.com   80   Reynold Xin

Running SQL queries

SparkSession can be used to execute SQL queries over data, getting the results back as a DataFrame (i.e. Dataset[Row]).
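
For the query below to resolve a table named person, that table must already exist in the catalog (the listing further down shows it as a managed table). If all you have is a DataFrame, registering a temporary view is one way to make it queryable; a minimal sketch using the jsonData DataFrame from above:

// Register the DataFrame as a temporary view so SQL can reference it by name
jsonData.createOrReplaceTempView("person")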

display(spark.sql("select * from person"))

email                 iq   name
matei@databricks.com  180  Matei Zaharia
rxin@databricks.com   80   Reynold Xin

Working with config options

SparkSession can also be used to set runtime configuration options, which can affect the behavior of the query optimizer or of I/O (i.e. Hadoop).

spark.conf.set("spark.some.config", "abcd")

res12: org.apache.spark.sql.RuntimeConfig = org.apache.spark.sql.RuntimeConfig@55d93752

spark.conf.get("spark.some.config")

res13: String = abcd
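
The same interface drives standard runtime settings, not just arbitrary keys; a minimal sketch using spark.sql.shuffle.partitions, a built-in Spark SQL config that controls how many partitions shuffles produce:

// Lower the shuffle partition count, e.g. for small datasets
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.conf.get("spark.sql.shuffle.partitions")  // String = 8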

Config options that have been set can also be referenced in SQL queries through variable substitution.

%sql select "${spark.some.config}"

abcd
abcd
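
The same substitution can also be exercised from Scala through spark.sql, assuming the spark.sql.variable.substitute setting is enabled (it is by default):

// ${...} in the SQL text is substituted from the session configuration
spark.sql("select '${spark.some.config}'").show()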

Working with metadata directly

SparkSession also exposes a "catalog" property that provides methods for working with the metastore (i.e. data catalog). These methods return Datasets, so you can use the same Dataset API to explore them.

// To get a list of tables in the current database
val tables = spark.catalog.listTables()

tables: org.apache.spark.sql.Dataset[org.apache.spark.sql.catalog.Table] = [name: string, database: string ... 3 more fields]

display(tables)

name    database  description  tableType  isTemporary
person  default   null         MANAGED    false
smart   default   null         MANAGED    false
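
Since tables is an ordinary Dataset[org.apache.spark.sql.catalog.Table], the usual Dataset operations apply to it directly; a brief sketch filtering the listing down to permanent tables:

// listTables() returns a Dataset, so operations like filter work directly
tables.filter(table => !table.isTemporary).show()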