OP: ReneeBK

How to use SparkSession in Apache Spark 2.0


Generally, a session is an interaction between two or more entities. In computer parlance, its usage is prominent in the realm of networked computers on the internet. First came the TCP session, then the login session, followed by HTTP and user sessions, so it is no surprise that we now have SparkSession, introduced in Apache Spark 2.0.

Beyond being a time-bounded interaction, SparkSession provides a single point of entry for interacting with underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. Most importantly, it curbs the number of concepts and constructs a developer has to juggle while interacting with Spark.

In this blog and its accompanying Databricks notebook, we will explore SparkSession functionality in Spark 2.0.

Hidden content in this post (attachment):

How to use SparkSession in Apache Spark 2.0.pdf (428.77 KB)



Keywords: Apache Spark, Spark, session

#2 ReneeBK, posted 2017-5-28 02:05:52
Creating a SparkSession

In previous versions of Spark, you had to create a SparkConf and a SparkContext to interact with Spark, as shown here:

import org.apache.spark.{SparkConf, SparkContext}

// set up the Spark configuration and create the contexts
// (configuration options are set on the SparkConf, not on the SparkContext)
val sparkConf = new SparkConf().setAppName("SparkSessionZipsExample").setMaster("local").set("spark.some.config.option", "some-value")
// your handle to SparkContext, used to access other contexts such as SQLContext
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

#3 ReneeBK, posted 2017-5-28 02:08:18
Whereas in Spark 2.0 the same effects can be achieved through SparkSession, without explicitly creating a SparkConf, SparkContext or SQLContext, as they are all encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.

import org.apache.spark.sql.SparkSession

// Create a SparkSession. There is no need to create a SparkContext:
// you automatically get one as part of the SparkSession.
val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

At this point you can use the spark variable as your instance object to access its public methods and instance fields for the duration of your Spark job.
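As a minimal sketch of that idea (assuming the spark value built above), here are a few things the session handle gives you access to:

// the session exposes version info, the underlying SparkContext, and implicit conversions
println(spark.version)                 // Spark version string
println(spark.sparkContext.appName)    // the SparkContext created for this session
import spark.implicits._               // enables .toDF/.toDS and the $"column" syntax
// spark.stop()                        // stops the session (and its SparkContext) when the job is done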

#4 ReneeBK, posted 2017-5-28 02:08:52
Configuring Spark’s Runtime Properties

Once the SparkSession is instantiated, you can configure Spark’s runtime config properties. For example, in this code snippet we alter the existing runtime config options. Since configMap is a collection, you can use all of Scala’s iterable methods to access the data.

// set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
// get all settings
val configMap: Map[String, String] = spark.conf.getAll
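As an illustration of treating configMap as an ordinary Scala collection, here is a small sketch (assuming the spark session and configMap defined above):

// print every runtime setting, sorted by key
configMap.toSeq.sortBy(_._1).foreach { case (key, value) => println(s"$key = $value") }
// read a single setting back, supplying a default if it was never set
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions", "200")
println(s"shuffle partitions: $shufflePartitions")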

#5 ReneeBK, posted 2017-5-28 02:09:29
Accessing Catalog Metadata

Often, you may want to access and peruse the underlying catalog metadata. SparkSession exposes "catalog" as a public instance that contains methods that work with the metastore (i.e., the data catalog). Since these methods return a Dataset, you can use the Dataset API to access or view the data. In this snippet, we access the table names and the list of databases.

// fetch metadata from the catalog
spark.catalog.listDatabases.show(false)
spark.catalog.listTables.show(false)
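Because these calls return Datasets, the usual Dataset operations apply. A minimal sketch follows; the field names used here (name, isTemporary) are assumptions based on the catalog's Database and Table metadata classes:

// collect just the database names into a local Scala collection
val dbNames = spark.catalog.listDatabases.collect().map(_.name)
println(dbNames.mkString(", "))
// keep only temporary views among the listed tables
spark.catalog.listTables.filter(_.isTemporary).show(false)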

#6 ReneeBK, posted 2017-5-28 02:10:28
Creating Datasets and DataFrames

There are a number of ways to create DataFrames and Datasets using the SparkSession APIs.

One quick way to generate a Dataset is with the spark.range method. This proves useful when you are learning to manipulate a Dataset with its API. For example,

import org.apache.spark.sql.functions.desc

// create a Dataset using spark.range, starting at 5 and going up to 100 in increments of 5
val numDS = spark.range(5, 100, 5)
// reverse the order and display the first 5 items
numDS.orderBy(desc("id")).show(5)
// compute descriptive stats and display them
numDS.describe().show()
// create a DataFrame using spark.createDataFrame from a List or Seq
val langPercentDF = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
// rename the columns
val lpDF = langPercentDF.withColumnRenamed("_1", "language").withColumnRenamed("_2", "percent")
// order the DataFrame in descending order of percentage
lpDF.orderBy(desc("percent")).show(false)
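Another route among those "number of ways" is building a typed Dataset from a case class via the session's implicits. The following is only a sketch; the Language case class and its sample values are made up for illustration:

// a hypothetical case class used only for this illustration
case class Language(name: String, percent: Int)

import spark.implicits._   // brings in the encoders needed for case classes
// create a typed Dataset directly from a local Seq of case class instances
val langDS = spark.createDataset(Seq(Language("Scala", 35), Language("Python", 30)))
langDS.filter(_.percent > 30).show(false)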

#7 ReneeBK, posted 2017-5-28 02:11:23
Reading JSON Data with the SparkSession API

Like any Scala object, you can use spark, the SparkSession object, to access its public methods and instance fields. You can read a JSON, CSV or text file, or read a Parquet table. For example, in this code snippet, we read a JSON file of zip codes, which returns a DataFrame, a collection of generic Rows.

// read the JSON file and create the DataFrame
val jsonFile = args(0)
val zipsDF = spark.read.json(jsonFile)
// filter all cities whose population > 40K
zipsDF.filter(zipsDF.col("pop") > 40000).show(10)
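The other formats mentioned above follow the same reader pattern. A minimal sketch, with placeholder file paths that are not from the original post:

// the same DataFrameReader handles other formats; these paths are hypothetical
val csvDF     = spark.read.option("header", "true").csv("/path/to/zips.csv")
val textDF    = spark.read.text("/path/to/notes.txt")       // single string column named "value"
val parquetDF = spark.read.parquet("/path/to/zips.parquet")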

#8 ReneeBK, posted 2017-5-28 02:11:44
Using Spark SQL with SparkSession

Through SparkSession, you can access all of the Spark SQL functionality just as you would through SQLContext. In the code sample below, we create a table against which we issue SQL queries.

// Now create a SQL table and issue SQL queries against it,
// without using the sqlContext but through the SparkSession object.
// Create a temporary view of the DataFrame.
zipsDF.createOrReplaceTempView("zips_table")
zipsDF.cache()
val resultsDF = spark.sql("SELECT city, pop, state, zip FROM zips_table")
resultsDF.show(10)
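Anything expressible in Spark SQL works the same way through spark.sql. For instance, a hedged sketch of an aggregate query over the same temporary view (column names taken from the query above):

// total population per state, largest first (uses the zips_table view registered above)
val popByStateDF = spark.sql(
  """SELECT state, SUM(pop) AS total_pop
    |FROM zips_table
    |GROUP BY state
    |ORDER BY total_pop DESC""".stripMargin)
popByStateDF.show(10)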

#9 ReneeBK, posted 2017-5-28 02:13:39
Saving to and Reading from a Hive Table with SparkSession

Next, we are going to create a Hive table and issue queries against it using the SparkSession object, just as you would with a HiveContext.

// drop the table if it exists to get around the "table already exists" error
spark.sql("DROP TABLE IF EXISTS zips_hive_table")
// save as a Hive table
spark.table("zips_table").write.saveAsTable("zips_hive_table")
// make a similar query against the Hive table
val resultsHiveDF = spark.sql("SELECT city, pop, state, zip FROM zips_hive_table WHERE pop > 40000")
resultsHiveDF.show(10)
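Once saved, the Hive table can also be read back through the DataFrame API rather than SQL. A minimal sketch, assuming the session and table created above:

// spark.table returns the Hive table as a DataFrame, so the usual DataFrame API applies
val zipsHiveDF = spark.table("zips_hive_table")
zipsHiveDF.filter(zipsHiveDF.col("pop") > 40000)
  .select("city", "pop", "state", "zip")
  .show(10)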

#10 白衣白衣, posted 2017-5-29 20:16:39
Let me take a look, thanks.
