Lesson 57: Spark SQL on Hive, configuration and hands-on examples


无量天尊Spark, posted 2016-6-13 11:39:18


I. Spark on Hive configuration

Change to Spark's conf directory, create hive-site.xml there with vi hive-site.xml, and enter the following content:

<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
<description>Thrift URI for the remote metastore. Used by the metastore client to connect to the remote metastore.</description>
</property>
</configuration>


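Using Spark SQL against Hive really means treating Hive as the data warehouse. A data warehouse holds both metadata and the data itself, and reaching the actual data requires going through its metadata first. That is why configuring hive.metastore.uris is all that is needed. (It does not have to be configured on every machine.)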
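Since the whole setup hinges on that single property, it can be worth checking what a given hive-site.xml actually declares. The following is a minimal sketch in plain Scala using only the JDK's XML parser; findProperty and textOf are hypothetical helpers written for this note, not part of Spark or Hive, and the sample text simply repeats the configuration above.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: text of the first child element <tag> under parent, if any.
def textOf(parent: Element, tag: String): Option[String] = {
  val nodes = parent.getElementsByTagName(tag)
  if (nodes.getLength == 0) None else Some(nodes.item(0).getTextContent.trim)
}

// Hypothetical helper: looks up one property in the text of a
// Hadoop-style configuration file such as hive-site.xml.
def findProperty(xml: String, key: String): Option[String] = {
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  val doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
  val props = doc.getElementsByTagName("property")
  (0 until props.getLength)
    .map(i => props.item(i).asInstanceOf[Element])
    .find(p => textOf(p, "name") == Some(key))
    .flatMap(p => textOf(p, "value"))
}

// Same content as the hive-site.xml shown in the post.
val sample =
  """<configuration>
    |  <property>
    |    <name>hive.metastore.uris</name>
    |    <value>thrift://master:9083</value>
    |  </property>
    |</configuration>""".stripMargin

println(findProperty(sample, "hive.metastore.uris"))
```

To check a real file, the sample string could be replaced with the file's contents, for example read through scala.io.Source.fromFile.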

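II. Start the cluster

1) Start the HDFS service: start-dfs.sh

2) Start the Hive metastore service in the background: hive --service metastore > metastore.log 2>&1 &

3) Start the Spark services: start-all.sh (the script in Spark's sbin directory, which shares its name with Hadoop's)

4) Start spark-shell: ./spark-shell --master spark://master:7077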
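If spark-shell later fails to reach Hive, a common first check is whether the services started above are actually listening. The sketch below is one way to probe that from Scala with a plain TCP connection; isListening is a hypothetical helper, and the host name master and the ports 9083 and 7077 are taken from the commands and configuration in this post.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper: true if a TCP connection to host:port succeeds within timeoutMs.
def isListening(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    // Any failure (unknown host, refused connection, timeout) counts as "not listening".
    Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
  } finally {
    socket.close()
  }
}

// Ports follow the post: Hive metastore on 9083, Spark master on 7077.
Seq("hive metastore" -> 9083, "spark master" -> 7077).foreach { case (name, port) =>
  println(s"$name on master:$port reachable: ${isListening("master", port)}")
}
```

A successful connection only shows that something accepts connections on the port; it does not prove the service behind it is healthy.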

III. Hands-on examples

1) Spark on Hive, run in spark-shell

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use hive") // switch to the database named hive
hiveContext.sql("show tables").collect.foreach(println) // list the tables in that database
hiveContext.sql("select count(*) from sogouq1").collect.foreach(println) // note: this query is executed by Spark's engine
hiveContext.sql("select count(*) from sogouq2 where website like '%baidu%'").collect.foreach(println)
hiveContext.sql("select count(*) from sogouq2 where s_seq=1 and c_seq=1 and website like '%baidu%'").collect.foreach(println)
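HiveContext is the Spark 1.x entry point. From Spark 2.0 onward it is deprecated in favor of SparkSession with Hive support enabled, so readers on a newer version may find a sketch like the following closer to what they need. It assumes a Spark 2.x or later build with Hive support, submission through spark-submit (which supplies the master URL), and the metastore address, hive database and sogouq1 table used in this post; the object and application names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application name; not part of the original lesson.
object SparkOnHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkOnHiveSketch")
      // Assumption: set in code here; with hive-site.xml in Spark's conf
      // directory, as configured above, this line can be dropped.
      .config("hive.metastore.uris", "thrift://master:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("use hive")                              // database name from the post
    spark.sql("show tables").show()                    // list tables in that database
    spark.sql("select count(*) from sogouq1").show()   // executed by Spark's engine

    spark.stop()
  }
}
```

The queries themselves are unchanged; only the entry point differs between the two API generations.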


2) Example code that does not rely on Hive, run in spark-shell

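scala> sqlContext
res8: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@35920655
(Many HiveContext instances can be created. A HiveContext connects to the data warehouse, while the program itself consists of Spark jobs that run in parallel. If those jobs all read Hive data through one shared handle, contention for that handle becomes troublesome, so multiple instances are used. From the query point of view, having several instances is also perfectly normal.)
val df = sqlContext.read.json("library/examples/src/main/resources/people.json") // read the JSON data
df.show()
df.printSchema()
df.select("name").show()
df.select(df("name"), df("age") + 1).show()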
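The same DataFrame can also be queried with SQL once it is registered as a temporary table, which ties this example back to the SQL style used in the Hive section. A minimal sketch, assuming a Spark 1.x spark-shell session where sc and sqlContext are predefined; the records are inlined sample data in the same shape as people.json so the snippet does not depend on a file path, and the table name people and column alias age_next_year are chosen here for illustration.

```scala
// Assumes Spark 1.x spark-shell: sc and sqlContext already exist.
val people = sc.parallelize(Seq(
  """{"name":"Michael"}""",
  """{"name":"Andy", "age":30}""",
  """{"name":"Justin", "age":19}"""))
val peopleDf = sqlContext.read.json(people)   // schema is inferred from the JSON records

// Spark 1.x API; Spark 2.x and later use createOrReplaceTempView instead.
peopleDf.registerTempTable("people")

// SQL over the temporary table
val adults = sqlContext.sql(
  "select name, age + 1 as age_next_year from people where age > 20")
adults.show()

// The equivalent query through the DataFrame DSL
peopleDf.filter(peopleDf("age") > 20)
  .select(peopleDf("name"), (peopleDf("age") + 1).as("age_next_year"))
  .show()
```

Rows with a missing age are dropped by the age > 20 filter in both forms, because a comparison against null is not true.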


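Note: these study notes come from the DT大数据梦工厂 (DT Big Data Dream Factory) WeChat public account DT_Spark; permanent YY live channel, every evening at 8 pm: 68917580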