Install and configure Apache Spark with various cluster managers
Set up development environments
Perform interactive queries using Spark SQL
Get to grips with real-time streaming analytics using Spark Streaming
Master supervised learning and unsupervised learning using MLlib
Build a recommendation engine using MLlib
Develop a set of common applications and project types, along with solutions to complex big data problems
Use Apache Spark as your single big data compute platform and master its libraries
In Detail
By persisting intermediate data in memory, Apache Spark eliminates the need to write it to the filesystem, increasing processing speed by up to 100 times.
This book focuses on how to analyze large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will set up development environments. You will then work through recipes for performing interactive queries using Spark SQL and real-time streaming with sources such as Twitter Stream and Apache Kafka, before turning to machine learning, including supervised learning, unsupervised learning, and recommendation engine algorithms. After mastering graph processing using GraphX, you will cover recipes for cluster optimization and troubleshooting.
Authors
Rishi Yadav
Rishi Yadav has 17 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He finished his bachelor's degree at the prestigious Indian Institute of Technology (IIT) Delhi in 1998.
About 10 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data.
InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest-growing companies for four years in a row. InfoObjects was also named the #1 best place to work in the Bay Area in 2014 and 2015.
Rishi is an open source contributor and active blogger.
Let's do the word count, which counts the number of occurrences of each word. In this recipe,
we will load data from HDFS:
1. Create the words directory by using the following command:
$ mkdir words
2. Change the directory to words:
$ cd words
3. Create the sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > sh.txt
4. Start the Spark shell:
$ spark-shell
5. Load the words directory as the RDD:
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
The sc.textFile method also accepts an additional argument for the number of partitions. By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block. You can ask for a higher number of partitions, which works well for compute-intensive jobs such as machine learning. Because one partition cannot contain more than one block, you cannot have fewer partitions than blocks.
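For example, here is a small sketch of requesting more partitions, assuming the same HDFS path as step 5; the partition count of 10 is only an illustration, not a value from the recipe:
scala> val words10 = sc.textFile("hdfs://localhost:9000/user/hduser/words", 10)
scala> words10.partitions.length
The second command returns the number of partitions Spark actually created for the RDD.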
6. Count the number of lines (the result will be 1):
scala> words.count
7. Divide the line (or lines) into multiple words:
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. Convert word to (word,1)—that is, output 1 as a value for each occurrence of word as a key:
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
9. Use the reduceByKey method to add up the number of occurrences of each word (this function works on two consecutive values at a time, represented by a and b):
scala> val wordCount = wordsMap.reduceByKey( (a,b) => (a+b))
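To inspect the result, you can collect the pairs to the driver and print them (a small sketch; collect is fine for this tiny dataset but should be avoided for large RDDs):
scala> wordCount.collect.foreach(println)
For the input "to be or not to be", this prints (to,2), (be,2), (or,1), and (not,1), though not necessarily in that order.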