OP: NewOccidental

Spark Cookbook [Promotion Rewards]


What You Will Learn

  • Install and configure Apache Spark with various cluster managers
  • Set up development environments
  • Perform interactive queries using Spark SQL
  • Get to grips with real-time streaming analytics using Spark Streaming
  • Master supervised and unsupervised learning using MLlib
  • Build a recommendation engine using MLlib
  • Develop a set of common applications or project types, and solutions that solve complex big data problems
  • Use Apache Spark as your single big data compute platform and master its libraries

In Detail

By introducing in-memory persistent storage, Apache Spark eliminates the need to store intermediate data in filesystems, thereby increasing processing speed by up to 100 times.

This book focuses on how to analyze large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will set up development environments. You will then work through recipes for interactive queries using Spark SQL and for real-time streaming from sources such as the Twitter stream and Apache Kafka. The book then turns to machine learning, covering supervised learning, unsupervised learning, and recommendation engine algorithms. After graph processing with GraphX, it closes with recipes for cluster optimization and troubleshooting.

Authors

Rishi Yadav

Rishi Yadav has 17 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree at the prestigious Indian Institute of Technology (IIT) Delhi in 1998.

About 10 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into their data.

InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest-growing companies for 4 years in a row. InfoObjects was also awarded #1 Best Place to Work in the Bay Area in 2014 and 2015.

Rishi is an open source contributor and active blogger.

Hidden content in this post:

Spark Cookbook.pdf (5.23 MB, requires: 20 forum coins)


Keywords: Cookbook Spark Book Cook Park streaming learning complex common engine


#2
Nicolle (student verified), posted 2015-9-12 12:12:29
Note: the author has been banned or deleted; the content is automatically hidden.


#3
sunnyyyyy123, posted 2015-9-12 12:34:07

Loading data from HDFS

How to do it...

Let's do the word count, which counts the number of occurrences of each word. In this recipe, we will load data from HDFS:

  1. Create the words directory by using the following command:
     $ mkdir words
  2. Change the directory to words:
     $ cd words
  3. Create the sh.txt text file and enter "to be or not to be" in it:
     $ echo "to be or not to be" > sh.txt
  4. Start the Spark shell:
     $ spark-shell
  5. Load the words directory as the RDD:
     scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
     The sc.textFile method also supports passing an additional argument for the number of partitions. By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block. You can ask for a higher number of partitions; this works well for compute-intensive jobs such as machine learning. As one partition cannot contain more than one block, having fewer partitions than blocks is not allowed.
  6. Count the number of lines (the result will be 1):
     scala> words.count
  7. Divide the line (or lines) into multiple words:
     scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
  8. Convert each word to (word, 1), that is, output 1 as the value for each occurrence of word as a key:
     scala> val wordsMap = wordsFlatMap.map(w => (w, 1))
  9. Use the reduceByKey method to add the number of occurrences of each word as a key (this function works on two consecutive values at a time, represented by a and b):
     scala> val wordCount = wordsMap.reduceByKey((a, b) => a + b)
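As a quick consolidation (not part of the book's recipe), the shell steps above can be chained into a single pipeline. This sketch assumes spark-shell provides sc as usual and that the local words directory has already been copied to HDFS (for example with hdfs dfs -put words /user/hduser/words, mirroring the currency recipe in a later reply), since sc.textFile here reads the hdfs:// path rather than the local directory:

// Consolidated word count in spark-shell; `sc` is the SparkContext the shell provides.
val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")

val wordCount = words
  .flatMap(_.split("\\W+"))      // split each line on non-word characters
  .filter(_.nonEmpty)            // drop empty tokens from leading/trailing delimiters
  .map(w => (w, 1))              // pair every word with a count of 1
  .reduceByKey(_ + _)            // sum the counts per word

// Collecting is fine here because the data set is a single short line.
wordCount.collect.foreach(println)   // e.g. (or,1), (not,1), (to,2), (be,2)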



#5
Crsky7, posted 2015-9-12 12:51:07

Loading data from HDFS using a custom InputFormat

How to do it...

We are going to load text data in key-value format into Spark using KeyValueTextInputFormat:

  1. Create the currency directory by using the following command:
     $ mkdir currency
  2. Change the current directory to currency:
     $ cd currency
  3. Create the na.txt text file and enter currency values in key-value format, delimited by a tab (key: country, value: currency):
     $ vi na.txt
     United States of America    US Dollar
     Canada    Canadian Dollar
     Mexico    Peso
     You can create more files for each continent.
  4. Upload the currency folder to HDFS:
     $ hdfs dfs -put currency /user/hduser/currency
  5. Start the Spark shell:
     $ spark-shell
  6. Import statements:
     scala> import org.apache.hadoop.io.Text
     scala> import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
  7. Load the currency directory as the RDD:
     scala> val currencyFile = sc.newAPIHadoopFile("hdfs://localhost:9000/user/hduser/currency", classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
  8. Convert it from a tuple of (Text, Text) to a tuple of (String, String):
     scala> val currencyRDD = currencyFile.map(t => (t._1.toString, t._2.toString))
  9. Count the number of elements in the RDD:
     scala> currencyRDD.count
  10. Print the values:
     scala> currencyRDD.collect.foreach(println)
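The same steps, consolidated into one spark-shell snippet (a sketch assuming the currency directory has been uploaded to HDFS as in step 4; the remark about copying Hadoop Text objects is an added caution, not part of the recipe):

// Read tab-delimited key-value text with the new-API KeyValueTextInputFormat.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

// newAPIHadoopFile takes the path, the InputFormat class, and the key and value classes.
val currencyFile = sc.newAPIHadoopFile(
  "hdfs://localhost:9000/user/hduser/currency",
  classOf[KeyValueTextInputFormat],
  classOf[Text],
  classOf[Text])

// Hadoop may reuse the same Text instances across records, so copy them into
// immutable Strings before caching or shuffling the pairs.
val currencyRDD = currencyFile.map { case (k, v) => (k.toString, v.toString) }

currencyRDD.count                      // number of (country, currency) records
currencyRDD.collect.foreach(println)   // e.g. (Canada,Canadian Dollar)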


#6
东风夏日, posted 2015-9-12 13:20:41

Loading data from Apache Cassandra

How to do it...

Perform the following steps to load data from Cassandra:

  1. Create a keyspace named people in Cassandra using the CQL shell:
     cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
  2. Create a column family (from CQL 3.0 onwards, it can also be called a table) named person:
     cqlsh> create columnfamily person(id int primary key, first_name varchar, last_name varchar);
  3. Insert a few records into the column family:
     cqlsh> insert into person(id, first_name, last_name) values(1, 'Barack', 'Obama');
     cqlsh> insert into person(id, first_name, last_name) values(2, 'Joe', 'Smith');
  4. Add the Cassandra connector dependency to SBT:
     "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
  5. Now start the Spark shell.
  6. Set the spark.cassandra.connection.host property in the Spark shell:
     scala> sc.getConf.set("spark.cassandra.connection.host", "localhost")
  7. Import the Cassandra-specific libraries:
     scala> import com.datastax.spark.connector._
  8. Load the person column family as an RDD:
     scala> val personRDD = sc.cassandraTable("people", "person")
  9. Count the number of records in the RDD:
     scala> personRDD.count
  10. Print the data in the RDD:
     scala> personRDD.collect.foreach(println)
  11. Retrieve the first row:
     scala> val firstRow = personRDD.first
  12. Get the column names:
     scala> firstRow.columnNames
  13. Cassandra can also be accessed through Spark SQL. It has a wrapper around SQLContext called CassandraSQLContext; let's load it:
     scala> val cc = new org.apache.spark.sql.cassandra.CassandraSQLContext(sc)
  14. Load the person data as a SchemaRDD:
     scala> val p = cc.sql("select * from people.person")
  15. Retrieve the person data:
     scala> p.collect.foreach(println)
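A minimal sketch of the RDD path in one place. It assumes a connector 1.2.x build for Scala 2.10 and that the connection host is supplied when the shell is launched, which avoids relying on changing the configuration after the SparkContext already exists; the --packages coordinates and the typed column getters are assumptions based on the DataStax connector API, not steps from the recipe:

// Start the shell with the connector on the classpath and the host preconfigured, e.g.:
//   spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.2.0 \
//               --conf spark.cassandra.connection.host=localhost
import com.datastax.spark.connector._

// people.person becomes an RDD of CassandraRow objects.
val personRDD = sc.cassandraTable("people", "person")

personRDD.count                        // number of rows inserted above
personRDD.collect.foreach(println)     // print each CassandraRow

val firstRow = personRDD.first
firstRow.columnNames                   // column names of the row
firstRow.getInt("id")                  // typed access to individual columns
firstRow.getString("first_name")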


#7
pzh_hzp, posted 2015-9-12 13:48:30

Loading data from relational databases

How to do it...

Perform the following steps to load data from relational databases:

  1. Create a table named person in MySQL using the following DDL:
     CREATE TABLE `person` (
       `person_id` int(11) NOT NULL AUTO_INCREMENT,
       `first_name` varchar(30) DEFAULT NULL,
       `last_name` varchar(30) DEFAULT NULL,
       `gender` char(1) DEFAULT NULL,
       PRIMARY KEY (`person_id`)
     );
  2. Insert some data:
     INSERT INTO person(first_name, last_name, gender) VALUES ('Barack', 'Obama', 'M');
     INSERT INTO person(first_name, last_name, gender) VALUES ('Bill', 'Clinton', 'M');
     INSERT INTO person(first_name, last_name, gender) VALUES ('Hillary', 'Clinton', 'F');
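The excerpt stops at populating the table; as a sketch of reading it back into Spark with the Spark 1.x JdbcRDD, assuming the MySQL JDBC driver jar is on the shell's classpath and using a placeholder database name, credentials, and key bounds:

// Read the person table into an RDD with JdbcRDD (Spark 1.x era API).
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val url  = "jdbc:mysql://localhost:3306/hadoopdb"   // placeholder database name
val user = "hduser"                                 // placeholder credentials
val pass = "secret"

val personRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url, user, pass),
  // JdbcRDD requires exactly two bind parameters that bound each partition's key range.
  "SELECT first_name, last_name, gender FROM person WHERE person_id >= ? AND person_id <= ?",
  1, 1000, 3,                                       // lowerBound, upperBound, numPartitions
  (r: ResultSet) => (r.getString("first_name"), r.getString("last_name"), r.getString("gender")))

personRDD.collect.foreach(println)   // e.g. (Barack,Obama,M)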


#8
ryoeng (employment verified), posted 2015-9-12 14:33:49
Note: the author has been banned or deleted; the content is automatically hidden.


#9
auirzxp (student verified), posted 2015-9-12 14:49:36


#10
li_mao, posted 2015-9-12 15:22:32
Just taking a look.

