Install and configure Apache Spark with various cluster managers
Set up development environments
Perform interactive queries using Spark SQL
Get to grips with real-time streaming analytics using Spark Streaming
Master supervised learning and unsupervised learning using MLlib
Build a recommendation engine using MLlib
Develop a set of common applications and project types, along with solutions to complex big data problems
Use Apache Spark as your single big data compute platform and master its libraries
In Detail
By persisting intermediate data in memory, Apache Spark eliminates the need to write it to the filesystem, increasing processing speed by up to 100 times.
This book focuses on how to analyze large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will set up development environments. You will then work through recipes for performing interactive queries using Spark SQL and real-time streaming with sources such as Twitter Stream and Apache Kafka, before turning to machine learning, including supervised learning, unsupervised learning, and recommendation engine algorithms. After mastering graph processing using GraphX, you will cover recipes for cluster optimization and troubleshooting.
Authors
Rishi Yadav
Rishi Yadav has 17 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He finished his bachelor's degree at the prestigious Indian Institute of Technology (IIT) Delhi in 1998.
About 10 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data.
InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest-growing companies for four years in a row. InfoObjects was also named the #1 best place to work in the Bay Area in 2014 and 2015.
Rishi is an open source contributor and active blogger.
Let's do the word count, which counts the number of occurrences of each word. In this recipe,
we will load data from HDFS:
1. Create the words directory by using the following command:
$ mkdir words
2. Change the directory to words:
$ cd words
3. Create the sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > sh.txt
4. Start the Spark shell:
$ spark-shell
5. Load the words directory as the RDD:
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
The sc.textFile method also accepts an additional argument for the number of partitions. By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block. You can ask for a higher number of partitions, which works well for compute-intensive jobs such as machine learning. Because one partition cannot contain more than one block, you cannot have fewer partitions than blocks.
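For example, here is a small sketch of requesting more partitions, assuming the same HDFS path as step 5; the partition count of 10 is only an illustration, not a value from the recipe:
scala> val words10 = sc.textFile("hdfs://localhost:9000/user/hduser/words", 10)
scala> words10.partitions.length
The second command returns the number of partitions Spark actually created for the RDD.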
6. Count the number of lines (the result will be 1):
scala> words.count
7. Divide the line (or lines) into multiple words:
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. Convert word to (word,1)—that is, output 1 as a value for each occurrence of word as a key:
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
9. Use the reduceByKey method to add up the number of occurrences of each word (this function works on two consecutive values at a time, represented by a and b):
scala> val wordCount = wordsMap.reduceByKey( (a,b) => (a+b))
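To inspect the result, you can collect the pairs to the driver and print them (a small sketch; collect is fine for this tiny dataset but should be avoided for large RDDs):
scala> wordCount.collect.foreach(println)
For the input "to be or not to be", this prints (to,2), (be,2), (or,1), and (not,1), though not necessarily in that order.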