Book Description
When people want a way to process Big Data at speed, Spark is invariably the solution. With its ease of development (in comparison to the relative complexity of Hadoop), it’s unsurprising that it’s becoming popular with data analysts and engineers everywhere.
Beginning with the fundamentals, we’ll show you how to get set up with Spark with minimum fuss. You’ll then get to grips with some simple APIs before investigating machine learning and graph processing – throughout we’ll make sure you know exactly how to apply your knowledge.
You will also learn how to use the Spark shell and how to load data, before finding out how to build and run your own Spark applications. Discover how to manipulate your RDDs and get stuck into a range of DataFrame APIs. You'll also learn some useful machine learning algorithms with the help of Spark MLlib, and how to integrate Spark with R. Finally, we'll make sure you're confident and prepared for graph processing as you learn more about the GraphX API.
What You Will Learn
- Install and set up Spark in your cluster
- Prototype distributed applications with Spark's interactive shell
- Perform data wrangling using the new DataFrame APIs
- Get to know the different ways to interact with Spark's distributed representation of data (RDDs)
- Query Spark with a SQL-like query syntax
- See how Spark works with Big Data
- Implement machine learning systems with highly scalable algorithms
- Use R, the popular statistical language, to work with Spark
- Apply interesting graph algorithms and graph processing with GraphX
Author: Krishna Sankar
Krishna Sankar is a Senior Specialist—AI Data Scientist with Volvo Cars, focusing on autonomous vehicles. His earlier stints include Chief Data Scientist at http://cadenttech.tv/, Principal Architect/Data Scientist at Tata America Intl. Corp., Director of Data Science at a bioinformatics startup, and Distinguished Engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata SJC and London 2016, Spark Summit [goo.gl/ab30lD], Strata-Spark Camp, OSCON, PyCon, and PyData; has written about Robots Rules of Order [goo.gl/5yyRv6], Big Data Analytics—Best of the Worst [goo.gl/ImWCaz], predicting NFL games, Spark [http://goo.gl/E4kqMD], data science [http://goo.gl/9pyJMH], machine learning [http://goo.gl/SXF53n], and social media analysis [http://goo.gl/D9YpVQ]; and has been a guest lecturer at the Naval Postgraduate School. His occasional blogs can be found at https://doubleclix.wordpress.com/. His other passions are flying drones (he is working towards a Drone Pilot License, FAA UAS Pilot) and Lego Robotics; you will find him at the St. Louis FLL World Competition as a Robots Design Judge.
Table of Contents
1: INSTALLING SPARK AND SETTING UP YOUR CLUSTER
Directory organization and convention
Installing the prebuilt distribution
Building Spark from source
Spark topology
A single machine
Running Spark on EC2
Deploying Spark with Chef (Opscode)
Deploying Spark on Mesos
Spark on YARN
Spark standalone mode
References
Summary
2: USING THE SPARK SHELL
The Spark shell
Loading a simple text file
Interactively loading data from S3
Summary
3: BUILDING AND RUNNING A SPARK APPLICATION
Building Spark applications
Data wrangling with IPython
Developing Spark with Eclipse
Developing Spark with other IDEs
Building your Spark job with Maven
Building your Spark job with something else
References
Summary
4: CREATING A SPARKSESSION OBJECT
SparkSession versus SparkContext
Building a SparkSession object
SparkContext - metadata
Shared Java and Scala APIs
Python
IPython
Reference
Summary
5: LOADING AND SAVING DATA IN SPARK
Spark abstractions
Data modalities
Data modalities and Datasets/DataFrames/RDDs
Loading data into an RDD
Saving your data
References
Summary
6: MANIPULATING YOUR RDD
Manipulating your RDD in Scala and Java
Manipulating your RDD in Python
References
Summary
7: SPARK 2.0 CONCEPTS
Code and Datasets for the rest of the book
The data scientist and Spark features
Spark v2.0 and beyond
Apache Spark - evolution
Apache Spark - the full stack
The art of a big data store - Parquet
References
Summary
8: SPARK SQL
The Spark SQL architecture
Spark SQL how-to in a nutshell
Spark SQL programming
References
Summary
9: FOUNDATIONS OF DATASETS/DATAFRAMES – THE PROVERBIAL WORKHORSE FOR DATA SCIENTISTS
Datasets - a quick introduction
Dataset APIs - an overview
Dataset interfaces and functions
References
Summary
10: SPARK WITH BIG DATA
Parquet - an efficient and interoperable big data format
HBase
Reference
Summary
11: MACHINE LEARNING WITH SPARK ML PIPELINES
Spark's machine learning algorithm table
Spark machine learning APIs - ML pipelines and MLlib
ML pipelines
Spark ML examples
The API organization
Basic statistics
Linear regression
Classification
Clustering
Recommendation
Hyperparameters
The final thing
References
Summary
12: GRAPHX
Graphs and graph processing - an introduction
Spark GraphX
GraphX - computational model
The first example - graph
Building graphs
The GraphX API landscape
Structural APIs
Community, affiliation, and strengths
Algorithms
Partition strategy
Case study - AlphaGo tweets analytics
References
Summary