Original poster: cwh2008

[ETL] Scala and Spark for Big Data Analytics

The continued growth in data, coupled with the need to make increasingly
complex decisions against that data, is creating massive hurdles that prevent
organizations from deriving insights in a timely manner using traditional
analytical approaches. Indeed, big data has become so closely tied to
distributed processing frameworks that the field's scope is often defined by
what those frameworks can handle. Whether you're scrutinizing the
clickstream from millions of visitors to optimize online ad placements, or
sifting through billions of transactions to identify signs of fraud, the need
for advanced analytics, such as machine learning and graph processing, to
automatically glean insights from enormous volumes of data is more evident
than ever.
Apache Spark, the de facto standard for big data processing, analytics, and
data science across academia and industry, provides both machine learning
and graph processing libraries, allowing companies to tackle complex
problems easily with the power of highly scalable, clustered computers.
Spark's promise is to take this a little further: to make writing distributed
programs in Scala feel like writing regular Scala programs. Spark gives ETL
pipelines a huge boost in performance and eases some of the pain that feeds
the MapReduce programmer's daily chant of despair to the Hadoop gods.
In this book, we use Spark and Scala to bring state-of-the-art advanced data
analytics, machine learning, graph processing, streaming, and SQL to bear on
big data, through the MLlib, ML, SQL, GraphX, and other libraries.
We start with Scala, then move on to Spark, and finally cover some advanced
topics in big data analytics with Spark and Scala. In the appendix, we show
how to extend your Scala knowledge to SparkR, PySpark, Apache Zeppelin,
and the in-memory storage system Alluxio. This book isn't meant to be read
from cover to cover: skip to a chapter that looks like something you're
trying to accomplish, or that simply ignites your interest.

Chapter 1, Introduction to Scala, teaches big data analytics using the Scala-
based APIs of Spark. Spark itself is written in Scala, so as a natural
starting point we give a brief introduction to Scala: basic aspects of its
history and purpose, and how to install Scala on Windows, Linux, and Mac OS.
After that, Scala web frameworks are discussed briefly. Then we provide a
comparative analysis of Java and Scala. Finally, we dive into Scala
programming to get started with Scala.
Chapter 2, Object-Oriented Scala, shows how the object-oriented programming
(OOP) paradigm provides a whole new layer of abstraction. In short, this
chapter discusses some of the greatest strengths of OOP languages:
discoverability, modularity, and extensibility. In particular, we will see how
to deal with variables in Scala; methods, classes, and objects; packages and
package objects; traits and trait linearization; and Java interoperability.
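As a small taste of the trait linearization covered in that chapter, here is a minimal, self-contained sketch (all class and method names are illustrative, not from the book):

```scala
// Minimal sketch of classes, traits, and trait linearization.
trait Greeter {
  def greet(name: String): String = s"Hello, $name"
}

trait Polite extends Greeter {
  override def greet(name: String): String = super.greet(name) + ", nice to meet you"
}

trait Loud extends Greeter {
  override def greet(name: String): String = super.greet(name).toUpperCase
}

// Linearization resolves right-to-left: Loud's override runs first,
// and its super call dispatches to Polite, then Greeter.
class Host extends Greeter with Polite with Loud

object OopDemo {
  def main(args: Array[String]): Unit =
    println(new Host().greet("Ada"))  // HELLO, ADA, NICE TO MEET YOU
}
```

Reordering the mixins (`with Loud with Polite`) changes which override runs first, which is exactly the subtlety trait linearization pins down.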
Chapter 3, Functional Programming Concepts, showcases functional
programming in Scala. More specifically, we will learn several topics, such
as why Scala is an arsenal for the data scientist, why it is important to
learn the Spark paradigm, and pure functions and higher-order functions
(HOFs). A real-life use case of HOFs is shown too. Then we will see how to
handle exceptions in higher-order functions outside of collections using
Scala's standard library. Finally, we will look at how functional Scala
affects an object's mutability.
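To make those ideas concrete, a higher-order function and `Try`-based exception handling might look like the following sketch (function names are illustrative):

```scala
import scala.util.{Try, Success, Failure}

object FpDemo {
  // A higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A risky computation wrapped in Try, so failure becomes a value
  // instead of a thrown exception.
  def safeDiv(a: Int, b: Int): Try[Int] = Try(a / b)

  def main(args: Array[String]): Unit = {
    println(applyTwice(_ + 3, 10))  // 16
    safeDiv(10, 0) match {
      case Success(v) => println(s"result: $v")
      case Failure(e) => println(s"failed: ${e.getMessage}")
    }
  }
}
```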
Chapter 4, Collection APIs, introduces one of the features that attracts most
Scala users: the Collections API. It is powerful and flexible, and offers
many composable operations. We will demonstrate the capabilities of the Scala
Collection API and how it can be used to accommodate different types of data
and solve a wide range of problems. In this chapter, we will cover the Scala
collection APIs, their types and hierarchy, some performance characteristics,
Java interoperability, and Scala implicits.
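A short sketch of the composable style the Collections API encourages (values are illustrative):

```scala
object CollectionsDemo {
  val xs = List(3, 1, 4, 1, 5, 9)

  // Transformations compose and never mutate the source list.
  val evensDoubled = xs.filter(_ % 2 == 0).map(_ * 2)  // List(8)
  val total        = xs.sum                            // 23

  // groupBy builds a Map keyed by each element's computed key.
  val byParity = xs.groupBy(_ % 2)
  // Map(1 -> List(3, 1, 1, 5, 9), 0 -> List(4))

  def main(args: Array[String]): Unit =
    println((evensDoubled, total, byParity))
}
```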
Chapter 5, Tackle Big Data - Spark Comes to the Party, outlines data analysis
and big data; we see the challenges that big data poses, how distributed
computing deals with them, and the approaches suggested by functional
programming. We introduce Google's MapReduce, Apache Hadoop, and finally
Apache Spark, and see how they embraced this approach and these techniques.
We look into the evolution of Apache Spark: why it was created in the first
place and the value it brings to the challenges of big data analytics and
processing.
Chapter 6, Start Working with Spark - REPL and RDDs, covers how Spark
works; then we introduce RDDs, the basic abstraction behind Apache Spark,
and see that they are simply distributed collections exposing Scala-like
APIs. We will look at the deployment options for Apache Spark and run it
locally as a Spark shell. We will learn the internals of Apache Spark: what
RDDs are, their DAGs and lineage, and transformations and actions.
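The split between lazy transformations and eager actions can be sketched like this (a minimal local-mode example; it assumes Spark is on the classpath, and the names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object RddDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddDemo")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

    // Transformations are lazy: they only extend the RDD's lineage (DAG).
    val bigSquares = rdd.map(n => n * n).filter(_ > 4)

    // An action triggers actual execution of the DAG.
    println(bigSquares.collect().mkString(", "))  // 9, 16, 25

    spark.stop()
  }
}
```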
Chapter 7, Special RDD Operations, focuses on how RDDs can be tailored to
meet different needs, and how they provide new functionality (and
dangers!). Moreover, we investigate other useful objects that Spark provides,
such as broadcast variables and accumulators. We will also learn aggregation
techniques and shuffling.
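Broadcast variables and accumulators can be sketched together in one small local-mode program (assumes Spark on the classpath; the lookup data is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

object SharedVarsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("shared").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast: ship a read-only lookup table to each executor once,
    // instead of once per task.
    val countries = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

    // Accumulator: tasks add to it; only the driver reads the total.
    val unknown = sc.longAccumulator("unknown-codes")

    val resolved = sc.parallelize(Seq("DE", "FR", "XX")).map { code =>
      countries.value.getOrElse(code, { unknown.add(1); "unknown" })
    }.collect()

    println(resolved.mkString(", "))                  // Germany, France, unknown
    println(s"unknown codes seen: ${unknown.value}")  // 1
    spark.stop()
  }
}
```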
Chapter 8, Introduce a Little Structure - SparkSQL, teaches how to use Spark
for the analysis of structured data, as a higher-level abstraction over RDDs,
and how Spark SQL's APIs make querying structured data simple yet robust.
Moreover, we introduce datasets and look at the differences between datasets,
DataFrames, and RDDs. We will also learn join operations and window
functions for complex data analysis with the DataFrame API.
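A join followed by a window function, in the DataFrame API, might look like the following sketch (local mode; the sample data is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("sql").getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "alice", 30.0), (2, "bob", 10.0), (3, "alice", 20.0))
      .toDF("id", "user", "amount")
    val users = Seq(("alice", "DE"), ("bob", "FR")).toDF("user", "country")

    // Join on the shared "user" column, then rank each user's
    // orders by amount with a window function.
    val byUser = Window.partitionBy("user").orderBy(desc("amount"))
    orders.join(users, "user")
      .withColumn("rank", row_number().over(byUser))
      .show()

    spark.stop()
  }
}
```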
Chapter 9, Stream Me Up, Scotty - Spark Streaming, takes you through Spark
Streaming and how to take advantage of it to process streams of data using
the Spark API. Moreover, the reader will learn various ways of processing
real-time streams of data, with a practical example of consuming and
processing tweets from Twitter. We will look at integration with Apache
Kafka for real-time processing, and at Structured Streaming, which can
provide real-time queries to your applications.
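As a flavor of Structured Streaming, a continuously updated word count over a socket source can be sketched as follows (assumes Spark locally and a text server on localhost:9999, e.g. `nc -lk 9999`; illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object StreamDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("stream").getOrCreate()
    import spark.implicits._

    // An unbounded DataFrame of lines arriving on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The word count is re-computed incrementally as new lines arrive.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

The same `groupBy`/`count` code would work on a static DataFrame; treating a stream as an unbounded table is the core idea of Structured Streaming.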


Keywords: Analytics, Big Data, Scala

Scala and Spark for Big Data Analytics.pdf (37.49 MB)

Requires: 3 forum coins [Purchase]

Rated by 1 member: xujingtang, experience +80, "reward for actively uploading good material".


#2 军旗飞扬, posted 2017-10-28 21:55:31
Thanks for sharing, OP.

#3 xujingtang, posted 2017-10-29 10:34:33

#4 heiyaodai, posted 2019-2-17 20:17:25
Thanks for sharing.

#5 bayernsci, posted 2019-7-14 12:29:07
A PDF version with bookmarks, excellent! Thanks for sharing!
