Preface
This book covers the application of Hadoop and its ecosystem of tools to solve business
problems. Hadoop has fast emerged as the leading big data platform and finds applications
in many industries where massive datasets, or big data, have to be stored and analyzed.
Hadoop lowers the cost of investment in storage. It supports the generation of new
business insights, which were not possible earlier because of the massive data volumes and
computing capacity required to process such information. This book works through several
business cases, building a solution to each business problem. Each solution covered in this book
has been built using Hadoop, HDFS, and a set of tools from the Hadoop ecosystem.
What this book covers
Chapter 1, Hadoop and Big Data, describes the pivotal role Hadoop has played, since its
beginnings in the previous decade, in making several Internet businesses successful with big
data. This chapter covers a brief history and the story of the evolution of
Hadoop. It covers the Hadoop architecture and the MapReduce data processing
framework. It introduces basic Hadoop programming in Java and provides a detailed
overview of the business cases covered in the following chapters of this book. This
chapter builds the foundation for understanding the rest of the book.
Chapter 2, A 360-Degree View of the Customer, covers building a 360-degree view of
the customer. A good 360-degree view requires the integration of data from various
sources. The data sources are database management systems storing master data and
transactional data. Other data sources might include data captured from social media
feeds. In this chapter, we will be integrating data from CRM systems, web logs, and
Twitter feeds to build the 360-degree view and present it using a simple web interface. We
will learn about Apache Sqoop and Apache Hive in the process of building our solution.
Chapter 3, Building a Fraud Detection System, covers the building of a real-time fraud
detection system. This system predicts whether a financial transaction could be fraudulent
by applying a clustering algorithm to a stream of transactions. We will learn about the
architecture of the system and the coding steps involved in building the system. We will
learn about Apache Spark in the process of building our solution.
Chapter 4, Marketing Campaign Planning, shows how to build a system that can improve
the effectiveness of marketing campaigns. This system is a batch analytics system that
uses historical campaign-response data to predict who is going to respond to a marketing
folder. We will see how we can build a predictive model and use it to predict who is going
to respond to which folder in our marketing campaign. We will learn about BigML in the
process of building our solution.
Chapter 5, Churn Detection, explains how to use Hadoop to predict which customers are
likely to move over to another company. We will cover the business case of a mobile
telecom provider that would like to detect the customers who are likely to churn. These
customers are given special incentives to encourage them to stay with the same provider. We
will apply Bayes’ Theorem to calculate the likelihood of churn. The model for churn
detection will be built using Hadoop. We will learn about writing MapReduce programs in
Java in the process of building our solution.
Chapter 6, Analyze Sensor Data Using Hadoop, is about how to build a system to analyze
sensor data. Nowadays, sensors are considered an important source of big data. We will
learn how Hadoop and big-data technologies can be helpful in the Internet of Things (IoT)
domain. IoT is a network of connected devices that generate data through sensors. We will
build a system to monitor the quality of the environment, such as humidity and
temperature, in a factory. We will introduce Apache Kafka, Grafana, and OpenTSDB tools
in the process of building the solution.
Chapter 7, Building a Data Lake, takes you through building a data lake using Hadoop and
several other tools to import data into a data lake and provide secure access to the data. Data
lakes are a popular business case for Hadoop. In a data lake, we store data from multiple
sources to build a single source of data for the enterprise and build a security layer around
it. We will learn about Apache Ranger, Apache Flume, and Apache Zeppelin in the
process of building our solution.
Chapter 8, Future Directions, covers four separate topics that are relevant to Hadoop-
based projects. These topics are building a Hadoop solutions team, Hadoop on the cloud,
NoSQL databases, and in-memory databases. This chapter does not include any coding
examples, unlike the other chapters. These four topics are covered in essay
form so that you can explore them further.