Hadoop is a distributed system with a large and rapidly growing
ecosystem, so it is important to understand in depth how a Hadoop
cluster functions in production. Whether you are new to Hadoop or a
seasoned Hadoop specialist, this recipe book contains recipes that dive
deep into Hadoop cluster configuration and optimization.
What this book covers
Chapter 1, Hadoop Architecture and Deployment, covers Hadoop's
architecture, its components, various installation modes and important
daemons, and the services that make Hadoop a robust system. This chapter
covers single-node and multinode clusters.
Chapter 2, Maintaining Hadoop Cluster – HDFS, covers the storage layer,
HDFS, including block size, replication, cluster health, quota
configuration, rack awareness, and the communication channel between
nodes.
Chapter 3, Maintaining Hadoop Cluster – YARN and MapReduce, talks
about the processing layer in Hadoop and the resource management
framework, YARN. This chapter covers YARN fundamentals and how to
configure YARN components, submit jobs, and configure the job history
server.
Chapter 4, High Availability, covers high availability for a Namenode and
Resourcemanager, ZooKeeper configuration, HDFS storage-based
policies, HDFS snapshots, and rolling upgrades.
Chapter 5, Schedulers, talks about YARN schedulers such as the Fair and
Capacity schedulers, with detailed recipes on configuring queues and
queue ACLs, configuring users and groups, and using other queue
administration commands.
Chapter 6, Backup and Recovery, covers Hadoop metastore, backup and
restore procedures on a Namenode, configuration of a secondary
Namenode, and various ways of recovering lost Namenodes. This chapter
also talks about configuring HDFS and YARN logs for troubleshooting.
Chapter 7, Data Ingestion and Workflow, talks about Hive configuration
and its various modes of operation. This chapter also covers setting up
Hive with the credential store and highly available access using
ZooKeeper. The recipes in this chapter give details about the process of
loading data into Hive, partitioning, bucketing concepts, and configuration
with an external metastore. It also covers Oozie installation and Flume
configuration for log ingestion.
Chapter 8, Performance Tuning, covers the performance tuning aspects of
HDFS, YARN containers, the operating system, and network parameters,
as well as optimizing the cluster for production by comparing benchmarks
for various configurations.
Chapter 9, HBase and RDBMS, talks about HBase cluster configuration,
best practices, HBase tuning, backup, and restore. It also covers migration
of data from MySQL to HBase and the procedure to upgrade HBase to the
latest release.
Chapter 10, Cluster Planning, covers Hadoop cluster planning and the best
practices for designing clusters in terms of disk storage, network,
servers, and placement policy. This chapter also covers costing and the
impact of SLA-driven workloads on cluster planning.
Chapter 11, Troubleshooting, Diagnostics, and Best Practices, talks about
the troubleshooting steps for a Namenode and Datanode, and about
diagnosing communication errors. It also covers the logs in detail and
how to parse them for errors to extract the key facts about an issue.
Chapter 12, Security, covers Hadoop security in terms of data encryption,
in-transit encryption, SSL configuration, and, more importantly, configuring
Kerberos for the Hadoop cluster. This chapter also covers auditing and a
recipe on securing ZooKeeper.