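An introduction, aimed at data analysis, to the many concepts, tools, and techniques involved in distributed computing, with a survey of the Hadoop ecosystem
Authors: Benjamin Bengfort, Jenny Kim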
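Summary
By providing distributed data storage and a parallel computing framework, Hadoop has evolved from a cluster-computing abstraction into an operating system for big data. This book gives a readable, intuitive overview of cluster computing and analytics, presented from a data scientist's perspective, to pave the way for data scientists to dig deeper into specific topic areas.
The book has two parts. Part I introduces distributed computing at a very high level and discusses how to run computations on a cluster. Part II focuses on the tools and techniques data scientists should know, which are intended to power a wide range of analytics and large-scale data management.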
First edition, April 2018
Contents:
Part I: Introduction to Distributed Computing
Chapter 1: The Age of the Data Product
1.1 What Is a Data Product
1.2 Building Large-Scale Data Products with Hadoop
1.2.1 Leveraging Large Datasets
1.2.2 Hadoop in Data Products
1.3 The Data Science Pipeline and the Hadoop Ecosystem
1.4 Summary
Chapter 2: An Operating System for Big Data
2.1 Basic Concepts
2.2 Hadoop Architecture
2.2.1 Hadoop Clusters
2.2.2 HDFS
2.2.3 YARN
2.3 Working with a Distributed File System
2.3.1 Basic File System Operations
2.3.2 HDFS File Permissions
2.3.3 Other HDFS Interfaces
2.4 Working with Distributed Computation
2.4.1 MapReduce: A Functional Programming Model
2.4.2 MapReduce: Implementation on a Cluster
2.4.3 Beyond a Single MapReduce: Job Chaining
2.5 Submitting a MapReduce Job to YARN
2.6 Summary
Chapter 3: Python Frameworks and Hadoop Streaming
3.1 Hadoop Streaming
3.1.1 Running Computations on CSV Data with Streaming
3.1.2 Executing Streaming Jobs
3.2 MapReduce Frameworks for Python
3.2.1 Phrase Counting
3.2.2 Other Frameworks
3.3 Advanced MapReduce
3.3.1 combiner
3.3.2 partitioner
3.3.3 Job Chaining
3.4 Summary
Chapter 4: In-Memory Computing with Spark
4.1 Spark Basics
4.1.1 The Spark Stack
4.1.2 RDD
4.1.3 Programming with RDDs
4.2 Interactive Spark with PySpark
4.3 Writing Spark Applications
4.4 Summary
Chapter 5: Distributed Analysis and Patterns
5.1 Computing with Keys
5.1.1 Compound Keys
5.1.2 Keyspace Patterns
5.1.3 pair versus stripe
5.2 Design Patterns
5.2.1 Summarization
5.2.2 Indexing
5.2.3 Filtering
5.3 Toward Last-Mile Analytics
5.3.1 Model Fitting
5.3.2 Model Validation
5.4 Summary
Part II: Workflows and Tools for Big Data Science
Chapter 6: Data Mining and Data Warehousing
6.1 Structured Data Queries with Hive
6.1.1 The Hive Command-Line Interface (CLI)
6.1.2 The Hive Query Language
6.1.3 Data Analysis with Hive
6.2 HBase
6.2.1 NoSQL and Column-Oriented Databases
6.2.2 Real-Time Analytics with HBase
6.3 Summary
Chapter 7: Data Ingestion
7.1 Importing Relational Data with Sqoop
7.1.1 Importing from MySQL into HDFS
7.1.2 Importing from MySQL into Hive
7.1.3 Importing from MySQL into HBase
7.2 Ingesting Streaming Data with Flume
7.2.1 Flume Data Flows
7.2.2 Ingesting Product Impression Data with Flume
7.3 Summary
Chapter 8: Analytics with Higher-Level APIs
8.1 Pig
8.1.1 Pig Latin
8.1.2 Data Types
8.1.3 Relational Operators
8.1.4 User-Defined Functions
8.1.5 Pig Summary
8.2 Spark's Higher-Level APIs
8.2.1 Spark SQL
8.2.2 DataFrame
8.3 Summary
Chapter 9: Machine Learning
9.1 Scalable Machine Learning with Spark
9.1.1 Collaborative Filtering
9.1.2 Classification
9.1.3 Clustering
9.2 Summary
Chapter 10: Conclusion: Distributed Data Science in Practice
10.1 The Data Product Lifecycle
10.1.1 Data Lakes
10.1.2 Data Ingestion
10.1.3 Computational Data Stores
10.2 The Machine Learning Lifecycle
10.3 Summary
Appendix A: Creating a Hadoop Pseudo-Distributed Development Environment
Appendix B: Installing Hadoop Ecosystem Products
Glossary
Link: https://pan.baidu.com/s/1EvASESLjODkUkB1eqNiJTQ
Extraction code: xkj5

