Original poster: linxiaochen

[Hadoop] [E-book download] Hadoop数据分析.pdf (Data Analytics with Hadoop)

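linxiaochen posted on 2019-5-14 23:14:58
[图灵程序设计丛书].Hadoop数据分析.pdf (Turing Programming Series: Data Analytics with Hadoop)
An introduction, aimed at data analysis, to the many concepts, tools, and techniques of distributed computing, with a tour of the Hadoop ecosystem.
Authors: Benjamin Bengfort, Jenny Kim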

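Synopsis
By providing distributed data storage and a framework for parallel computation, Hadoop has evolved from an abstraction for cluster computing into an operating system for big data. Written from a data scientist's perspective, this book gives a readable, intuitive overview of cluster computing and analytics on Hadoop, paving the way for data scientists to dig deeper into specific topics. The book has two parts. Part I introduces distributed computing at a very high level and discusses how to run computations on a cluster. Part II focuses on the tools and techniques that data scientists should know, which are meant to power a variety of analytics and large-scale data management.
First edition, April 2018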



Table of contents:
Part I. Introduction to Distributed Computing
Chapter 1. The Age of the Data Product
1.1 What Is a Data Product?
1.2 Building Data Products at Scale with Hadoop
1.2.1 Leveraging Large Datasets
1.2.2 Hadoop for Data Products
1.3 The Data Science Pipeline and the Hadoop Ecosystem
1.4 Summary
Chapter 2. An Operating System for Big Data
2.1 Basic Concepts
2.2 Hadoop Architecture
2.2.1 A Hadoop Cluster
2.2.2 HDFS
2.2.3 YARN
2.3 Working with a Distributed File System
2.3.1 Basic File System Operations
2.3.2 File Permissions in HDFS
2.3.3 Other HDFS Interfaces
2.4 Working with Distributed Computation
2.4.1 MapReduce: A Functional Programming Model
2.4.2 MapReduce: Implemented on a Cluster
2.4.3 Beyond a Single MapReduce: Job Chaining
2.5 Submitting a MapReduce Job to YARN
2.6 Summary
Chapter 3. A Framework for Python and Hadoop Streaming
3.1 Hadoop Streaming
3.1.1 Computing on CSV Data with Streaming
3.1.2 Executing Streaming Jobs
3.2 A Framework for MapReduce with Python
3.2.1 Phrase Counting
3.2.2 Other Frameworks
3.3 Advanced MapReduce
3.3.1 Combiners
3.3.2 Partitioners
3.3.3 Job Chaining
3.4 Summary
Chapter 4. In-Memory Computing with Spark
4.1 Spark Basics
4.1.1 The Spark Stack
4.1.2 RDDs
4.1.3 Programming with RDDs
4.2 Interactive Spark Using PySpark
4.3 Writing Spark Applications
4.4 Summary
Chapter 5. Distributed Analysis and Patterns
5.1 Computing with Keys
5.1.1 Compound Keys
5.1.2 Keyspace Patterns
5.1.3 Pairs Versus Stripes
5.2 Design Patterns
5.2.1 Summarization
5.2.2 Indexing
5.2.3 Filtering
5.3 Toward Last-Mile Analytics
5.3.1 Fitting a Model
5.3.2 Validating Models
5.4 Summary

Part II. Workflows and Tools for Big Data Science
Chapter 6. Data Mining and Warehousing
6.1 Structured Data Queries with Hive
6.1.1 The Hive Command-Line Interface (CLI)
6.1.2 Hive Query Language
6.1.3 Data Analysis with Hive
6.2 HBase
6.2.1 NoSQL and Column-Oriented Databases
6.2.2 Real-Time Analytics with HBase
6.3 Summary
Chapter 7. Data Ingestion
7.1 Importing Relational Data with Sqoop
7.1.1 Importing from MySQL to HDFS
7.1.2 Importing from MySQL to Hive
7.1.3 Importing from MySQL to HBase
7.2 Ingesting Streaming Data with Flume
7.2.1 Flume Data Flows
7.2.2 Ingesting Product Impression Data with Flume
7.3 Summary
Chapter 8. Analytics with Higher-Level APIs
8.1 Pig
8.1.1 Pig Latin
8.1.2 Data Types
8.1.3 Relational Operators
8.1.4 User-Defined Functions
8.1.5 Pig Summary
8.2 Spark's Higher-Level APIs
8.2.1 Spark SQL
8.2.2 DataFrames
8.3 Summary

Chapter 9. Machine Learning
9.1 Scalable Machine Learning with Spark
9.1.1 Collaborative Filtering
9.1.2 Classification
9.1.3 Clustering
9.2 Summary

Chapter 10. Summary: Doing Distributed Data Science
10.1 Data Product Lifecycle
10.1.1 Data Lakes
10.1.2 Data Ingestion
10.1.3 Computational Data Stores
10.2 Machine Learning Lifecycle
10.3 Summary
Appendix A. Creating a Hadoop Pseudo-Distributed Development Environment
Appendix B. Installing Hadoop Ecosystem Products
Glossary


Link: https://pan.baidu.com/s/1EvASESLjODkUkB1eqNiJTQ
Extraction code: xkj5



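hifinecon posted on 2019-5-15 09:37:20 (from mobile)
linxiaochen posted on 2019-5-14 23:14
[图灵程序设计丛书].Hadoop数据分析.pdf
An introduction, aimed at data analysis, to the many concepts, tools, and techniques of distributed computing, with a tour of the Ha ...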


qgjtso111 posted on 2019-5-15 13:12:28
The archive needs an extraction password. Where can I find it?

heiyaodai posted on 2019-5-15 22:12:26
Thanks for sharing.

linxiaochen posted on 2019-5-19 21:44:14
qgjtso111 posted on 2019-5-15 13:12
The archive needs an extraction password. Where can I find it?
https://fgk.pw/i/jXC6QKZ2604