Original poster: hooli

[Data Science] Learning Spark

Original post
hooli (employment verified), posted 2015-3-28 09:59:37

Table of Contents
Foreword ix
Preface xi
1. Introduction to Data Analysis with Spark 1
What Is Apache Spark? 1
A Unified Stack 2
Spark Core 3
Spark SQL 3
Spark Streaming 3
MLlib 4
GraphX 4
Cluster Managers 4
Who Uses Spark, and for What? 4
Data Science Tasks 5
Data Processing Applications 6
A Brief History of Spark 6
Spark Versions and Releases 7
Storage Layers for Spark 7
2. Downloading Spark and Getting Started 9
Downloading Spark 9
Introduction to Spark’s Python and Scala Shells 11
Introduction to Core Spark Concepts 14
Standalone Applications 17
Initializing a SparkContext 17
Building Standalone Applications 18
Conclusion 21
3. Programming with RDDs 23
RDD Basics 23
Creating RDDs 25
RDD Operations 26
Transformations 27
Actions 28
Lazy Evaluation 29
Passing Functions to Spark 30
Python 30
Scala 31
Java 32
Common Transformations and Actions 34
Basic RDDs 34
Converting Between RDD Types 42
Persistence (Caching) 44
Conclusion 46
4. Working with Key/Value Pairs 47
Motivation 47
Creating Pair RDDs 48
Transformations on Pair RDDs 49
Aggregations 51
Grouping Data 57
Joins 58
Sorting Data 59
Actions Available on Pair RDDs 60
Data Partitioning (Advanced) 61
Determining an RDD’s Partitioner 64
Operations That Benefit from Partitioning 65
Operations That Affect Partitioning 65
Example: PageRank 66
Custom Partitioners 68
Conclusion 70
5. Loading and Saving Your Data 71
Motivation 71
File Formats 72
Text Files 73
JSON 74
Comma-Separated Values and Tab-Separated Values 77
SequenceFiles 80
Object Files 83
Hadoop Input and Output Formats 84
File Compression 87
Filesystems 89
Local/“Regular” FS 89
Amazon S3 90
HDFS 90
Structured Data with Spark SQL 91
Apache Hive 91
JSON 92
Databases 93
Java Database Connectivity 93
Cassandra 94
HBase 96
Elasticsearch 97
Conclusion 98
6. Advanced Spark Programming 99
Introduction 99
Accumulators 100
Accumulators and Fault Tolerance 103
Custom Accumulators 103
Broadcast Variables 104
Optimizing Broadcasts 106
Working on a Per-Partition Basis 107
Piping to External Programs 109
Numeric RDD Operations 113
Conclusion 115
7. Running on a Cluster 117
Introduction 117
Spark Runtime Architecture 117
The Driver 118
Executors 119
Cluster Manager 119
Launching a Program 120
Summary 120
Deploying Applications with spark-submit 121
Packaging Your Code and Dependencies 123
A Java Spark Application Built with Maven 124
A Scala Spark Application Built with sbt 126
Dependency Conflicts 128
Scheduling Within and Between Spark Applications 128
Cluster Managers 129
Standalone Cluster Manager 129
Hadoop YARN 133
Apache Mesos 134
Amazon EC2 135
Which Cluster Manager to Use? 138
Conclusion 139
8. Tuning and Debugging Spark 141
Configuring Spark with SparkConf 141
Components of Execution: Jobs, Tasks, and Stages 145
Finding Information 150
Spark Web UI 150
Driver and Executor Logs 154
Key Performance Considerations 155
Level of Parallelism 155
Serialization Format 156
Memory Management 157
Hardware Provisioning 158
Conclusion 160
9. Spark SQL 161
Linking with Spark SQL 162
Using Spark SQL in Applications 164
Initializing Spark SQL 164
Basic Query Example 165
SchemaRDDs 166
Caching 169
Loading and Saving Data 170
Apache Hive 170
Parquet 171
JSON 172
From RDDs 174
JDBC/ODBC Server 175
Working with Beeline 177
Long-Lived Tables and Queries 178
User-Defined Functions 178
Spark SQL UDFs 178
Hive UDFs 179
Spark SQL Performance 180
Performance Tuning Options 180
Conclusion 182
10. Spark Streaming 183
A Simple Example 184
Architecture and Abstraction 186
Transformations 189
Stateless Transformations 190
Stateful Transformations 192
Output Operations 197
Input Sources 199
Core Sources 199
Additional Sources 200
Multiple Sources and Cluster Sizing 204
24/7 Operation 205
Checkpointing 205
Driver Fault Tolerance 206
Worker Fault Tolerance 207
Receiver Fault Tolerance 207
Processing Guarantees 208
Streaming UI 208
Performance Considerations 209
Batch and Window Sizes 209
Level of Parallelism 210
Garbage Collection and Memory Usage 210
Conclusion 211
11. Machine Learning with MLlib 213
Overview 213
System Requirements 214
Machine Learning Basics 215
Example: Spam Classification 216
Data Types 218
Working with Vectors 219
Algorithms 220
Feature Extraction 221
Statistics 223
Classification and Regression 224
Clustering 229
Collaborative Filtering and Recommendation 230
Dimensionality Reduction 232
Model Evaluation 234
Tips and Performance Considerations 234
Preparing Features 234
Configuring Algorithms 235
Caching RDDs to Reuse 235
Recognizing Sparsity 235
Level of Parallelism 236
Pipeline API 236
Conclusion 237
Index 239


Hidden content in this post

OReilly.Learning.Spark.2015.1.pdf (7.82 MB, requires 100 forum coins)

Keywords: Data Science, Learning, Science, Learn, Big Data, Machine Learning, Big Data Analysis, Spark

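Rated by 2 users:
Lisrelchen: experience +100, forum coins +100, academic level +5, helpfulness +5. Reason: excellent post
Nicolle: experience +100, academic level +5, helpfulness +5, credit rating +5. Reason: excellent post

Total: experience +200, forum coins +100, academic level +10, helpfulness +10, credit rating +5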

2
Lisrelchen, posted 2015-3-28 10:14:50
[em23]

3
auirzxp (verified student), posted 2015-3-28 10:26:27

4
mike68097, posted 2015-3-28 10:33:31

5
Nicolle (verified student), posted 2015-3-28 10:33:52
Notice: the author has been banned or deleted; the content is automatically hidden.

6
auirzxp (verified student), posted 2015-3-28 10:39:04

7
bfzldh (verified student), posted 2015-3-28 11:28:01
[em23][em23][em23][em23]

8
bailihongchen, posted 2015-3-28 12:58:23
thanks for sharing

9
complicated (employment verified), posted 2015-3-30 09:33:26
datadatadaydaydatadata

10
ntsean, posted 2015-3-31 06:55:41
thanks
