[Data Science]Learning Spark


Original poster: hooli, posted 2015-3-28 09:59:37


Table of Contents
Foreword
Preface
1. Introduction to Data Analysis with Spark
What Is Apache Spark?
A Unified Stack
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Cluster Managers
Who Uses Spark, and for What?
Data Science Tasks
Data Processing Applications
A Brief History of Spark
Spark Versions and Releases
Storage Layers for Spark
2. Downloading Spark and Getting Started
Downloading Spark
Introduction to Spark’s Python and Scala Shells
Introduction to Core Spark Concepts
Standalone Applications
Initializing a SparkContext
Building Standalone Applications
Conclusion
3. Programming with RDDs
RDD Basics
Creating RDDs
RDD Operations
Transformations
Actions
Lazy Evaluation
Passing Functions to Spark
Python
Scala
Java
Common Transformations and Actions
Basic RDDs
Converting Between RDD Types
Persistence (Caching)
Conclusion
4. Working with Key/Value Pairs
Motivation
Creating Pair RDDs
Transformations on Pair RDDs
Aggregations
Grouping Data
Joins
Sorting Data
Actions Available on Pair RDDs
Data Partitioning (Advanced)
Determining an RDD’s Partitioner
Operations That Benefit from Partitioning
Operations That Affect Partitioning
Example: PageRank
Custom Partitioners
Conclusion
5. Loading and Saving Your Data
Motivation
File Formats
Text Files
JSON
Comma-Separated Values and Tab-Separated Values
SequenceFiles
Object Files
Hadoop Input and Output Formats
File Compression
Filesystems
Local/“Regular” FS
Amazon S3
HDFS
Structured Data with Spark SQL
Apache Hive
JSON
Databases
Java Database Connectivity
Cassandra
HBase
Elasticsearch
Conclusion
6. Advanced Spark Programming
Introduction
Accumulators
Accumulators and Fault Tolerance
Custom Accumulators
Broadcast Variables
Optimizing Broadcasts
Working on a Per-Partition Basis
Piping to External Programs
Numeric RDD Operations
Conclusion
7. Running on a Cluster
Introduction
Spark Runtime Architecture
The Driver
Executors
Cluster Manager
Launching a Program
Summary
Deploying Applications with spark-submit
Packaging Your Code and Dependencies
A Java Spark Application Built with Maven
A Scala Spark Application Built with sbt
Dependency Conflicts
Scheduling Within and Between Spark Applications
Cluster Managers
Standalone Cluster Manager
Hadoop YARN
Apache Mesos
Amazon EC2
Which Cluster Manager to Use?
Conclusion
8. Tuning and Debugging Spark
Configuring Spark with SparkConf
Components of Execution: Jobs, Tasks, and Stages
Finding Information
Spark Web UI
Driver and Executor Logs
Key Performance Considerations
Level of Parallelism
Serialization Format
Memory Management
Hardware Provisioning
Conclusion
9. Spark SQL
Linking with Spark SQL
Using Spark SQL in Applications
Initializing Spark SQL
Basic Query Example
SchemaRDDs
Caching
Loading and Saving Data
Apache Hive
Parquet
JSON
From RDDs
JDBC/ODBC Server
Working with Beeline
Long-Lived Tables and Queries
User-Defined Functions
Spark SQL UDFs
Hive UDFs
Spark SQL Performance
Performance Tuning Options
Conclusion
10. Spark Streaming
A Simple Example
Architecture and Abstraction
Transformations
Stateless Transformations
Stateful Transformations
Output Operations
Input Sources
Core Sources
Additional Sources
Multiple Sources and Cluster Sizing
24/7 Operation
Checkpointing
Driver Fault Tolerance
Worker Fault Tolerance
Receiver Fault Tolerance
Processing Guarantees
Streaming UI
Performance Considerations
Batch and Window Sizes
Level of Parallelism
Garbage Collection and Memory Usage
Conclusion
11. Machine Learning with MLlib
Overview
System Requirements
Machine Learning Basics
Example: Spam Classification
Data Types
Working with Vectors
Algorithms
Feature Extraction
Statistics
Classification and Regression
Clustering
Collaborative Filtering and Recommendation
Dimensionality Reduction
Model Evaluation
Tips and Performance Considerations
Preparing Features
Configuring Algorithms
Caching RDDs to Reuse
Recognizing Sparsity
Level of Parallelism
Pipeline API
Conclusion
Index

Hidden content of this post:

OReilly.Learning.Spark.2015.1.pdf (7.82 MB, requires 100 forum coins)




Post #2

Lisrelchen, posted 2015-3-28 10:14:50

[emoticon]

Post #3

auirzxp, posted 2015-3-28 10:26:27

Notice: the author has been banned or deleted; the content is hidden automatically.

Post #4

mike68097, posted 2015-3-28 10:33:31

Post #5

Nicolle, posted 2015-3-28 10:33:52

Notice: the author has been banned or deleted; the content is hidden automatically.

Post #6

auirzxp, posted 2015-3-28 10:39:04

Notice: the author has been banned or deleted; the content is hidden automatically.

Post #7

bfzldh, posted 2015-3-28 11:28:01

[emoticon]

Post #8

bailihongchen, posted 2015-3-28 12:58:23

thanks for sharing

Post #9

complicated, posted 2015-3-30 09:33:26

datadatadaydaydatadata

Post #10

ntsean, posted 2015-3-31 06:55:41

thanks
