楼主: tongjixuejiajp
2719 4

[数据挖掘理论与案例] 2014 新书 斯坦福教授 big data hadoop mapreduce 《mining of massive data>> [推广有奖]

  • 0关注
  • 0粉丝

已卖:59份资源

高中生

22%

还不是VIP/贵宾

-

威望
0
论坛币
598 个
通用积分
0
学术水平
2 点
热心指数
3 点
信用等级
2 点
经验
194 点
帖子
7
精华
0
在线时间
38 小时
注册时间
2014-12-22
最后登录
2015-3-7

楼主
tongjixuejiajp 发表于 2014-12-23 09:20:39 |AI写论文

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币
2014 新书 斯坦福教授 《mining of massive data>> hadoop, mapreduce, all kinds of mining techniques

1 Data Mining 1
1.1 What is Data Mining? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Statistical Modeling . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Computational Approaches to Modeling . . . . . . . . . . 2
1.1.4 Summarization . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.5 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Statistical Limits on Data Mining . . . . . . . . . . . . . . . . . . 4
1.2.1 Total Information Awareness . . . . . . . . . . . . . . . . 5
1.2.2 Bonferroni’s Principle . . . . . . . . . . . . . . . . . . . . 5
1.2.3 An Example of Bonferroni’s Principle . . . . . . . . . . . 6
1.2.4 Exercises for Section 1.2 . . . . . . . . . . . . . . . . . . . 7
1.3 Things Useful to Know . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Importance of Words in Documents . . . . . . . . . . . . 7
1.3.2 Hash Functions . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.4 Secondary Storage . . . . . . . . . . . . . . . . . . . . . . 11
1.3.5 The Base of Natural Logarithms . . . . . . . . . . . . . . 12
1.3.6 Power Laws . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.7 Exercises for Section 1.3 . . . . . . . . . . . . . . . . . . . 15
1.4 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Summary of Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 References for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . 18
2 MapReduce and the New Software Stack 21
2.1 Distributed File Systems . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.1 Physical Organization of Compute Nodes . . . . . . . . . 22
2.1.2 Large-Scale File-System Organization . . . . . . . . . . . 23
2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 The Map Tasks . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Grouping by Key . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 The Reduce Tasks . . . . . . . . . . . . . . . . . . . . . . 27
2.2.4 Combiners . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
viiviii CONTENTS
2.2.5 Details of MapReduce Execution . . . . . . . . . . . . . . 28
2.2.6 Coping With Node Failures . . . . . . . . . . . . . . . . . 29
2.2.7 Exercises for Section 2.2 . . . . . . . . . . . . . . . . . . . 30
2.3 Algorithms Using MapReduce . . . . . . . . . . . . . . . . . . . . 30
2.3.1 Matrix-Vector Multiplication by MapReduce . . . . . . . 31
2.3.2 If the Vector v Cannot Fit in Main Memory . . . . . . . . 31
2.3.3 Relational-Algebra Operations . . . . . . . . . . . . . . . 32
2.3.4 Computing Selections by MapReduce . . . . . . . . . . . 35
2.3.5 Computing Projections by MapReduce . . . . . . . . . . . 36
2.3.6 Union, Intersection, and Difference by MapReduce . . . . 36
2.3.7 Computing Natural Join by MapReduce . . . . . . . . . . 37
2.3.8 Grouping and Aggregation by MapReduce . . . . . . . . . 37
2.3.9 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . 38
2.3.10 Matrix Multiplication with One MapReduce Step . . . . . 39
2.3.11 Exercises for Section 2.3 . . . . . . . . . . . . . . . . . . . 40
2.4 Extensions to MapReduce . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 Workflow Systems . . . . . . . . . . . . . . . . . . . . . . 41
2.4.2 Recursive Extensions to MapReduce . . . . . . . . . . . . 42
2.4.3 Pregel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.4 Exercises for Section 2.4 . . . . . . . . . . . . . . . . . . . 46
2.5 The Communication Cost Model . . . . . . . . . . . . . . . . . . 46
2.5.1 Communication-Cost for Task Networks . . . . . . . . . . 47
2.5.2 Wall-Clock Time . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.3 Multiway Joins . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.4 Exercises for Section 2.5 . . . . . . . . . . . . . . . . . . . 52
2.6 Complexity Theory for MapReduce . . . . . . . . . . . . . . . . . 54
2.6.1 Reducer Size and Replication Rate . . . . . . . . . . . . . 54
2.6.2 An Example: Similarity Joins . . . . . . . . . . . . . . . . 55
2.6.3 A Graph Model for MapReduce Problems . . . . . . . . . 57
2.6.4 Mapping Schemas . . . . . . . . . . . . . . . . . . . . . . 58
2.6.5 When Not All Inputs Are Present . . . . . . . . . . . . . 60
2.6.6 Lower Bounds on Replication Rate . . . . . . . . . . . . . 61
2.6.7 Case Study: Matrix Multiplication . . . . . . . . . . . . . 62
2.6.8 Exercises for Section 2.6 . . . . . . . . . . . . . . . . . . . 66
2.7 Summary of Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . 67
2.8 References for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . 69
3 Finding Similar Items 73
3.1 Applications of Near-Neighbor Search . . . . . . . . . . . . . . . 73
3.1.1 Jaccard Similarity of Sets . . . . . . . . . . . . . . . . . . 74
3.1.2 Similarity of Documents . . . . . . . . . . . . . . . . . . . 74
3.1.3 Collaborative Filtering as a Similar-Sets Problem . . . . . 75
3.1.4 Exercises for Section 3.1 . . . . . . . . . . . . . . . . . . . 77
3.2 Shingling of Documents . . . . . . . . . . . . . . . . . . . . . . . 77
3.2.1 k-Shingles . . . . . . . . . . . . . . . . . . . . . . . . . . . 77CONTENTS ix
3.2.2 Choosing the Shingle Size . . . . . . . . . . . . . . . . . . 78
3.2.3 Hashing Shingles . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.4 Shingles Built from Words . . . . . . . . . . . . . . . . . . 79
3.2.5 Exercises for Section 3.2 . . . . . . . . . . . . . . . . . . . 80
3.3 Similarity-Preserving Summaries of Sets . . . . . . . . . . . . . . 80
3.3.1 Matrix Representation of Sets . . . . . . . . . . . . . . . . 81
3.3.2 Minhashing . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.3.3 Minhashing and Jaccard Similarity . . . . . . . . . . . . . 82
3.3.4 Minhash Signatures . . . . . . . . . . . . . . . . . . . . . 83
3.3.5 Computing Minhash Signatures . . . . . . . . . . . . . . . 83
3.3.6 Exercises for Section 3.3 . . . . . . . . . . . . . . . . . . . 86
3.4 Locality-Sensitive Hashing for Documents . . . . . . . . . . . . . 87
3.4.1 LSH for Minhash Signatures . . . . . . . . . . . . . . . . 88
3.4.2 Analysis of the Banding Technique . . . . . . . . . . . . . 89
3.4.3 Combining the Techniques . . . . . . . . . . . . . . . . . . 91
3.4.4 Exercises for Section 3.4 . . . . . . . . . . . . . . . . . . . 91
3.5 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.5.1 Definition of a Distance Measure . . . . . . . . . . . . . . 92
3.5.2 Euclidean Distances . . . . . . . . . . . . . . . . . . . . . 93
3.5.3 Jaccard Distance . . . . . . . . . . . . . . . . . . . . . . . 94
3.5.4 Cosine Distance . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5.5 Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5.6 Hamming Distance . . . . . . . . . . . . . . . . . . . . . . 96
3.5.7 Exercises for Section 3.5 . . . . . . . . . . . . . . . . . . . 97
3.6 The Theory of Locality-Sensitive Functions . . . . . . . . . . . . 99
3.6.1 Locality-Sensitive Functions . . . . . . . . . . . . . . . . . 99
3.6.2 Locality-Sensitive Families for Jaccard Distance . . . . . . 100
3.6.3 Amplifying a Locality-Sensitive Family . . . . . . . . . . . 101
3.6.4 Exercises for Section 3.6 . . . . . . . . . . . . . . . . . . . 103
3.7 LSH Families for Other Distance Measures . . . . . . . . . . . . . 104
3.7.1 LSH Families for Hamming Distance . . . . . . . . . . . . 104
3.7.2 Random Hyperplanes and the Cosine Distance . . . . . . 105
3.7.3 Sketches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.7.4 LSH Families for Euclidean Distance . . . . . . . . . . . . 107
3.7.5 More LSH Families for Euclidean Spaces . . . . . . . . . . 108
3.7.6 Exercises for Section 3.7 . . . . . . . . . . . . . . . . . . . 109
3.8 Applications of Locality-Sensitive Hashing . . . . . . . . . . . . . 110
3.8.1 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . 110
3.8.2 An Entity-Resolution Example . . . . . . . . . . . . . . . 111
3.8.3 Validating Record Matches . . . . . . . . . . . . . . . . . 112
3.8.4 Matching Fingerprints . . . . . . . . . . . . . . . . . . . . 113
3.8.5 A LSH Family for Fingerprint Matching . . . . . . . . . . 114
3.8.6 Similar News Articles . . . . . . . . . . . . . . . . . . . . 115
3.8.7 Exercises for Section 3.8 . . . . . . . . . . . . . . . . . . . 117
3.9 Methods for High Degrees of Similarity . . . . . . . . . . . . . . 118x CONTENTS
3.9.1 Finding Identical Items . . . . . . . . . . . . . . . . . . . 118
3.9.2 Representing Sets as Strings . . . . . . . . . . . . . . . . . 118
3.9.3 Length-Based Filtering . . . . . . . . . . . . . . . . . . . . 119
3.9.4 Prefix Indexing . . . . . . . . . . . . . . . . . . . . . . . . 119
3.9.5 Using Position Information . . . . . . . . . . . . . . . . . 121
3.9.6 Using Position and Length in Indexes . . . . . . . . . . . 122
3.9.7 Exercises for Section 3.9 . . . . . . . . . . . . . . . . . . . 125
3.10 Summary of Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 126
3.11 References for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . 128
4 Mining Data Streams 131
4.1 The Stream Data Model . . . . . . . . . . . . . . . . . . . . . . . 131
4.1.1 A Data-Stream-Management System . . . . . . . . . . . . 132
4.1.2 Examples of Stream Sources . . . . . . . . . . . . . . . . . 133
4.1.3 Stream Queries . . . . . . . . . . . . . . . . . . . . . . . . 134
4.1.4 Issues in Stream Processing . . . . . . . . . . . . . . . . . 135
4.2 Sampling Data in a Stream . . . . . . . . . . . . . . . . . . . . . 136
4.2.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . 136
4.2.2 Obtaining a Representative Sample . . . . . . . . . . . . . 137
4.2.3 The General Sampling Problem . . . . . . . . . . . . . . . 137
4.2.4 Varying the Sample Size . . . . . . . . . . . . . . . . . . . 138
4.2.5 Exercises for Section 4.2 . . . . . . . . . . . . . . . . . . . 138
4.3 Filtering Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.3.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . 139
4.3.2 The Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . 140
4.3.3 Analysis of Bloom Filtering . . . . . . . . . . . . . . . . . 140
4.3.4 Exercises for Section 4.3 . . . . . . . . . . . . . . . . . . . 141
4.4 Counting Distinct Elements in a Stream . . . . . . . . . . . . . . 142
4.4.1 The Count-Distinct Problem . . . . . . . . . . . . . . . . 142
4.4.2 The Flajolet-Martin Algorithm . . . . . . . . . . . . . . . 143
4.4.3 Combining Estimates . . . . . . . . . . . . . . . . . . . . 144
4.4.4 Space Requirements . . . . . . . . . . . . . . . . . . . . . 144
4.4.5 Exercises for Section 4.4 . . . . . . . . . . . . . . . . . . . 145
4.5 Estimating Moments . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.5.1 Definition of Moments . . . . . . . . . . . . . . . . . . . . 145
4.5.2 The Alon-Matias-Szegedy Algorithm for Second
Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.5.3 Why the Alon-Matias-Szegedy Algorithm Works . . . . . 147
4.5.4 Higher-Order Moments . . . . . . . . . . . . . . . . . . . 148
4.5.5 Dealing With Infinite Streams . . . . . . . . . . . . . . . . 148
4.5.6 Exercises for Section 4.5 . . . . . . . . . . . . . . . . . . . 149
4.6 Counting Ones in a Window . . . . . . . . . . . . . . . . . . . . . 150
4.6.1 The Cost of Exact Counts . . . . . . . . . . . . . . . . . . 151
4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm . . . . . . . 151
4.6.3 Storage Requirements for the DGIM Algorithm . . . . . . 153CONTENTS xi
4.6.4 Query Answering in the DGIM Algorithm . . . . . . . . . 153
4.6.5 Maintaining the DGIM Conditions . . . . . . . . . . . . . 154
4.6.6 Reducing the Error . . . . . . . . . . . . . . . . . . . . . . 155
4.6.7 Extensions to the Counting of Ones . . . . . . . . . . . . 156
4.6.8 Exercises for Section 4.6 . . . . . . . . . . . . . . . . . . . 157
4.7 Decaying Windows . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.7.1 The Problem of Most-Common Elements . . . . . . . . . 157
4.7.2 Definition of the Decaying Window . . . . . . . . . . . . . 158
4.7.3 Finding the Most Popular Elements . . . . . . . . . . . . 159
4.8 Summary of Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . 160
4.9 References for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . 161
5 Link Analysis 163
5.1 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.1.1 Early Search Engines and Term Spam . . . . . . . . . . . 164
5.1.2 Definition of PageRank . . . . . . . . . . . . . . . . . . . 165
5.1.3 Structure of the Web . . . . . . . . . . . . . . . . . . . . . 169
5.1.4 Avoiding Dead Ends . . . . . . . . . . . . . . . . . . . . . 170
5.1.5 Spider Traps and Taxation . . . . . . . . . . . . . . . . . 173
5.1.6 Using PageRank in a Search Engine . . . . . . . . . . . . 175
5.1.7 Exercises for Section 5.1 . . . . . . . . . . . . . . . . . . . 175
5.2 Efficient Computation of PageRank . . . . . . . . . . . . . . . . . 177
5.2.1 Representing Transition Matrices . . . . . . . . . . . . . . 178
5.2.2 PageRank Iteration Using MapReduce . . . . . . . . . . . 179
5.2.3 Use of Combiners to Consolidate the Result Vector . . . . 179
5.2.4 Representing Blocks of the Transition Matrix . . . . . . . 180
5.2.5 Other Efficient Approaches to PageRank Iteration . . . . 181
5.2.6 Exercises for Section 5.2 . . . . . . . . . . . . . . . . . . . 183
5.3 Topic-Sensitive PageRank . . . . . . . . . . . . . . . . . . . . . . 183
5.3.1 Motivation for Topic-Sensitive Page Rank . . . . . . . . . 183
5.3.2 Biased Random Walks . . . . . . . . . . . . . . . . . . . . 184
5.3.3 Using Topic-Sensitive PageRank . . . . . . . . . . . . . . 185
5.3.4 Inferring Topics from Words . . . . . . . . . . . . . . . . . 186
5.3.5 Exercises for Section 5.3 . . . . . . . . . . . . . . . . . . . 187
5.4 Link Spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.4.1 Architecture of a Spam Farm . . . . . . . . . . . . . . . . 187
5.4.2 Analysis of a Spam Farm . . . . . . . . . . . . . . . . . . 189

6 Frequent Itemsets 201
2
6.2.4 Tyranny of Counting Pairs . . . . . . . . . . . . . . . . . 213
6.2.5 The A-Priori Algorithm . . . . . . . . . . . . . . . . . . . 213


7 Clustering 241
249
7.2.4 Hierarchical Clustering in Non-Euclidean Spaces . . . . . 252


7.7 Summary of Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . 276
7.8 References for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . 280
8 Advertising on the Web 281
8
oblem . . . . . . 292

二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

关键词:MapReduce Big data massive Hadoop reduce 斯坦福 2014

mining of massive data.pdf
下载链接: https://bbs.pinggu.org/a-1699397.html

2.91 MB

需要: 10 个论坛币  [购买]

2014 斯坦福教授 big data hadoop mapred

已有 1 人评分经验 收起 理由
飞天玄舞6 + 20 精彩帖子

总评分: 经验 + 20   查看全部评分

本帖被以下文库推荐

沙发
水天一色DIY(未真实交易用户) 在职认证  发表于 2014-12-23 09:43:34
谢谢楼主分享!

藤椅
zzzppp(真实交易用户) 发表于 2015-1-10 17:30:09
thanks!

板凳
CITSA(未真实交易用户) 发表于 2015-2-6 08:51:24
thanks for sharring

报纸
rbhuang(真实交易用户) 发表于 2015-4-11 11:19:25
thank you so much!!

您需要登录后才可以回帖 登录 | 我要注册

本版微信群
加好友,备注cda
拉您进交流群
GMT+8, 2025-12-29 00:38