Original poster: Bookseek

[Practical applications] Practical Data Science with R, Nina Zumel and John Mount, second edition, 2014

PART 1 INTRODUCTION TO DATA SCIENCE

1 The data science process

1.1 The roles in a data science project

Project roles

1.2 Stages of a data science project

Defining the goal ■ Data collection and management ■ Modeling ■ Model evaluation and critique ■ Presentation and documentation ■ Model deployment and maintenance

1.3 Setting expectations

Determining lower and upper bounds on model performance

1.4 Summary

2 Loading data into R

2.1 Working with data from files

Working with well-structured data from files or URLs ■ Using R on less-structured data

2.2 Working with relational databases

A production-size example ■ Loading data from a database into R ■ Working with the PUMS data

2.3 Summary

3 Exploring data

3.1 Using summary statistics to spot problems

Typical problems revealed by data summaries

3.2 Spotting problems using graphics and visualization

Visually checking distributions for a single variable ■ Visually checking relationships between two variables

3.3 Summary

4 Managing data

4.1 Cleaning data

Treating missing values (NAs) ■ Data transformations

4.2 Sampling for modeling and validation

Test and training splits ■ Creating a sample group column ■ Record grouping ■ Data provenance

4.3 Summary

PART 2 MODELING METHODS

5 Choosing and evaluating models

5.1 Mapping problems to machine learning tasks

Solving classification problems ■ Solving scoring problems ■ Working without known targets ■ Problem-to-method mapping

5.2 Evaluating models

Evaluating classification models ■ Evaluating scoring models ■ Evaluating probability models ■ Evaluating ranking models ■ Evaluating clustering models

5.3 Validating models

Identifying common model problems ■ Quantifying model soundness ■ Ensuring model quality

5.4 Summary

6 Memorization methods

6.1 KDD and KDD Cup 2009

Getting started with KDD Cup 2009 data

6.2 Building single-variable models

Using categorical features ■ Using numeric features ■ Using cross-validation to estimate effects of overfitting

6.3 Building models using many variables

Variable selection ■ Using decision trees ■ Using nearest neighbor methods ■ Using Naive Bayes

6.4 Summary

7 Linear and logistic regression

7.1 Using linear regression

Understanding linear regression ■ Building a linear regression model ■ Making predictions ■ Finding relations and extracting advice ■ Reading the model summary and characterizing coefficient quality ■ Linear regression takeaways

7.2 Using logistic regression

Understanding logistic regression ■ Building a logistic regression model ■ Making predictions ■ Finding relations and extracting advice from logistic models ■ Reading the model summary and characterizing coefficients ■ Logistic regression takeaways

7.3 Summary

8 Unsupervised methods

8.1 Cluster analysis

Distances ■ Preparing the data ■ Hierarchical clustering with hclust() ■ The k-means algorithm ■ Assigning new points to clusters ■ Clustering takeaways

8.2 Association rules

Overview of association rules ■ The example problem ■ Mining association rules with the arules package ■ Association rule takeaways

8.3 Summary

9 Exploring advanced methods

9.1 Using bagging and random forests to reduce training variance

Using bagging to improve prediction ■ Using random forests to further improve prediction ■ Bagging and random forest takeaways

9.2 Using generalized additive models (GAMs) to learn nonmonotone relationships

Understanding GAMs ■ A one-dimensional regression example ■ Extracting the nonlinear relationships ■ Using GAM on actual data ■ Using GAM for logistic regression ■ GAM takeaways

9.3 Using kernel methods to increase data separation

Understanding kernel functions ■ Using an explicit kernel on a problem ■ Kernel takeaways

9.4 Using SVMs to model complicated decision boundaries

Understanding support vector machines ■ Trying an SVM on artificial example data ■ Using SVMs on real data ■ Support vector machine takeaways

9.5 Summary

PART 3 DELIVERING RESULTS

10 Documentation and deployment

10.1 The buzz dataset

10.2 Using knitr to produce milestone documentation

What is knitr? ■ knitr technical details ■ Using knitr to document the buzz data

10.3 Using comments and version control for running documentation

Writing effective comments ■ Using version control to record history ■ Using version control to explore your project ■ Using version control to share work

10.4 Deploying models

Deploying models as R HTTP services ■ Deploying models by export ■ What to take away

10.5 Summary

11 Producing effective presentations

11.1 Presenting your results to the project sponsor

Summarizing the project’s goals ■ Stating the project’s results ■ Filling in the details ■ Making recommendations and discussing future work ■ Project sponsor presentation takeaways

11.2 Presenting your model to end users

Summarizing the project’s goals ■ Showing how the model fits the users’ workflow ■ Showing how to use the model ■ End user presentation takeaways

11.3 Presenting your work to other data scientists

Introducing the problem ■ Discussing related work ■ Discussing your approach ■ Discussing results and future work ■ Peer presentation takeaways

11.4 Summary

Appendix A Working with R and other tools


Practical Data Science with R.pdf

20.26 MB

Requires: 10 forum coins

First reply
shang00122 posted on 2016-12-25 19:10:11
thank you
