Original poster: Bookseek
[Practical Application] Practical Data Science with R, Nina Zumel and John Mount, second edition, 2014

Bookseek posted on 2016-6-27 10:19:49


PART 1 INTRODUCTION TO DATA SCIENCE

1 The data science process

1.1 The roles in a data science project

Project roles

1.2 Stages of a data science project

Defining the goal ■ Data collection and management ■ Modeling ■ Model evaluation and critique ■ Presentation and documentation ■ Model deployment and maintenance

1.3 Setting expectations

Determining lower and upper bounds on model performance

1.4 Summary

2 Loading data into R

2.1 Working with data from files

Working with well-structured data from files or URLs ■ Using R on less-structured data

2.2 Working with relational databases

A production-size example ■ Loading data from a database into R ■ Working with the PUMS data

2.3 Summary

3 Exploring data

3.1 Using summary statistics to spot problems

Typical problems revealed by data summaries

3.2 Spotting problems using graphics and visualization

Visually checking distributions for a single variable ■ Visually checking relationships between two variables

3.3 Summary

4 Managing data

4.1 Cleaning data

Treating missing values (NAs) ■ Data transformations

4.2 Sampling for modeling and validation

Test and training splits ■ Creating a sample group column ■ Record grouping ■ Data provenance

4.3 Summary

PART 2 MODELING METHODS

5 Choosing and evaluating models

5.1 Mapping problems to machine learning tasks

Solving classification problems ■ Solving scoring problems ■ Working without known targets ■ Problem-to-method mapping

5.2 Evaluating models

Evaluating classification models ■ Evaluating scoring models ■ Evaluating probability models ■ Evaluating ranking models ■ Evaluating clustering models

5.3 Validating models

Identifying common model problems ■ Quantifying model soundness ■ Ensuring model quality

5.4 Summary

6 Memorization methods

6.1 KDD and KDD Cup 2009

Getting started with KDD Cup 2009 data

6.2 Building single-variable models

Using categorical features ■ Using numeric features ■ Using cross-validation to estimate effects of overfitting

6.3 Building models using many variables

Variable selection ■ Using decision trees ■ Using nearest neighbor methods ■ Using Naive Bayes

6.4 Summary

7 Linear and logistic regression

7.1 Using linear regression

Understanding linear regression ■ Building a linear regression model ■ Making predictions ■ Finding relations and extracting advice ■ Reading the model summary and characterizing coefficient quality ■ Linear regression takeaways

7.2 Using logistic regression

Understanding logistic regression ■ Building a logistic regression model ■ Making predictions ■ Finding relations and extracting advice from logistic models ■ Reading the model summary and characterizing coefficients ■ Logistic regression takeaways

7.3 Summary

8 Unsupervised methods

8.1 Cluster analysis

Distances ■ Preparing the data ■ Hierarchical clustering with hclust() ■ The k-means algorithm ■ Assigning new points to clusters ■ Clustering takeaways

8.2 Association rules

Overview of association rules ■ The example problem ■ Mining association rules with the arules package ■ Association rule takeaways

8.3 Summary

9 Exploring advanced methods

9.1 Using bagging and random forests to reduce training variance

Using bagging to improve prediction ■ Using random forests to further improve prediction ■ Bagging and random forest takeaways

9.2 Using generalized additive models (GAMs) to learn nonmonotone relationships

Understanding GAMs ■ A one-dimensional regression example ■ Extracting the nonlinear relationships ■ Using GAM on actual data ■ Using GAM for logistic regression ■ GAM takeaways

9.3 Using kernel methods to increase data separation

Understanding kernel functions ■ Using an explicit kernel on a problem ■ Kernel takeaways

9.4 Using SVMs to model complicated decision boundaries

Understanding support vector machines ■ Trying an SVM on artificial example data ■ Using SVMs on real data ■ Support vector machine takeaways

9.5 Summary

PART 3 DELIVERING RESULTS

10 Documentation and deployment

10.1 The buzz dataset

10.2 Using knitr to produce milestone documentation

What is knitr? ■ knitr technical details ■ Using knitr to document the buzz data

10.3 Using comments and version control for running documentation

Writing effective comments ■ Using version control to record history ■ Using version control to explore your project ■ Using version control to share work

10.4 Deploying models

Deploying models as R HTTP services ■ Deploying models by export ■ What to take away

10.5 Summary

11 Producing effective presentations

11.1 Presenting your results to the project sponsor

Summarizing the project’s goals ■ Stating the project’s results ■ Filling in the details ■ Making recommendations and discussing future work ■ Project sponsor presentation takeaways

11.2 Presenting your model to end users

Summarizing the project’s goals ■ Showing how the model fits the users’ workflow ■ Showing how to use the model ■ End user presentation takeaways

11.3 Presenting your work to other data scientists

Introducing the problem ■ Discussing related work ■ Discussing your approach ■ Discussing results and future work ■ Peer presentation takeaways

11.4 Summary

Appendix A Working with R and other tools


shang00122 posted on 2016-12-25 19:10:11

thank you
