Data Science Live Book - 经管之家

Original post

oliyiyi posted on 2016-8-11 10:10:42

http://livebook.datascienceheroes.com/

Users short on forum coins can visit the rewarded-reply collection thread:
https://bbs.pinggu.org/thread-3990750-1-1.html

Floor 2

oliyiyi posted on 2016-8-11 10:11:33

A book to learn data science, data analysis and machine learning, suitable for all ages!

What does it cover?

This book covers common aspects in predictive modeling:

A. Data Preparation / Data Profiling
B. Selecting best variables (dataviz)
C. Assessing model performance
D. Miscellaneous

It is heavily based on the funModeling package from the R language. Please install it before starting :)

install.packages("funModeling")

Model creation consumes around 10% of almost any predictive modeling project; funModeling tries to cover the remaining 90%.
The book covers not only the functions themselves, but also how to interpret their results. This brings a deeper understanding of what is being done, which makes it easier to apply that knowledge in other situations, regardless of the language.

Why a live book?

Hopefully this book barely has an end; it will be updated periodically. The next planned chapter is a case study in predictive modeling. And you can contribute! The GitHub link is below.

Floor 3

oliyiyi posted on 2016-8-11 10:12:33

What is this about?

This chapter covers three types of plots that help show which numeric variables are most correlated with a target variable.

Overview:
- Analysis purpose: To identify whether the input variable is a good or bad predictor through visual analysis.
- General purpose: To explain to a non-analyst the decision to include, or not include, a variable in a model.

Constraint: The target variable must contain only 2 values. If it has NA values, they will be removed.
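
The constraint above can be checked before plotting. The following is a minimal base R sketch, not funModeling code: the helper name prepare_binary_target and the toy data frame are invented for illustration. It drops the rows whose target is NA and stops if anything other than two distinct values remains.

```r
# Hypothetical helper (not part of funModeling): keep rows with a
# non-missing target and verify that exactly two distinct values remain.
prepare_binary_target <- function(data, target) {
  # Drop rows whose target value is NA
  cleaned <- data[!is.na(data[[target]]), , drop = FALSE]
  n_values <- length(unique(cleaned[[target]]))
  if (n_values != 2) {
    stop(sprintf("Target '%s' has %d distinct values; expected 2.",
                 target, n_values))
  }
  cleaned
}

# Toy data, invented for illustration only
toy <- data.frame(
  gender = c("male", "female", "male", "female", "male"),
  has_heart_disease = c("yes", "no", NA, "no", "yes"),
  stringsAsFactors = FALSE
)

cleaned <- prepare_binary_target(toy, "has_heart_disease")
nrow(cleaned)  # 4 rows remain after dropping the row with an NA target
```

Whether the package performs the check in this way is not stated in the text; the sketch only mirrors the described behavior.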

Floor 4

oliyiyi posted on 2016-8-11 10:25:15

Floor 5

oliyiyi posted on 2016-8-11 10:25:49

The last two plots have the same data source, showing the distribution of has_heart_disease in terms of gender. The one on the left shows percentage values, while the one on the right shows absolute values.

How to extract conclusions from the plots? (Short version)

The gender variable seems to be a good predictor, since the likelihood of having heart disease differs between the female and male groups. It gives an order to the data.

Floor 6

oliyiyi posted on 2016-8-11 10:26:27

From 1st plot (%):

The likelihood of having heart disease is 55.3% for males, while for females it is 25.8%.
The heart disease rate for males is roughly double the rate for females (55.3% vs. 25.8%, respectively).

Floor 7

oliyiyi posted on 2016-8-11 10:27:40

From 2nd plot (count):

There are a total of 97 females:
- 25 of them have heart disease (25/97=25.8%, which is the ratio shown in the 1st plot).
- the remaining 72 do not have heart disease (74.2%)
There are a total of 206 males:
- 114 of them have heart disease (55.3%)
- the remaining 92 do not have heart disease (44.7%)
Total cases: Summing the values of the four bars: 25+72+114+92=303.

Note: What would have happened if, instead of rates of 25.8% vs. 55.3% (female vs. male), they had been more similar, like 30.2% vs. 30.6%? In that case the gender variable would have been much less relevant, since it would not separate the has_heart_disease event.
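
The percentages and totals quoted above can be reproduced from the four bar counts. This is a small base R sketch using only the numbers given in the text; the variable names are chosen for illustration, and the "gap" between rates is an informal way to express the note's point, not a measure defined by the book.

```r
# Counts read from the second plot in the text
counts <- matrix(
  c(25, 72,
    114, 92),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  has_heart_disease = c("yes", "no"))
)

# Row-wise percentages: share of each gender with / without heart disease
rates <- round(100 * prop.table(counts, margin = 1), 1)
print(rates)

total_cases <- sum(counts)  # 25 + 72 + 114 + 92 = 303

# How many times higher the male rate is than the female rate (about 2.14)
rate_ratio <- rates["male", "yes"] / rates["female", "yes"]

# Gap in percentage points: observed rates vs. the note's hypothetical rates
gap_observed <- rates["male", "yes"] - rates["female", "yes"]  # about 29.5
gap_hypothetical <- 30.6 - 30.2                                # about 0.4
```

A gap of about 29.5 points separates the two groups clearly, while a gap of about 0.4 points would leave gender with almost no ability to separate the has_heart_disease event.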

Floor 8

oliyiyi posted on 2016-8-11 10:29:25

Example 2: Crossing with numerical variables

Numerical variables should be binned before plotting them with a histogram; otherwise the plot conveys no useful information.

Equal frequency binning

The package includes a function, equal_freq (inherited from the Hmisc package), which returns the bins/buckets based on the equal frequency criterion: having, or trying to have, the same number of rows per bin.

For numerical variables, cross_plot defaults to auto_binning=T, which automatically calls the equal_freq function with n_bins=10 (or the closest number).
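
To make the equal frequency idea concrete, here is a minimal base R sketch. It is an approximation written for illustration, not the package's equal_freq implementation: the helper name equal_freq_sketch is hypothetical, and it assumes cut points taken at evenly spaced quantiles.

```r
# Hypothetical helper approximating equal-frequency binning in base R.
# It is NOT the funModeling / Hmisc implementation of equal_freq.
equal_freq_sketch <- function(x, n_bins = 10) {
  # Cut points at evenly spaced quantiles; unique() collapses tied cut
  # points, so fewer than n_bins bins may come back for heavily tied data
  probs <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# Simple input, invented for illustration: the integers 1 to 100
x <- 1:100
bins <- equal_freq_sketch(x, n_bins = 10)
table(bins)  # 10 bins with 10 rows each
```

With real data that contains many repeated values, the bins will only be approximately equal in size, which matches the "or the closest number" wording above.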

Floor 9

oliyiyi posted on 2016-8-11 10:29:59

Floor 10

oliyiyi posted on 2016-8-11 10:45:29
