Original poster: oliyiyi

Data Science Live Book [Promotion with rewards]

oliyiyi posted on 2016-8-11 10:10:42

http://livebook.datascienceheroes.com/

oliyiyi posted on 2016-8-11 10:11:33
A book to learn data science, data analysis and machine learning, suitable for all ages!


What does it cover?

This book covers common aspects in predictive modeling:

  • A. Data Preparation / Data Profiling
  • B. Selecting best variables (dataviz)
  • C. Assessing model performance
  • D. Miscellaneous

It is heavily based on the funModeling package for the R language. Please install it before starting :)

install.packages("funModeling")

  • Model creation consumes around 10% of almost any predictive modeling project; funModeling will try to cover the remaining 90%.
  • It's not only about the function itself, but also about explaining how to interpret the results. This brings a deeper understanding of what is being done, giving you the freedom to use that knowledge in other situations regardless of the language.

Why a live book?

Hopefully this book never really ends; it will be updated periodically. The next planned chapter is a case study in predictive modeling. And you can contribute! The GitHub link is below.



oliyiyi posted on 2016-8-11 10:12:33
What is this about?

This chapter covers three types of plots that help identify which numeric variables are most correlated with a target variable.

  • Overview:
    • Analysis purpose: To identify whether the input variable is a good or bad predictor through visual analysis.
    • General purpose: To explain to a non-analyst the decision to include (or not) a variable in a model.

Constraint: The target variable must contain only 2 values. If it has NA values, they will be removed.

oliyiyi posted on 2016-8-11 10:25:49

The last two plots share the same data source, showing the distribution of has_heart_disease by gender. The one on the left shows percentages, while the one on the right shows absolute counts.

How to extract conclusions from the plots? (Short version)

The gender variable seems to be a good predictor, since the likelihood of having heart disease differs between the female and male groups. It gives an order to the data.
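
For readers who want to reproduce the two plots discussed above, a minimal sketch follows. It assumes the funModeling package is installed and uses the heart_disease example data set that ships with it. The argument names input and target are the ones I know from recent releases; as far as I know, older releases called them str_input and str_target, so treat the exact call as an assumption and check ?cross_plot for your version.

```r
# Sketch only: the variable names below come from the thread; the argument
# names of cross_plot are an assumption that depends on the package version.
input_var  <- "gender"
target_var <- "has_heart_disease"

if (requireNamespace("funModeling", quietly = TRUE)) {
  library(funModeling)
  data(heart_disease)  # example data set shipped with funModeling

  # Draws the percentage plot and the count plot of the input vs. the target.
  cross_plot(heart_disease, input = input_var, target = target_var)
}
```

The guard around the call simply skips the plot when the package is not available, so the snippet does not fail on a fresh R installation.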


oliyiyi posted on 2016-8-11 10:26:27
From 1st plot (%):
  • The likelihood of having heart disease is 55.3% for males and 25.8% for females.
  • The heart disease rate for males is roughly double the rate for females (55.3% vs. 25.8%).

oliyiyi posted on 2016-8-11 10:27:40
From 2nd plot (count):
  • There are a total of 97 females:
    • 25 of them have heart disease (25/97=25.8%, which is the rate shown in the 1st plot).
    • The remaining 72 do not have heart disease (74.2%).
  • There are a total of 206 males:
    • 114 of them have heart disease (55.3%).
    • The remaining 92 do not have heart disease (44.7%).
  • Total cases: Summing the values of the four bars: 25+72+114+92=303.
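
The arithmetic linking the count plot to the percentage plot can be checked in a few lines of base R. This is only a sketch: the four counts are copied from the list above, and the object names (counts, rates, rate_ratio, total_cases) are made up for illustration.

```r
# Counts read off the count plot: rows are gender, columns are the target.
counts <- matrix(
  c(25, 72,
    114, 92),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  has_heart_disease = c("yes", "no"))
)

# Row-wise shares in percent: this is what the percentage plot displays.
rates <- round(100 * prop.table(counts, margin = 1), 1)
print(rates)

# How many times higher the male rate is compared with the female rate.
rate_ratio <- rates["male", "yes"] / rates["female", "yes"]
print(round(rate_ratio, 2))

# Sum of the four bars.
total_cases <- sum(counts)
print(total_cases)
```

The row shares come out as 25.8% / 74.2% for females and 55.3% / 44.7% for males, the ratio of the two "yes" rates is about 2.14, and the total is 303, matching the figures quoted in the thread.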


Note: What would have happened if, instead of rates of 25.8% vs. 55.3% (female vs. male), they had been more similar, like 30.2% vs. 30.6%? In that case the gender variable would have been much less relevant, since it wouldn't separate the has_heart_disease event.


oliyiyi posted on 2016-8-11 10:29:25
Example 2: Crossing with numerical variables

Numerical variables should be binned in order to plot them with a histogram; otherwise the plot shows no useful information, as can be seen here:

Equal frequency binning

The package includes a function (inherited from the Hmisc package), equal_freq, which returns the bins/buckets based on the equal frequency criterion, that is, it has (or tries to have) the same number of rows per bin.

For numerical variables, cross_plot has auto_binning=T by default, which automatically calls the equal_freq function with n_bins=10 (or the closest number).
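
To make the equal frequency idea concrete, here is a rough base R sketch. It is not the package's implementation of equal_freq, just the same principle built from quantile() and cut(); the age values are simulated and the choice of 5 bins is arbitrary, both purely for illustration.

```r
set.seed(42)
# Made-up ages, for illustration only (not the book's data).
age <- sample(29:77, size = 300, replace = TRUE)

n_bins <- 5  # arbitrary choice for this sketch

# Cut points at equally spaced quantiles, so each bin gets a similar row count.
# unique() drops repeated cut points, which can occur when values are tied.
breaks <- unique(quantile(age, probs = seq(0, 1, length.out = n_bins + 1)))

# include.lowest = TRUE keeps the minimum value inside the first bin.
age_binned <- cut(age, breaks = breaks, include.lowest = TRUE)

bin_sizes <- table(age_binned)
print(bin_sizes)
```

Because of ties, the bins hold similar but not necessarily identical numbers of rows, which is presumably what "or tries to" refers to above. With funModeling loaded, the analogous call should be equal_freq(age, n_bins = 5).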

