
A gentle introduction to random forests using R


oliyiyi posted on 2016-12-27 13:20:15

Introduction

In a previous post, I described how decision tree algorithms work and demonstrated their use via the rpart library in R. Decision trees work by splitting a dataset recursively. That is, subsets arising from a split are further split until a predetermined termination criterion is reached. At each step, a split is made based on the independent variable that results in the largest possible reduction in heterogeneity of the dependent variable.

(Note: readers unfamiliar with decision trees may want to read that post before proceeding.)
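Just to set the scene, here is a minimal sketch of fitting a single tree with rpart. The built-in iris dataset and the options shown are my own illustrative choices, not the example from the earlier post:

library(rpart)

# Grow a classification tree: Species is the dependent variable,
# the four measurements are the candidate split variables
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Inspect the split chosen at each node
print(tree_fit)

# Predict classes for the training data
pred <- predict(tree_fit, iris, type = "class")
table(pred, iris$Species)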

The main drawback of decision trees is that they are prone to overfitting. The reason for this is that trees, if grown deep, are able to fit all kinds of variations in the data, including noise. Although it is possible to address this partially by pruning, the result often remains less than satisfactory. This is because the algorithm makes a locally optimal choice at each split without any regard to whether that choice is the best one overall. A poor split made in the initial stages can thus doom the model, a problem that cannot be fixed by post-hoc pruning.

In this post I describe random forests, a tree-based algorithm that addresses the above shortcoming of decision trees. I’ll first describe the intuition behind the algorithm via an analogy and then do a demo using the R randomForest library.

Motivating random forests

One of the reasons for the popularity of decision trees is that they reflect the way humans make decisions: by weighing up options at each stage and choosing the best one available. The analogy is particularly useful because it also suggests how decision trees can be improved.

One of the lifelines in the game show Who Wants to Be a Millionaire is “Ask the Audience”, wherein a contestant can ask the audience to vote on the answer to a question. The rationale here is that the majority response from a large number of independent decision makers is more likely to yield a correct answer than one from a randomly chosen person. There are two factors at play here:

  • People have different experiences and will therefore draw upon different “data” to answer the question.
  • People have different knowledge bases and preferences and will therefore draw upon different “variables” to make their choices at each stage in their decision process.

Taking a cue from the above, it seems reasonable to build many decision trees using:

  • Different sets of training data.
  • Randomly selected subsets of variables at each split of every decision tree.

Predictions can then be made by taking the majority vote over all trees (for classification problems) or averaging results over all trees (for regression problems). This is essentially how the random forest algorithm works.

The net effect of the two strategies is to reduce overfitting by a) averaging over trees created from different samples of the dataset and b) decreasing the likelihood of a small set of strong predictors dominating the splits. The price paid is reduced interpretability as well as increased computational complexity. But then, there is no such thing as a free lunch.
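To make this concrete, here is a rough sketch of what building a forest with the randomForest library looks like. The iris dataset, the seed and the parameter values below are my own illustrative choices, not the post’s actual demo:

library(randomForest)

set.seed(42)  # for reproducible bootstrap samples

# ntree: number of trees, each grown on its own bootstrap sample
# mtry:  number of randomly selected variables tried at each split
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

# For classification, the forest's prediction is the majority vote over all trees
print(rf_fit)   # the printed summary includes an out-of-bag error estimate
pred <- predict(rf_fit, iris)
table(pred, iris$Species)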

The mechanics of the algorithm

Although we will not delve into the mathematical details of the algorithm, it is important to understand how the two points made above are implemented in it.

Bootstrap aggregating… and a (rather cool) error estimate

A key feature of the algorithm is the use of multiple datasets for training individual decision trees. This is done via a neat statistical trick called bootstrap aggregating (also called bagging).

Here’s how bagging works:
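Roughly speaking, each tree is trained on a sample drawn with replacement from the original dataset. The snippet below is my own illustration of that resampling step, again using iris as a stand-in; rows that are never drawn for a given tree are said to be “out-of-bag” for it, which is where the error estimate mentioned in the heading above comes from:

set.seed(42)
n <- nrow(iris)

# One bootstrap sample: n rows drawn with replacement,
# so some rows appear several times and others not at all
boot_idx  <- sample(seq_len(n), size = n, replace = TRUE)
boot_data <- iris[boot_idx, ]

# Rows never drawn are "out-of-bag" for this tree and can be used
# to estimate its error without a separate test set
oob_idx <- setdiff(seq_len(n), unique(boot_idx))
length(oob_idx)   # roughly a third of the rows, on average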

