Original poster: Lisrelchen

[Case Study] Random Forests using Python

Lisrelchen posted on 2014-7-11 23:42:12

Random Forest

The algorithm to induce a random forest will create a bunch of random decision trees automatically. Since the trees are generated at random, most won't be all that meaningful to learning your classification/regression problem (maybe 99.9% of trees).

Each tree boils down to a chain of simple rules. For example: if an observation has a length of 45, blue eyes, and 2 legs, it's going to be classified as red.

Arboreal Voting

So what good are 10000 (probably) bad models? Well, it turns out that they really aren't that helpful. But what is helpful are the few really good decision trees that you also generated along with the bad ones.

When you make a prediction, the new observation gets pushed down each decision tree and assigned a predicted value/label. Once each of the trees in the forest has reported its predicted value/label, the predictions are tallied up and the mode vote of all trees is returned as the final prediction (for regression, the predictions are typically averaged instead).

Simply put, the 99.9% of trees that are irrelevant make predictions that are all over the map and cancel each other out. The predictions of the minority of trees that are good rise above that noise and yield a good prediction.
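To make the voting idea concrete, here is a minimal sketch that tallies the per-tree votes by hand and compares them with the forest's own prediction. It assumes scikit-learn and the iris data (my choice, not from the original post). Note that scikit-learn actually averages per-tree class probabilities rather than counting hard votes, so the two can occasionally disagree on near-ties.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One row of predicted class indices per tree: shape (n_trees, n_samples)
votes = np.array([tree.predict(X) for tree in forest.estimators_]).astype(int)

# The mode of each column is the hand-counted majority vote for that sample
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

# Fraction of samples where the hand-counted vote matches forest.predict
agreement = np.mean(majority == forest.predict(X))
print(f"hand-counted vote agrees with forest.predict on {agreement:.1%} of rows")
```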

Why should you use it?

It's Easy

Random forest is the Leatherman of learning methods. You can throw pretty much anything at it and it'll do a serviceable job. It does a particularly good job of estimating inferred transformations and, as a result, doesn't require as much tuning as something like an SVM (i.e. it's good for folks with tight deadlines).

An Example Transformation

Random forest is capable of learning without carefully crafted data transformations. Take the f(x) = log(x) function, for example.

Create some fake data and add a little noise.

import numpy as np

x = np.random.uniform(1, 100, 1000)
y = np.log(x) + np.random.normal(0, .3, 1000)

If we try to build a basic linear model to predict y using x, we wind up with a straight line that sort of bisects the log(x) function. Whereas if we use a random forest, it does a much better job of approximating the log(x) curve and we get something that looks much more like the true function.

You could argue that the random forest overfits the log(x) function a little bit. Either way, I think this does a nice job of illustrating how the random forest isn't bound by linear constraints.
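The plots from the original post are not reproduced here, so the sketch below puts rough numbers on the comparison instead. It regenerates the same kind of fake data, fits a linear model and a random forest, and measures each one's error against the noiseless log(x) curve. The seed, the 750/250 split, and `min_samples_leaf=10` are my assumptions, chosen only to keep the example reproducible and to rein in the overfitting mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 1000)
y = np.log(x) + rng.normal(0, 0.3, 1000)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Hold out the last 250 points for evaluation
train, test = slice(0, 750), slice(750, None)

linear = LinearRegression().fit(X[train], y[train])
forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=10, random_state=0
).fit(X[train], y[train])

# Score both models against the true (noise-free) function on the held-out x
truth = np.log(x[test])
linear_mse = mean_squared_error(truth, linear.predict(X[test]))
forest_mse = mean_squared_error(truth, forest.predict(X[test]))
print(f"linear MSE vs log(x): {linear_mse:.3f}")
print(f"forest MSE vs log(x): {forest_mse:.3f}")
```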

Uses

Variable Selection

One of the best use cases for random forest is feature selection. One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.

When a certain tree uses one variable and another doesn't, you can compare the value lost or gained from the inclusion/exclusion of that variable. The good random forest implementations are going to do that for you, so all you need to do is know which method or variable to look at.

In the following examples, we're trying to figure out which variables are most important for classifying a wine as being red or white.
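The wine data itself is not included in this post, so the following is only a sketch on a synthetic stand-in. It assumes scikit-learn, where a fitted forest exposes impurity-based scores through the `feature_importances_` attribute. The dataset, the feature names, and the choice of three informative columns are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: with shuffle=False the first 3 columns carry the signal
# and the remaining 5 are pure noise
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features from most to least important; the scores sum to 1
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With real data you would pass your own column names and read the ranking the same way. Impurity-based scores can favor high-cardinality features, so treat them as a first look rather than a verdict.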

Classification

Random forest is also great for classification. It can be used to make predictions for categories with multiple possible values and it can be calibrated to output probabilities as well. One thing you do need to watch out for is overfitting. Random forest can be prone to overfitting, especially when working with relatively small datasets. You should be suspicious if your model is making "too good" predictions on your test set.

One way to avoid overfitting is to only use really relevant features in your model. While this isn't always cut and dried, using a feature selection technique (like the one mentioned previously) can make it a lot easier.
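One quick sanity check for overfitting is to compare accuracy on the training rows with accuracy on held-out rows and with the out-of-bag estimate. The sketch below assumes scikit-learn and the iris data (three classes, so it also shows multi-class probabilities via `predict_proba`). The split size and seed are arbitrary choices of mine.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# oob_score=True scores each training row using only the trees that never saw it
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
oob_acc = clf.oob_score_
print(f"train={train_acc:.3f}  out-of-bag={oob_acc:.3f}  test={test_acc:.3f}")

# One row per test observation, one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
print(proba[:3].round(2))
```

A large gap between the training score and the other two is the warning sign to look for.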

Regression

Yep. It does regression too.

I've found that random forest, unlike some other algorithms, does really well learning on categorical variables or a mixture of categorical and real variables. Categorical variables with high cardinality (number of possible values) can be tricky, so having something like this in your back pocket can come in quite useful.
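Here is a small regression sketch on mixed data, assuming scikit-learn and pandas. The columns `size` and `color` and the way the target is built are made up for illustration. scikit-learn's forest needs numeric input, so the categorical column is one-hot encoded first with `pd.get_dummies`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "size": rng.uniform(0, 10, n),                    # real-valued feature
    "color": rng.choice(["red", "green", "blue"], n)  # categorical feature
})
# Hypothetical target: linear in size, shifted by a per-color offset, plus noise
offset = {"red": 0.0, "green": 5.0, "blue": -5.0}
df["target"] = 2 * df["size"] + df["color"].map(offset) + rng.normal(0, 0.5, n)

# One-hot encode the categorical column so every input is numeric
X = pd.get_dummies(df[["size", "color"]], columns=["color"], dtype=float)
y = df["target"]

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X.iloc[:450], y.iloc[:450])

# R^2 on the 150 rows the model never saw
r2 = reg.score(X.iloc[450:], y.iloc[450:])
print(f"held-out R^2: {r2:.3f}")
```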

A Short Python Example

Scikit-Learn is a great way to get started with random forest. The scikit-learn API is extremely consistent across algorithms, so you can horse race and switch between models very easily. A lot of times I start with something simple and then move to random forest.

One of the best features of the random forest implementation in scikit-learn is the n_jobs parameter. This will automatically parallelize fitting your random forest based on the number of cores you want to use. Here's a great presentation by scikit-learn contributor Olivier Grisel where he talks about training a random forest on a 20 node EC2 cluster.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load iris into a DataFrame and randomly flag roughly 75% of rows for training
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# pd.Factor no longer exists in pandas; pd.Categorical.from_codes replaces it
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

train, test = df[df['is_train']], df[~df['is_train']]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
# Integer codes follow the order of iris.target_names
y = train['species'].cat.codes
clf.fit(train[features], y)

# Map predicted codes back to species names and tabulate against the truth
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])


Looks pretty good!
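Because the estimator API is consistent, a "horse race" between a simple model and a random forest takes only a few lines. This is a sketch, assuming scikit-learn's `cross_val_score` with 5-fold cross-validation. Pairing the forest with logistic regression is my choice of "something simple", not the original author's.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit/predict interface, so swapping is trivial
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(
        n_estimators=100, n_jobs=2, random_state=0  # n_jobs=2: fit on 2 cores
    ),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```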

Final Thoughts

Random forests are remarkably easy to use given how advanced they are. As with any modeling, be wary of overfitting. If you're interested in getting started with random forest in R, check out the randomForest package.

Keywords: Case study, Forests, random, python, Forest