Thread starter: yaoqsm321

[Q&A] Random forest overfitting: very good performance on the training set, very poor performance on the test set

51
jgchen1966, posted 2016-12-14 10:36:43
Random Forests


      LEO BREIMAN  (the originator of random forests)
Statistics Department, University of California, Berkeley, CA 94720

Abstract. Random forests are a combination of tree predictors such that each tree depends on the values of a
random vector sampled independently and with the same distribution for all trees in the forest. The generalization
error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization
error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation
between them. Using a random selection of features to split each node yields error rates that compare
favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International
conference, ∗ ∗ ∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error,
strength, and correlation and these are used to show the response to increasing the number of features used in
the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to
regression.
Published in:
    Machine Learning, 45, 5–32, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands
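The mechanism the abstract describes, each tree trained on its own bootstrap sample and restricted to a random feature subset at each split, can be sketched minimally. This is not Breiman's reference implementation; the dataset, feature names, and sizes below are invented for illustration.

```python
import random

random.seed(0)

data = list(range(100))            # stand-in for 100 training rows
features = ["f1", "f2", "f3", "f4", "f5"]
n_trees = 3
mtry = 2                           # features considered per split

for t in range(n_trees):
    # Each tree gets an independent bootstrap sample; because every tree
    # resamples from the same training set, the per-tree sampling
    # distribution is identical ("the same distribution for all trees").
    bootstrap = [random.choice(data) for _ in data]
    # A random subset of features a tree would consider at one split.
    subset = random.sample(features, mtry)
    # Rows never drawn are "out-of-bag" and feed the internal error,
    # strength, and correlation estimates the abstract mentions.
    oob = set(data) - set(bootstrap)
    print(t, len(bootstrap), sorted(subset), len(oob))
```

Randomizing both the rows (bootstrap) and the candidate features decorrelates the trees, which is the lever the abstract says controls generalization error.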

鹑居鷇食,鸟行无彰 (user signature: "Dwell like a quail, eat like a fledgling; birds pass without a trace")

52
jgchen1966, posted 2016-12-14 10:38:50
Quoting 大一仔 (2016-12-14 10:16):
How good a random forest is depends on how good the individual decision trees are and how independent they are of one another, and decision trees usually place few demands on the data; see decision trees' ...
Please carefully read LEO BREIMAN's original paper quoted above, and show him due respect, even though he has passed away!!

53
大一仔, posted 2016-12-14 11:13:48
Quoting jgchen1966 (2016-12-14 10:38):
Please carefully read LEO BREIMAN's original paper quoted above, and show him due respect, even though he has passed away!!
My take: after reading it carefully I still don't see your point; my conclusion comes straight from his conclusions. The passage you highlighted only says that when the forest is built, every tree uses the same sampling-with-replacement procedure, so the samples naturally follow the same distribution for every tree. For any single tree, though, whether the data are independently and identically distributed makes no difference. The emphasis should be on "for all trees", not "for all sample data", no?

54
jgchen1966, posted 2016-12-14 11:14:18
The long English passage below is quoted from:
      Ingo Steinwart, Andreas Christmann:
                    Support Vector Machines
                                2008 Springer Science+Business Media, LLC
         p. 3

  It best illustrates the general requirement machine learning places on data samples.

        As already mentioned informally, we assume in the machine learning approach
that we have collected a sequence D := ((x1, y1), . . . , (xn, yn)) of input/
output pairs that are used to “learn” a function f : X → Y such that
f(x) is a good approximation of the possible response y to an arbitrary x.
Obviously, in order to find such a function, it is necessary that the already
collected data D have something in common with the new and unseen data.
In the framework of statistical learning theory, this is guaranteed by assuming
that both past and future pairs (x, y) are independently generated by the same,
but of course unknown, probability distribution P on X × Y . In other words,
a pair (x, y) is generated in two steps. First, the input value x is generated
according to the marginal distribution PX. Second, the output value y is generated
according to the conditional probability P( · |x) on Y given the value
of x. Note that by letting x be generated by an unknown distribution PX,
we basically assume that we have no control over how the input values have
been and will be observed. Furthermore, assuming that the output value y to
a given x is stochastically generated by P( · |x) accommodates the fact that
in general the information contained in x may not be sufficient to determine
a single response in a deterministic manner. In particular, this assumption
includes the two extreme cases where either all input values determine an
(almost surely) unique output value or the input values are completely irrelevant
for the output value. Finally, assuming that the conditional probability
P( · |x) is unknown contributes to the fact that we assume that we do not
have a reasonable description of the relationship between the input and output
values. Note that this is a fundamental difference from parametric models,
in which the relationship between the inputs x and the outputs y is assumed
to follow some unknown function f ∈ F from a known, finite-dimensional set
of functions F.
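The two-step generating process the passage describes, first x from the marginal P_X, then y from the conditional P(·|x), can be sketched directly. The specific distributions below (uniform P_X, and P(y=1|x) = x) are assumptions chosen for illustration, not anything from the book.

```python
import random

random.seed(1)

def sample_pair():
    """Draw one (x, y) pair by the two-step process described above."""
    x = random.uniform(0.0, 1.0)       # step 1: x ~ P_X (assumed uniform)
    p1 = x                             # step 2: P(y = 1 | x) = x (assumed)
    y = 1 if random.random() < p1 else 0
    return x, y

# Past and future pairs are i.i.d. draws from the same unknown P on X x Y;
# that shared distribution is what makes learning from D meaningful.
D = [sample_pair() for _ in range(10000)]
frac_ones = sum(y for _, y in D) / len(D)
print(round(frac_ones, 2))
```

Note how the same x can yield different y values across draws: the conditional P(·|x) accommodates exactly the case the passage mentions, where x does not determine y deterministically.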

55
jgchen1966, posted 2016-12-14 11:17:01
Quoting 大一仔 (2016-12-14 11:13):
My take: after reading it carefully I still don't see your point; my conclusion comes straight from his conclusions. The passage you highlighted only says that when ...
Fine!!! People are all different, so of course reading the same thing yields different results too!!!

56
yaoqsm321, posted 2016-12-14 11:25:12
Quoting jameschin007 (2016-12-13 21:12):
Random forests indeed do not handle imbalanced data well. Other algorithms such as GBM and SVM can. The main approaches are oversampling and undersampling ...
I copied the 788 minority-class samples two more times and added them to the original data, so the dataset grew to 6689 records in total, with 4324 ones and 2364 zeros. I then rebuilt the model with my earlier random forest code. The results look very good, but I have no confidence in them and don't know whether this approach is valid.

prediction   0   1
         0 464  19
         1  12 859
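The resampling arithmetic and the confusion matrix above can be checked numerically. The leakage caveat at the end is a general point about duplicating rows before splitting, not a claim about this poster's exact pipeline.

```python
# The 788 minority rows appear three times after adding two copies,
# matching the reported class-0 count of 2364.
minority = 788
oversampled_minority = minority * 3
print(oversampled_minority)

# Confusion matrix as printed above (rows = prediction, columns = reference):
#            ref 0  ref 1
# pred 0       464     19
# pred 1        12    859
tn, fn = 464, 19
fp, tp = 12, 859
total = tn + fn + fp + tp
accuracy = (tn + tp) / total
print(total, round(accuracy, 4))

# Caveat: if minority rows are duplicated *before* the train/test split,
# identical copies of one row can land in both sets, so test accuracy is
# inflated. A safer order is: split first, then oversample only the
# training fold.
```

This is likely the source of the poster's unease: the strong matrix may partly reflect duplicated rows shared between training and test sets rather than genuine generalization.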

57
yaoqsm321, posted 2016-12-14 11:25:50
Quoting 大一仔 (2016-12-13 16:56):
My view: if the data are all i.i.d., a traditional statistical model such as a discriminant model can already solve this problem well; there is no need to bring in machine ...
I copied the 788 minority-class samples two more times and added them to the original data, so the dataset grew to 6689 records in total, with 4324 ones and 2364 zeros. I then rebuilt the model with my earlier random forest code. The results look very good, but I have no confidence in them and don't know whether this approach is valid.

prediction   0   1
         0 464  19
         1  12 859

