Random forest overfitting: performs well on the training set but very poorly on the test set - Page 6

Post #51

jgchen1966 posted on 2016-12-14 10:36:43

Random Forests

LEO BREIMAN (the originator of random forests)
Statistics Department, University of California, Berkeley, CA 94720

Abstract. Random forests are a combination of tree predictors such that each tree depends on the values of a
random vector sampled independently and with the same distribution for all trees in the forest. The generalization
error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization
error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation
between them. Using a random selection of features to split each node yields error rates that compare
favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International
conference, ∗ ∗ ∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error,
strength, and correlation and these are used to show the response to increasing the number of features used in
the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to
regression.
Published in:
Machine Learning, 45, 5–32, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
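
For reference, the abstract's statement that the generalization error depends on strength and correlation is made precise in the paper by an upper bound. In the paper's notation, PE* is the generalization error of the forest, s > 0 is the strength of the individual tree classifiers, and ρ̄ is the mean correlation between them:

```latex
PE^{*} \le \frac{\bar{\rho}\,\left(1 - s^{2}\right)}{s^{2}}
```

Lower correlation between trees and higher strength of the individual trees both tighten the bound. It is an upper bound only, so it indicates the direction of the effect rather than the actual error rate.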

Dwell like a quail, feed like a fledgling, fly like a bird and leave no trace

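Post #52

jgchen1966 posted on 2016-12-14 10:38:50

大一仔 posted on 2016-12-14 10:16
The quality of a random forest depends on the quality of its decision trees and on how independent the trees are from one another, while decision trees usually place few demands on the data; see the decision tree ...

Please read LEO BREIMAN's original paper in the post above carefully, and please show him due respect, even though he has passed away!!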

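Post #53

大一仔 posted on 2016-12-14 11:13:48

jgchen1966 posted on 2016-12-14 10:38
Please read LEO BREIMAN's original paper in the post above carefully, and please show him due respect, even though he has passed away!!

My view: I read it carefully and still cannot find your argument in it; my conclusion came from reading his conclusions. The passage you highlighted only says that when the forest is built, every tree is grown from a sample drawn with replacement by the same procedure, so the distribution is naturally the same for every tree. For a single tree, however, whether the data are independent and identically distributed makes no difference. The emphasis should be on "for all trees" rather than "for all sample datas", shouldn't it?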
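
The sampling scheme described in this post, where every tree draws its own sample with replacement from the same training set, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `bootstrap_indices` and `in_bag_fraction` are hypothetical, and this is not code from Breiman's paper or from any random forest library.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement (one tree's sample)."""
    return [rng.randrange(n) for _ in range(n)]

def in_bag_fraction(n, n_trees, seed=0):
    """Average share of distinct training points that land in a tree's sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trees):
        sample = bootstrap_indices(n, rng)
        # Points drawn more than once count only once here.
        fractions.append(len(set(sample)) / n)
    return sum(fractions) / n_trees

if __name__ == "__main__":
    frac = in_bag_fraction(n=1000, n_trees=50)
    # For large n this is close to 1 - 1/e, i.e. roughly 0.632.
    print(f"average in-bag fraction: {frac:.3f}")
```

Every tree uses the same sampling procedure, so the samples are identically distributed across trees, but each sample is only a resampling of the one training set that was collected. The points left out of a given tree's sample are its out-of-bag points.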

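Post #54

jgchen1966 posted on 2016-12-14 11:14:18

The long English passage below is quoted from
Ingo Steinwart, Andreas Christmann:
Support Vector Machines
2008 Springer Science+Business Media, LLC
p. 3

It is the best statement of the general requirements that machine learning places on data samples.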

      As already mentioned informally, we assume in the machine learning approach
that we have collected a sequence D := ((x1, y1), . . . , (xn, yn)) of input/
output pairs that are used to “learn” a function f : X → Y such that
f(x) is a good approximation of the possible response y to an arbitrary x.
Obviously, in order to find such a function, it is necessary that the already
collected data D have something in common with the new and unseen data.
In the framework of statistical learning theory, this is guaranteed by assuming
that both past and future pairs (x, y) are independently generated by the same,
but of course unknown, probability distribution P on X × Y . In other words,
a pair (x, y) is generated in two steps. First, the input value x is generated
according to the marginal distribution PX. Second, the output value y is generated
according to the conditional probability P( · |x) on Y given the value
of x. Note that by letting x be generated by an unknown distribution PX,
we basically assume that we have no control over how the input values have
been and will be observed. Furthermore, assuming that the output value y to
a given x is stochastically generated by P( · |x) accommodates the fact that
in general the information contained in x may not be sufficient to determine
a single response in a deterministic manner. In particular, this assumption
includes the two extreme cases where either all input values determine an
(almost surely) unique output value or the input values are completely irrelevant
for the output value. Finally, assuming that the conditional probability
P( · |x) is unknown contributes to the fact that we assume that we do not
have a reasonable description of the relationship between the input and output
values. Note that this is a fundamental difference from parametric models,
in which the relationship between the inputs x and the outputs y is assumed
to follow some unknown function f ∈ F from a known, finite-dimensional set
of functions F.
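
The two-step generation of a pair (x, y) described in the quoted passage can be sketched as follows. This is a minimal illustration using only the Python standard library. The specific distributions (a uniform marginal for x and a normal conditional for y) and the function names are assumptions chosen purely for the example; in the passage both P_X and P( · |x) are unknown.

```python
import random

def draw_pair(rng):
    """Generate one (x, y) pair in two steps."""
    # Step 1: draw x from the marginal distribution P_X
    # (assumed here to be Uniform(0, 1), for illustration only).
    x = rng.random()
    # Step 2: draw y from the conditional distribution P(. | x)
    # (assumed here to be Normal(mean=2x, sd=0.1), for illustration only).
    y = rng.gauss(2.0 * x, 0.1)
    return x, y

def draw_dataset(n, seed):
    """Draw n pairs independently from the same distribution P."""
    rng = random.Random(seed)
    return [draw_pair(rng) for _ in range(n)]

def mean_x(data):
    """Sample mean of the inputs, a crude summary of P_X."""
    return sum(x for x, _ in data) / len(data)

if __name__ == "__main__":
    past = draw_dataset(200, seed=1)    # the collected data D
    future = draw_dataset(200, seed=2)  # new, unseen data from the same P
    print(f"mean x (past):   {mean_x(past):.3f}")
    print(f"mean x (future): {mean_x(future):.3f}")
```

Because both datasets come from the same P, their summaries agree up to sampling noise. If the future data were drawn from a different distribution, a model fitted on the past data would have no such guarantee.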


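Post #55

jgchen1966 posted on 2016-12-14 11:17:01

大一仔 posted on 2016-12-14 11:13
My view: I read it carefully and still cannot find your argument in it; my conclusion came from reading his conclusions. The passage you highlighted only says that when ...

Fine!!! Everyone is different, so of course what each person takes away from reading is different too!!!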