Why Do Deep Learning Networks Scale?

Posted by oliyiyi on 2016-8-12 08:00:15

By Charles H Martin, Calculation Consulting.

The naive interpretation is that these are highly non-convex methods, and therefore should not scale at all. I will address this issue since this is critical to understanding scaling.

(This is different from making things go fast or making solvers converge better, which I don’t address here.)

A common theme today is that the energy functions on very high-dimensional spaces have no low-lying local minima, except those very close to the ground state.

This has been recently popularized by LeCun

The Loss Surfaces of Multilayer Networks (2015)

https://arxiv.org/pdf/1412.0233v...

and by Bengio and others, using methods like Hessian-free solvers.

This is justified by computations at Google and Stanford:

Qualitatively Characterizing Neural Network Optimization Problems

http://arxiv.org/pdf/1412.6544v4...

And indeed, it appears to be the case for many systems.

Theoretically, this might be justified using some recent analytic results on the p-spin spherical spin glass.

Note that this general idea has been suspected for a very long time, at least 15 years. See, for example, section 6.4.4 of the book Pattern Classification, 2nd Edition (2000)

See Google books: Pattern Classification

So it has always been suspected that deep networks would scale. What is puzzling is that deep neural nets do not overtrain.

An old, common belief is simply that deep nets find a local minimum, and that any local minimum will do. Furthermore, if we run a solver for too long, we would reach the ground state, and this is a state of overtraining. So we have no local minima, and we want to apply early stopping. Add some regularization to avoid overfitting, and this is why deep nets scale.
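To make the early-stopping step in this standard picture concrete, here is a minimal sketch in plain Python. It assumes a patience-based rule on a validation loss, which is one common variant and not something the text above specifies; the function name `early_stopping_epoch` and the `patience` and `min_delta` parameters are hypothetical and used only for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Hypothetical helper: training stops once the validation loss has not
    improved by more than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    # If the rule never triggers, we stop at the last available epoch.
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember where it happened.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch


# Made-up validation curve: improves, then starts to overtrain.
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.2]
best, stop = early_stopping_epoch(losses, patience=3)
print(best, stop)
```

In practice one would keep the weights saved at `best_epoch` rather than the weights at `stop_epoch`.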

This is the standard picture; it has been around 15–20 years.

This is somewhat unsatisfying to me, mainly because I find that these things are really hard to build and train without a lot of human inspection.

A neural network, like a deep MLP, will readily overtrain a small data set. But on larger data sets, it is chaotically unstable.

Indeed, it is pretty easy to make a network totally diverge (test and training accuracy) by simply running for a very long time. Other methods, like SVMs, may perform poorly, but they rarely behave like this (except in cases when, say, the variance explodes).

Here is an example of an exploding network, using a naive 5-layer MLP, trained with TensorFlow, using the infiMNIST data set.

CalculatedContent/big_deep_simple_mlp

Of course, we know that gradients can explode in simple networks, and it is necessary to damp the gradients to prevent this. But it is not exactly automated. This is what I mean by unstable.
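As an illustration of what damping the gradients can look like, here is a minimal sketch of clipping by global norm in plain Python. The text does not say which damping technique is meant, so treating it as norm clipping is an assumption; `clip_by_global_norm` here is a hypothetical stand-alone helper, not a call into TensorFlow, and `max_norm` is exactly the kind of threshold that has to be chosen by hand.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Hypothetical helper for illustration. Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Small gradients pass through unchanged.
        return list(grads), total_norm
    # Large gradients keep their direction but are shrunk to length max_norm.
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# An "exploding" gradient of norm 5 is shrunk to norm 1.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The threshold is a manual choice: set it too high and the update can still blow up, too low and training slows down. That is one sense in which the damping is not automated.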

To address this, I have been working with some collaborators at UC Berkeley on a general theory for Why Deep Learning Works

CC mmds talk 2016 (video coming soon)

Why does Deep Learning work?

The new idea is that, when a Deep Network is being trained, if it is designed properly, it behaves like a Spin Glass of Minimal Frustration (first postulated to explain protein folding)

Therefore, the energy landscape not only lacks local minima, but it also forms an n-dimensional funnel-like shape during training, which causes the energy landscape to acquire a strong attractor. This attractor works analogously to a critical point in statistical mechanics, described using renormalization group equations

Why Deep Learning Works II: the Renormalization Group

Indeed, it seems that RBMs, at least, look like they are solving a form of the variational renormalization group equations, in a rough sense.

This means deep networks are operating at a subcritical point, balanced between order and chaos. This also means that any generalization bound has to be strongly data dependent, and that the VC approach is not so useful.

These ideas are supported by looking at recent dynamical simulations and analysis using statistical mechanics

Exponential expressivity in deep neural networks through transient chaos (2016)

https://arxiv.org/pdf/1606.05340...

This would also suggest that deep networks would do well when they are self-similar, as self-similar solutions solve the RG equations. And, indeed, recent work suggests that very deep networks can be trained when they are self-similar.

FractalNet: Ultra-Deep Neural Networks without Residuals (2016)

Furthermore, recent work shows that real systems like textual / NLP datasets also display long-tailed correlations in the mutual information. This is characteristic of critical / subcritical behavior.

Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language (2016)
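To show what a correlation in the mutual information means operationally, here is a minimal sketch that estimates the mutual information between symbols a fixed distance apart in a sequence. It assumes a simple plug-in (frequency-count) estimator, which is known to be biased on short sequences; the function name `mutual_information_at_lag` is hypothetical, and this is not the estimator used in the cited paper.

```python
import math
from collections import Counter


def mutual_information_at_lag(seq, lag):
    """Plug-in estimate, in bits, of I(X_t; X_{t+lag}) for a symbol sequence.

    Hypothetical helper for illustration only.
    """
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                  # counts of (x_t, x_{t+lag})
    left = Counter(a for a, _ in pairs)     # marginal counts of x_t
    right = Counter(b for _, b in pairs)    # marginal counts of x_{t+lag}
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


# Toy usage: how the estimate changes with distance in a short string.
text = "the cat sat on the mat and the dog sat on the log " * 20
for lag in (1, 2, 4, 8, 16):
    print(lag, round(mutual_information_at_lag(text, lag), 4))
```

On a real corpus one would plot this estimate against the lag. A curve that falls off slowly, with a long tail, is the kind of behavior the paragraph above refers to.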

I believe I am the first to characterize why deep nets work using the strongly correlated Random Energy Model. That is, I extend the approach of LeCun, which uses the p-spin spherical spin glass, to include strong correlations induced by the data itself. This gives a very different, new picture of why deep learning works.

Good networks behave very well; bad networks diverge. That means that when, say, we don’t have enough regularization, we can have both the test accuracy and the training accuracy diverge. This is quite different from the VC description, where the problem is overtraining, not complete divergence. VC theory also has no notion of dynamical stability, which deep networks need in order to function.

So why do deep nets scale well with training data? My hypothesis: a deep net, when trained on a lot of well-labelled data, behaves like a spin glass of minimal frustration. There is a trade-off between the Energy to match the training data and the inherent Entropy of the network itself. A good training set will modify the Energy Landscape, which is otherwise wide and degenerate, causing it to become punctuated with deep wells that act like chaotic dynamical attractors.

Furthermore, the stable attractor is analogous to a subcritical point in a physical system, and the deep layers renormalize the free energy at each layer, with each layer keeping the others ‘in balance’. Regularization acts like a temperature control, keeping the system from collapsing into a glassy state, where overtraining occurs.
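For readers less familiar with the physics vocabulary, the energy, entropy and temperature in this hypothesis can be read against the standard Helmholtz free energy from statistical mechanics. Mapping the symbols onto a network is an assumption made here for illustration, not a derivation from the text:

```latex
F = E - T S
```

Under that reading, E stands for the energy needed to match the training data, S for the entropy of the network's configurations, and T for the effective temperature that regularization is said to control.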

Bio: Charles H Martin, PhD, is a Data Scientist & Machine Learning Expert in the San Francisco Bay Area, currently with Calculation Consulting.
