Dark Knowledge Distilled from Neural Network


oliyiyi posted on 2016-01-24 10:19:01


For almost any machine learning algorithm, a very simple way to improve its performance is to train many different models on the same data and then average their predictions. Unfortunately, this requires too much computation at test time.
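To make the cost concrete, here is a minimal sketch (not code from the paper) of ensemble averaging at test time in PyTorch; `models` and `x` are assumed to be a list of already-trained classifiers and an input batch, and the names are illustrative only.

```python
import torch

def ensemble_predict(models, x):
    """Average the class probabilities of several independently trained models.

    `models` is any iterable of trained torch.nn.Module classifiers that map a
    batch `x` to unnormalised logits; the names here are illustrative only.
    """
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(x), dim=1))
    # The ensemble prediction is the mean of the individual distributions,
    # which is why test-time cost grows linearly with the number of models.
    return torch.stack(probs).mean(dim=0)
```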
Geoffrey Hinton et al. argued that when extracting knowledge from data, we do not need to worry about test-time computation: it is not hard to distill most of the knowledge into a smaller model for deployment, which they called “dark knowledge”. A slightly creepy name, but not in a harmful way.

Their paper on dark knowledge includes two main parts.

  • Model compression: transfer the knowledge learned by the ensemble into a single smaller model to reduce test-time computation.
  • Specialist networks: train models that specialize in a confusable subset of the classes to reduce the time needed to train an ensemble.

Let’s compress the cumbersome models.

The principal idea is that once the cumbersome model has been trained, we can use a different kind of training, called ‘distillation’, to transfer the knowledge to a single smaller model. A similar idea was introduced by Caruana and colleagues in 2006, although they used a different way of transferring the knowledge, which Hinton shows is a special case of distillation.

To put it another way: for each data point, we have a predicted distribution from the bigger ensemble network. We then train the smaller network using this output distribution as “soft targets” instead of the true labels. In this way, the small model learns to generalize in the same way as the large model.

Before training the small model, Hinton increases the entropy of the posteriors and gets a much softer distribution by using a transform that “raises the temperature”. A higher temperature produces a softer probability distribution over the classes.
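The temperature transform is just the softmax computed on logits divided by T, i.e. q_i = exp(z_i / T) / Σ_j exp(z_j / T). A minimal sketch:

```python
import torch

def softmax_with_temperature(logits, T=1.0):
    """Softmax over class logits at temperature T.

    T = 1 recovers the usual softmax; raising T increases the entropy of the
    distribution, exposing the small probabilities the cumbersome model
    assigns to the wrong classes (the "dark knowledge").
    """
    return torch.softmax(logits / T, dim=-1)

# Example: a confident prediction becomes noticeably softer at T = 4.
logits = torch.tensor([10.0, 4.0, 1.0])
print(softmax_with_temperature(logits, T=1.0))  # roughly [0.997, 0.002, 0.000]
print(softmax_with_temperature(logits, T=4.0))  # roughly [0.75, 0.17, 0.08]
```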

The result on MNIST is surprisingly good, even when they tried omitting a digit. When they omitted all examples of the digit 3 during the transfer training, the distilled net still got 98.6% of the test 3s correct, even though 3 is a mythical digit it had never seen.

Train Specialists to Disambiguate

Another issue is that the computation required at training time can also be excessive, so how can we make an ensemble mine knowledge more efficiently? Hinton’s answer is to encourage each “specialist” to focus on resolving a different set of confusions.

Each specialist estimates two things: whether the image belongs to its special subset, and what the relative probabilities of the classes within that subset are. For example, on ImageNet we can have one “specialist” look only at examples of mushrooms while another focuses on sports cars.
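As a rough sketch of that idea (not the paper’s code), a specialist head can output probabilities over its own subset plus one extra bin that lumps together every class outside the subset; `feature_dim` and `subset_classes` below are hypothetical names.

```python
import torch
import torch.nn as nn

class Specialist(nn.Module):
    """Sketch of a specialist head over a confusable subset of classes.

    It predicts a distribution over the specialist's own classes plus a single
    'all other classes' bin, which covers both questions at once: is the input
    in my subset, and if so, which of my classes is it?
    """
    def __init__(self, feature_dim, subset_classes):
        super().__init__()
        self.subset_classes = list(subset_classes)   # e.g. the mushroom classes
        # one extra output collects everything outside the specialist's subset
        self.head = nn.Linear(feature_dim, len(self.subset_classes) + 1)

    def forward(self, features):
        return torch.softmax(self.head(features), dim=-1)
```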

More technical details and examples can be found in their paper and slides.

Experiment on MNIST

Vanilla backprop in a 784 -> 800 -> 800 -> 10 net with rectified linear hidden units gives 146 test errors.
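For concreteness, that baseline architecture can be written as a plain feed-forward module; this is only a sketch of the stated layer sizes, not the original training script.

```python
import torch.nn as nn

# The 784 -> 800 -> 800 -> 10 net with rectified linear hidden units,
# written as a plain feed-forward module (a sketch of the stated sizes only).
small_net = nn.Sequential(
    nn.Linear(784, 800), nn.ReLU(),
    nn.Linear(800, 800), nn.ReLU(),
    nn.Linear(800, 10),  # class logits; the softmax is applied inside the loss
)
```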

Training a 784 -> 1200 -> 1200 -> 10 net using dropout, weight constraints, and jittered inputs eventually gets down to 67 errors.

It turns out that much of this improvement can be transferred to a much smaller neural net. Using both the soft targets obtained from the big net and the hard targets, the 784 -> 800 -> 800 -> 10 net gets 74 errors.
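A minimal sketch of how the two kinds of targets can be combined into one objective, assuming the student’s and teacher’s logits for the same batch are available; `T` and `alpha` are illustrative hyperparameters, and the soft term is scaled by T² following the paper’s suggestion for keeping its gradient magnitude comparable to the hard term.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft-target term.

    T = 4 and alpha = 0.5 are just illustrative values to tune.
    """
    # ordinary cross-entropy against the true (hard) labels
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between teacher and student distributions at temperature T,
    # scaled by T**2 so its gradients stay comparable to the hard term
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * hard + (1.0 - alpha) * soft
```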

  • The transfer training uses the same training set, but with no dropout and no jitter.
  • It is just vanilla backprop (with the added soft targets).
  • The soft targets contain almost all of the knowledge.
  • The big net learns a similarity metric for the training digits even though this isn’t the objective function for learning.



hjtoh posted on 2016-01-24 10:32:54 (from mobile)
Quoting oliyiyi (2016-1-24 10:19):
For almost any machine learning algorithm, a very simple way to improve its performance is to trai ...
My knowledge of algorithms is still quite lacking.


kyle2014 posted on 2016-07-07 09:59:40
thanks

