Dark Knowledge Distilled from Neural Network


oliyiyi posted on 2016-01-24 10:19:01


For almost any machine learning algorithm, a very simple way to improve its performance is to train many different models on the same data and then average their predictions. Unfortunately, this requires too much computation at test time.
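To make the cost concrete, here is a minimal sketch (not code from the paper) of ensemble averaging at test time in PyTorch; `models` and `x` are assumed to be a list of already-trained classifiers and an input batch, and the names are illustrative only.

```python
import torch

def ensemble_predict(models, x):
    """Average the class probabilities of several independently trained models.

    `models` is any iterable of trained torch.nn.Module classifiers that map a
    batch `x` to unnormalised logits; the names here are illustrative only.
    """
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(x), dim=1))
    # The ensemble prediction is the mean of the individual distributions,
    # which is why test-time cost grows linearly with the number of models.
    return torch.stack(probs).mean(dim=0)
```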
Geoffrey Hinton et al. argued that when extracting knowledge from data, we do not need to worry about test-time computation: it is not hard to distill most of the knowledge into a smaller model for deployment, which they called “dark knowledge”. A slightly creepy name, but not in a harmful way.

Their paper on dark knowledge includes two main parts.

  • Model compression: transfer the knowledge learned by the ensemble into a single smaller model to reduce test-time computation.
  • Specialist networks: train models that specialize in a confusable subset of the classes to reduce the time needed to train an ensemble.

Let’s compress the cumbersome models.

The principal idea is that once the cumbersome model has been trained, we can use a different kind of training, called ‘distillation’, to transfer the knowledge to a single smaller model. A similar idea was introduced by Caruana and colleagues in 2006, although they used a different way of transferring the knowledge, which Hinton shows is a special case of distillation.

To put it another way: for each data point, we have a predicted distribution from the bigger ensemble network. We then train the smaller network using this output distribution as “soft targets” instead of the true labels. In this way, the small model learns to generalize in the same way as the large model.

Before training the small model, Hinton increases the entropy of the posteriors and gets a much softer distribution by using a transform that “raises the temperature”. A higher temperature produces a softer probability distribution over the classes.
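The temperature transform is just the softmax computed on logits divided by T, i.e. q_i = exp(z_i / T) / Σ_j exp(z_j / T). A minimal sketch:

```python
import torch

def softmax_with_temperature(logits, T=1.0):
    """Softmax over class logits at temperature T.

    T = 1 recovers the usual softmax; raising T increases the entropy of the
    distribution, exposing the small probabilities the cumbersome model
    assigns to the wrong classes (the "dark knowledge").
    """
    return torch.softmax(logits / T, dim=-1)

# Example: a confident prediction becomes noticeably softer at T = 4.
logits = torch.tensor([10.0, 4.0, 1.0])
print(softmax_with_temperature(logits, T=1.0))  # roughly [0.997, 0.002, 0.000]
print(softmax_with_temperature(logits, T=4.0))  # roughly [0.75, 0.17, 0.08]
```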

The result on MNIST is surprisingly good, even when they tried omitting a digit. When they omitted all examples of the digit 3 during the transfer training, the distilled net still got 98.6% of the test 3s correct, even though 3 is a mythical digit it had never seen.

Train Specialists to Disambiguate

Another issue is that the computation required at training time can also be excessive, so how can we make an ensemble mine knowledge more efficiently? Hinton’s answer is to encourage each “specialist” to focus on resolving a different set of confusions.

Each specialist estimates two things: whether the image belongs to its special subset, and what the relative probabilities of the classes within that subset are. For example, on ImageNet we can have one “specialist” look only at examples of mushrooms while another focuses on sports cars.
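As a rough sketch of that idea (not the paper’s code), a specialist head can output probabilities over its own subset plus one extra bin that lumps together every class outside the subset; `feature_dim` and `subset_classes` below are hypothetical names.

```python
import torch
import torch.nn as nn

class Specialist(nn.Module):
    """Sketch of a specialist head over a confusable subset of classes.

    It predicts a distribution over the specialist's own classes plus a single
    'all other classes' bin, which covers both questions at once: is the input
    in my subset, and if so, which of my classes is it?
    """
    def __init__(self, feature_dim, subset_classes):
        super().__init__()
        self.subset_classes = list(subset_classes)   # e.g. the mushroom classes
        # one extra output collects everything outside the specialist's subset
        self.head = nn.Linear(feature_dim, len(self.subset_classes) + 1)

    def forward(self, features):
        return torch.softmax(self.head(features), dim=-1)
```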

More technical details and examples can be found in their paper and slides.

Experiment on MNIST

Vanilla backprop in a 784 -> 800 -> 800 -> 10 net with rectified linear hidden units gives 146 test errors.
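For concreteness, that baseline architecture can be written as a plain feed-forward module; this is only a sketch of the stated layer sizes, not the original training script.

```python
import torch.nn as nn

# The 784 -> 800 -> 800 -> 10 net with rectified linear hidden units,
# written as a plain feed-forward module (a sketch of the stated sizes only).
small_net = nn.Sequential(
    nn.Linear(784, 800), nn.ReLU(),
    nn.Linear(800, 800), nn.ReLU(),
    nn.Linear(800, 10),  # class logits; the softmax is applied inside the loss
)
```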

Training a 784 -> 1200 -> 1200 -> 10 net using dropout, weight constraints, and jittered inputs eventually gets down to 67 errors.

It turns out that much of this improvement can be transferred to a much smaller neural net. Using both the soft targets obtained from the big net and the hard targets, the 784 -> 800 -> 800 -> 10 net gets 74 errors.
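A minimal sketch of how the two kinds of targets can be combined into one objective, assuming the student’s and teacher’s logits for the same batch are available; `T` and `alpha` are illustrative hyperparameters, and the soft term is scaled by T² following the paper’s suggestion for keeping its gradient magnitude comparable to the hard term.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft-target term.

    T = 4 and alpha = 0.5 are just illustrative values to tune.
    """
    # ordinary cross-entropy against the true (hard) labels
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between teacher and student distributions at temperature T,
    # scaled by T**2 so its gradients stay comparable to the hard term
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * hard + (1.0 - alpha) * soft
```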

  • The transfer training uses the same training set, but with no dropout and no jitter.
  • It is just vanilla backprop (with the added soft targets).
  • The soft targets contain almost all of the knowledge.
  • The big net learns a similarity metric for the training digits even though this isn’t the objective function for learning.



hjtoh posted on 2016-01-24 10:32:54 (from mobile)
Quoting oliyiyi (2016-1-24 10:19):
For almost any machine learning algorithm, a very simple way to improve its performance is to trai ...
My knowledge of algorithms is still quite lacking.


kyle2014 posted on 2016-07-07 09:59:40
thanks

