Original poster: oliyiyi

Learning to Learn by Gradient Descent by Gradient Descent

Learning to learn by gradient descent by gradient descent, Andrychowicz et al., NIPS 2016

One of the things that strikes me when I read these NIPS papers is just how short some of them are – between the introduction and the evaluation sections you might find only one or two pages! A general form is to start out with a basic mathematical model of the problem domain, expressed in terms of functions. Selected functions are then learned, by reaching into the machine learning toolbox and combining existing building blocks in potentially novel ways. When looked at this way, we could really call machine learning ‘function learning’.

Thinking in terms of functions like this is a bridge back to the familiar (for me at least). We have function composition. For example, given a function f mapping images to feature representations, and a function g acting as a classifier mapping image feature representations to objects, we can build a system that classifies objects in images with g ∘ f.
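As a toy illustration in Haskell (the names and placeholder implementations are mine, not from the paper), the composition looks like this:

   type Image    = [Double]   -- stand-in representations for the sketch
   type Features = [Double]
   data Object   = Cat | Dog deriving Show

   f :: Image -> Features     -- feature extractor (could itself be learned)
   f = map (* 2)              -- placeholder implementation

   g :: Features -> Object    -- classifier over feature representations
   g feats = if sum feats > 0 then Cat else Dog

   classifyImage :: Image -> Object
   classifyImage = g . f      -- g ∘ f, as in the text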

Each function in the system model could be learned or just implemented directly with some algorithm. For example, feature mappings (or encodings) were traditionally implemented by hand, but increasingly are learned...

The move from hand-designed features to learned features in machine learning has been wildly successful.

Part of the art seems to be to define the overall model in such a way that no individual function needs to do too much (avoiding too big a gap between the inputs and the target output) so that learning becomes more efficient / tractable, and we can take advantage of different techniques for each function as appropriate. In the above example, we composed one learned function for creating good representations, and another function for identifying objects from those representations.

We can have higher-order functions that combine existing (learned or otherwise) functions, and of course that means we can also use combinators.
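For instance (a toy combinator of my own, not from the paper), majority voting over an ensemble of classifiers is itself just a higher-order function:

   import Data.List (group, sort, sortOn)

   -- Combine several classifiers into one by taking a majority vote
   -- over their individual predictions.
   vote :: Ord c => [i -> c] -> (i -> c)
   vote classifiers x =
     head . last . sortOn length . group . sort $ map ($ x) classifiers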

And what do we find when we look at the components of a ‘function learner’ (machine learning system)? More functions!

Frequently, tasks in machine learning can be expressed as the problem of optimising an objective function f(θ) defined over some domain θ ∈ Θ.

The optimiser function maps from f to argmin_{θ ∈ Θ} f(θ). The standard approach is to use some form of gradient descent (e.g., SGD – stochastic gradient descent). A classic paper in optimisation is ‘No Free Lunch Theorems for Optimization’, which tells us that no general-purpose optimisation algorithm can dominate all others. So to get the best performance, we need to match our optimisation technique to the characteristics of the problem at hand:

... specialisation to a subclass of problems is in fact the only way that improved performance can be achieved in general.

Thus there has been a lot of research in defining update rules tailored to different classes of problems – within deep learning these include for example momentum, Rprop, Adagrad, RMSprop, and ADAM.
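All of these are elaborations of the same basic step. As a minimal sketch (the learning rate η and the gradient function are assumed inputs):

   -- One step of vanilla gradient descent: θ' = θ - η ∇f(θ).
   -- Rules like momentum, RMSprop, and ADAM replace this update with
   -- more elaborate functions of the gradient history.
   sgdStep :: Double                  -- learning rate η
           -> ([Double] -> [Double]) -- gradient ∇f
           -> [Double] -> [Double]   -- θ to θ'
   sgdStep eta gradF theta = zipWith (\t g -> t - eta * g) theta (gradF theta)

   -- Iterating the step yields the optimisation trajectory.
   descend :: Double -> ([Double] -> [Double]) -> [Double] -> [[Double]]
   descend eta gradF = iterate (sgdStep eta gradF)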

But what if, instead of hand-designing an optimisation algorithm (function), we learn it too? That way, by training on the class of problems we’re interested in solving, we can learn an optimal optimiser for the class!

The goal of this work is to develop a procedure for constructing a learning algorithm which performs well on a particular class of optimisation problems. Casting algorithm design as a learning problem allows us to specify the class of problems we are interested in through example problem instances. This is in contrast to the ordinary approach of characterising properties of interesting problems analytically and using these analytical insights to design learning algorithms by hand.

If learned representations end up performing better than hand-designed ones, can learned optimisers end up performing better than hand-designed ones too? The answer turns out to be yes!

Our experiments have confirmed that learned neural optimizers compare favorably against state-of-the-art optimization methods used in deep learning.

In fact not only do these learned optimisers perform very well, but they also provide an interesting way to transfer learning across problem sets. Traditionally transfer learning is a hard problem studied in its own right. But in this context, because we’re learning how to learn, straightforward generalisation (the key property of ML that lets us learn on a training set and then perform well on previously unseen examples) provides for transfer learning!

We witnessed a remarkable degree of transfer, with for example the LSTM optimizer trained on 12,288 parameter neural art tasks being able to generalize to tasks with 49,152 parameters, different styles, and different content images all at the same time. We observed similar impressive results when transferring to different architectures in the MNIST task.

Learning how to learn


Thinking functionally, here’s my mental model of what’s going on… In the beginning, you might have hand-coded a classifier function, c, which maps from some Input to a Class:

   c :: Input -> Class

With machine learning, we figured out that for certain types of functions it’s better to learn an implementation than to try and code it by hand. An optimisation function f takes some TrainingData and an existing classifier function, and returns an updated classifier function:

   type Classifier = (Input -> Class)

   f :: TrainingData -> Classifier -> Classifier
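As a concrete toy instance (my own illustration, assuming a linear representation), one epoch of the perceptron rule has exactly this shape if we carry the classifier’s weights explicitly:

   type Input = [Double]
   type Class = Bool
   type TrainingData = [(Input, Class)]

   -- 'classify w' turns a weight vector into a Classifier.
   classify :: [Double] -> (Input -> Class)
   classify w x = sum (zipWith (*) w x) > 0

   -- One epoch of the perceptron rule: TrainingData -> weights -> weights,
   -- mirroring f :: TrainingData -> Classifier -> Classifier on the
   -- weight representation.
   perceptron :: TrainingData -> [Double] -> [Double]
   perceptron d w0 = foldl step w0 d
     where
       step w (x, y)
         | classify w x == y = w
         | otherwise         = zipWith (+) w (map (* signal) x)
         where signal = if y then 1 else -1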

What we’re doing now is saying, “well, if we can learn a function, why don’t we learn f itself?”

   type Optimiser = (TrainingData -> Classifier -> Classifier)

   g :: TrainingData -> Optimiser -> Optimiser

Let ϕ be the (to-be-learned) update rule for our (optimiser) optimiser. We need to evaluate how effective g is over a number of iterations, and for this reason g is modelled using a recurrent neural network (LSTM). The state of this network at time t is represented by h_t.

Suppose we are training g to optimise an optimisation function f. Let θ*(f, ϕ) be the set of parameters that g(ϕ) learns for f. The loss function for training g(ϕ) is then the expected loss of f as trained by g(ϕ): L(ϕ) = E_f[ f(θ*(f, ϕ)) ].

We can minimise the value of L(ϕ) using gradient descent on ϕ. In practice the paper trains on a loss summed along the whole optimisation trajectory, L(ϕ) = E_f[ Σ_t w_t f(θ_t) ] (with w_t = 1), so that useful gradients flow at every step of the unrolled computation.
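A minimal functional sketch of that unrolled meta-loss (the names and representations are my own; a real implementation backpropagates through the unrolled graph rather than treating the inner loop as a black box):

   type Theta = [Double]
   type Grad  = [Double]

   -- A learned update rule: given the current gradient and its own
   -- recurrent state, propose a parameter update and a new state.
   type UpdateRule s = Grad -> s -> (Theta, s)

   -- Run the learned optimiser for k steps on f and sum the losses
   -- along the trajectory: Σ_t f(θ_t), i.e. w_t = 1 for all t.
   metaLoss :: (Theta -> Double)  -- objective f
            -> (Theta -> Grad)    -- gradient of f
            -> UpdateRule s       -- the optimiser produced by g(ϕ)
            -> s                  -- initial optimiser state h_0
            -> Theta              -- initial parameters θ_0
            -> Int                -- number of unrolled steps k
            -> Double
   metaLoss f gradF update s0 theta0 k =
     sum (map f (take k (trajectory s0 theta0)))
     where
       trajectory s theta =
         let (delta, s') = update (gradF theta) s
             theta'      = zipWith (+) theta delta
         in theta' : trajectory s' theta'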

To scale to tens of thousands of parameters or more, the optimiser network m operates coordinatewise on the parameters of the objective function, similar to update rules like RMSprop and ADAM. The update rule for each coordinate is implemented using a two-layer LSTM network with a forget-gate architecture.

The network takes as input the optimizee gradient for a single coordinate as well as the previous hidden state and outputs the update for the corresponding optimizee parameter. We refer to this architecture as an LSTM optimizer.
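Functionally, the coordinatewise scheme is just a zip (a sketch with assumed types; the shared cell stands in for the two-layer LSTM):

   -- The same cell (and hence the same learned ϕ) is applied to every
   -- coordinate, but each coordinate carries its own hidden state.
   coordinatewise :: (Double -> h -> (Double, h))  -- cell m for one coordinate
                  -> [Double]                      -- optimizee gradients
                  -> [h]                           -- per-coordinate hidden states
                  -> ([Double], [h])               -- updates and new states
   coordinatewise cell grads states = unzip (zipWith cell grads states)

Sharing the cell across coordinates is what lets a single small network scale to optimizees with tens of thousands of parameters.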

