Fastai offers some really good courses in machine learning and deep learning for programmers. I recently took their "Practical Deep Learning for Coders" course and found it really interesting. Here are my learnings from the course.

By Raimi Bin Karim, AI Singapore

Everyone’s talking about the fast.ai Massive Open Online Course (MOOC), so I decided to have a go at their 2019 deep learning course, Practical Deep Learning for Coders, v3. I’ve always known some deep learning concepts and ideas (I’ve been in this field for about a year now, dealing mostly with computer vision), but never really understood some of the intuitions or explanations. I also know that Jeremy Howard, Rachel Thomas and Sylvain Gugger (follow them on Twitter!) are influential people in the deep learning sphere (Jeremy has a lot of experience with Kaggle competitions), so I hoped to gain new insights and intuitions, and some tips and tricks for model training, from them. I have so much to learn from these folks.

So, here I am after 3 weeks of watching the videos (I didn’t do any exercises 🤫🤫🤫🤫), writing this post to compartmentalise the new things I learnt and share them with you. There were of course some things I was clueless about, so I did a little more research on them and have presented them in this article. Towards the end, I also write about how I felt about the course (spoiler alert: I love it ❣️).

Disclaimer

Different people gather different learning points, depending on their deep learning background. This post is not recommended for beginners of deep learning, and it is not a summary of the course contents. Instead, it assumes you have basic knowledge of neural networks, gradient descent, loss functions, regularisation techniques and generating embeddings. Some experience with the following would be great too: image classification, text classification, semantic segmentation and generative adversarial networks.

I organised my 10 learning points as follows: from the theory of neural networks, to architectures, to things related to the loss function (learning rate, optimiser), to model training (and regularisation), to deep learning tasks, and finally to model interpretability.

Contents: 10 New Things I Learnt

0. Fast.ai & Transfer Learning

“It’s always good to use transfer learning [to train your model] if you can.” — Jeremy Howard

Fast.ai is synonymous with transfer learning and achieving great results in a short amount of time, and the course really lives up to its name. Transfer learning and experimentation are the two key ideas Jeremy Howard keeps emphasising for becoming an efficient machine learning practitioner.

1. The Universal Approximation Theorem

The universal approximation theorem says that you can approximate any function with a feed-forward neural network that has just one hidden layer. It follows that any deeper neural network can achieve the same kind of approximation. I mean, wow! I only just got to know this. This is the foundation of deep learning: if you have stacks of affine functions (or matrix multiplications) and nonlinear functions, the thing you end up with can approximate any function arbitrarily closely. This is the reason behind the race for different combinations of affine functions and nonlinearities, and it is the reason why architectures are getting deeper.
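To see this in action, here is a minimal PyTorch sketch (my own, not from the course) in which a single-hidden-layer network (an affine map, a nonlinearity, then another affine map) learns to approximate sin(x) on a small interval; the hidden width, optimiser and number of steps are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One hidden layer: affine -> nonlinearity -> affine
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)                      # the target function to approximate

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    loss = F.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())   # should end up close to 0: one hidden layer has fit sin(x) on [-3, 3]
```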
2. The Architectures

In this section, I will highlight the architectures that were in the limelight during the course, as well as certain design components incorporated into state-of-the-art (SOTA) models, such as dropout.
Dropout: at random, we throw away activations. Intuition: this way, no single activation can memorise any part of the input. This helps with overfitting, where some part of the model essentially learns to recognise a particular image rather than a particular feature or item. There is also embedding dropout, but that was only briefly touched on.
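As a quick illustration (a minimal PyTorch sketch, not taken from the course notebooks), the same dropout layer behaves differently in training and in evaluation mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)      # each activation is zeroed out with probability 0.5
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the activations are zeroed; survivors are scaled by 1/(1 - p)

drop.eval()
print(drop(x))   # at inference time dropout does nothing: all ones pass through
```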
Batch normalisation (BatchNorm) does 2 things: (1) normalise the activations, and (2) introduce scaling and shifting parameters for each normalised activation. However, it turns out that (1) is not as important as (2). In the paper How Does Batch Normalization Help Optimization?, it was mentioned that “[BatchNorm] reparametrizes the underlying optimization problem to make its landscape significantly smooth[er].” The intuition is this: because the loss landscape is now less bumpy, we can use a higher learning rate, and hence get faster convergence (see Fig. 3.1).
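Here is a minimal PyTorch sketch (mine, not from the course) showing both parts: the per-feature normalisation, and the learnable scale (weight, i.e. gamma) and shift (bias, i.e. beta) parameters.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)

x = torch.randn(16, 4) * 5 + 3      # a batch with non-zero mean and large variance
y = bn(x)

# (1) normalisation: each feature is roughly zero-mean, unit-variance across the batch
print(y.mean(dim=0), y.std(dim=0))

# (2) scaling and shifting: one learnable gamma (weight) and beta (bias) per feature,
# initialised to 1 and 0, and updated by gradient descent during training
print(bn.weight, bn.bias)
```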
3. Understanding the Loss Landscape

Fig 3.1: Loss landscapes; the left landscape has many bumps, the right one is smooth. Source: https://arxiv.org/abs/1712.09913

Loss functions usually have bumpy and flat areas (if you visualise them in 2D or 3D diagrams). Have a look at Fig. 3.2. If you end up in a bumpy area, that solution will tend not to generalise very well. This is because you found a solution that is good in one place but not very good elsewhere. But if you find a solution in a flat area, you will probably generalise well, and that’s because you found a solution that is not only good at one spot, but around it as well.

Fig 3.2: Loss landscape visualised in a 2D diagram. Screenshot from course.fast.ai.

Most of the above paragraph is quoted from Jeremy Howard. Such a simple and beautiful explanation.

4. Gradient Descent Optimisers
The new thing I learnt was that the RMSprop optimiser acts as an “accelerator”: it divides each update by (the square root of) a moving average of the squared gradients. Intuition: if your gradient has been small for the past few steps, obviously you need to go a little bit faster now, and dividing by that small moving average does exactly that.
(For an overview of gradient descent optimisers, I have written a post titled 10 Gradient Descent Optimisation Algorithms.)
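To make the “accelerator” intuition concrete, here is a minimal sketch of the standard RMSprop update in plain NumPy (an illustration, not any particular library’s implementation); when the moving average of squared gradients is small, the division makes the effective step larger.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq_grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update for parameters `theta` given gradient `grad`."""
    # Exponential moving average of squared gradients
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Small recent gradients -> small avg_sq_grad -> larger effective step (the "accelerator")
    theta = theta - lr * grad / (np.sqrt(avg_sq_grad) + eps)
    return theta, avg_sq_grad
```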
Learnt 2 new loss functions: