
Unit 6.3 – Using More Advanced Optimization Algorithms


What we covered in this video lecture

In this lecture, we learned how to modify the stochastic gradient descent (SGD) optimizer with a momentum term. Momentum is a technique for accelerating the training of SGD algorithms. It improves the convergence rate and stability of the optimization process by smoothing out the gradient estimates, and it can help the optimizer avoid local minima and saddle points by carrying momentum across flat regions of the loss surface.
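To make the momentum update concrete, here is a minimal single-parameter sketch (the function name, hyperparameter values, and toy loss are illustrative choices of mine, not a library API):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    # The velocity is an exponentially decaying sum of past gradients;
    # a higher momentum coefficient gives past gradients more weight.
    velocity = momentum * velocity - lr * grad
    # The parameter moves along the smoothed direction, not the raw gradient.
    return w + velocity, velocity

# Toy example: minimize L(w) = w**2, whose gradient is 2*w.
w, v = 1.0, 0.0
for _ in range(20):
    w, v = sgd_momentum_step(w, 2.0 * w, v)
```

PyTorch's `torch.optim.SGD` exposes the same idea via its `momentum` argument (its exact update rule differs slightly in where the learning rate enters).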

Then, we covered the popular Adam algorithm. The Adam optimizer (short for “adaptive moment estimation”) is an extension of SGD that combines momentum with adaptive learning rates to accelerate convergence. The basic idea of Adam is to compute an individual learning rate for each parameter in the network based on estimates of the mean and variance of that parameter's gradients. These estimates are then used to adjust the learning rate of each parameter during training.
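The update can be sketched for a single parameter as follows (the variable names and toy loss are mine; the constants are Adam's standard default hyperparameters):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients (mean estimate)
    v = beta2 * v + (1 - beta2) * grad ** 2  # moving average of squared gradients (variance estimate)
    m_hat = m / (1 - beta1 ** t)             # bias correction, since m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size lr / (sqrt(v_hat) + eps).
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy example: minimize L(w) = w**2, gradient 2*w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
```

PyTorch provides the full implementation as `torch.optim.Adam`.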

Additional resources if you want to learn more

A popular alternative to Adam is AdamW, which stands for “Adam with Weight Decay.” Weight decay is a regularization technique used in deep learning and other machine learning algorithms to prevent overfitting. Weight decay is often implemented via a modification of the optimizer that encourages smaller weights as training progresses (if you are familiar with L2-regularization, weight decay is a somewhat related concept, as I explain here).

Both Adam and AdamW support weight decay, but AdamW is an improvement over Adam if you use weight decay, as AdamW decouples the weight decay from the adaptive learning rate update step. Instead of applying weight decay within the update rule for the first- and second-moment estimates, it applies weight decay directly to the weights during the update step. This decoupling of weight decay from the adaptive learning rate update leads to better training performance and generalization. If you are interested in AdamW, you can find out more about it in the Decoupled Weight Decay Regularization paper. AdamW is also available in PyTorch.
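The decoupling can be sketched as follows (again a single-parameter toy of my own, not PyTorch's implementation): note that the weight-decay term bypasses the moment estimates entirely.

```python
import math

def adamw_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # The moment estimates see only the raw gradient -- no decay term here,
    # which is the difference from adding L2-regularization to the loss.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the weight shrinks directly, outside the
    # adaptive (variance-scaled) part of the update.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# With a zero gradient, the only remaining effect is the direct shrinkage.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    w, m, v = adamw_step(w, 0.0, m, v, t)
```

In PyTorch, `torch.optim.AdamW` implements this decoupled variant.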

