Deep Learning Fundamentals

# Unit 6.3 – Using More Advanced Optimization Algorithms

#### References

- Qian (1999). On The Momentum Term In Gradient Descent Learning Algorithms
- Schmidt, Schneider, Hennig (2021). Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers
- Kingma, Ba (2014). Adam: A Method for Stochastic Optimization
- PyTorch Adam documentation

**What we covered in this video lecture**

In this lecture, we learned how to modify the stochastic gradient descent (SGD) optimizer with a momentum term. Momentum is a technique for accelerating SGD training: it keeps a running accumulation of past gradients and updates the weights along that accumulated direction. This smooths out noisy gradient estimates, which improves the convergence rate and stability of the optimization process. It can also help the optimizer escape shallow local minima and saddle points by carrying momentum across flat regions of the loss surface.
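To make the momentum idea concrete, here is a minimal pure-Python sketch on a toy one-dimensional loss. The function names `sgd_momentum_step` and `minimize`, the toy loss, and the hyperparameter values are assumptions chosen for illustration; the update follows the common "velocity" formulation (the same form that the `momentum` argument of `torch.optim.SGD` uses), not necessarily the exact notation from the lecture.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    # The velocity is a running accumulation of past gradients.
    velocity = momentum * velocity + grad
    # Step along the accumulated direction instead of the raw gradient.
    param = param - lr * velocity
    return param, velocity


def minimize(momentum, steps=50, lr=0.01):
    """Minimize the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    w, v = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dL/dw
        w, v = sgd_momentum_step(w, grad, v, lr=lr, momentum=momentum)
    return w


w_plain = minimize(momentum=0.0)     # plain gradient descent
w_momentum = minimize(momentum=0.9)  # same learning rate, with momentum
print(f"plain: {w_plain:.4f}, with momentum: {w_momentum:.4f}")
```

With the same small learning rate and the same number of steps, the momentum run ends up closer to the minimum at `w = 3`, because consecutive gradients point in the same direction and the velocity builds up.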

Then, we covered the popular Adam algorithm. The Adam optimizer (short for “adaptive moment estimation”) is an extension of SGD that combines momentum with adaptive learning rates to accelerate convergence. The basic idea of Adam is to compute an individual learning rate for each parameter in the network based on running estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients. These estimates are then used to adjust the step size of each parameter during training.
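The per-parameter scaling can be sketched in a few lines of pure Python. This is a simplified scalar version of the update rule from the Kingma and Ba paper, using the paper's default hyperparameters; the function name `adam_step` and the two example gradients are assumptions made for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    # Running estimate of the gradient mean (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # Running estimate of the squared gradient (second, uncentered moment).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both estimates start at zero and are too small early on.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is the learning rate scaled by the normalized gradient.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two parameters whose gradients differ by four orders of magnitude.
steps = []
for grad in (100.0, 0.01):
    old_param = 1.0
    new_param, _, _ = adam_step(old_param, grad, m=0.0, v=0.0, t=1)
    steps.append(old_param - new_param)
    print(f"grad={grad:>6}: step size = {steps[-1]:.6f}")
```

Both parameters move by roughly the learning rate on the first step, even though their raw gradients are very different. This is the sense in which Adam gives each parameter its own effective learning rate.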

**Additional resources if you want to learn more**

A popular alternative to Adam is AdamW, which stands for “Adam with Weight Decay.” Weight decay is a regularization technique used in deep learning and other machine learning algorithms to prevent overfitting. Weight decay is often implemented via a modification of the optimizer to encourage smaller weights as training progresses. (If you are familiar with L2-regularization, weight decay is a somewhat related concept, as I explain here.)

Both Adam and AdamW support weight decay, but AdamW is an improvement over Adam if you use weight decay, as AdamW decouples weight decay from the adaptive learning rate update step. Instead of adding the weight decay term to the gradient, where it would pass through the first- and second-moment estimates, AdamW applies weight decay directly to the weights during the update step. This decoupling of weight decay from the adaptive learning rate update can lead to better training performance and generalization. If you are interested in AdamW, you can find out more about it in the Decoupled Weight Decay Regularization paper. AdamW is also available in PyTorch.
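The difference between the two ways of applying weight decay can be sketched with a scalar toy example. The helper names (`adam_update`, `adam_l2_step`, `adamw_step`) and the hyperparameter values are assumptions made for illustration, and the loss gradient is set to zero so that only the effect of the decay term is visible.

```python
import math


def adam_update(grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam step (to subtract from the weight) and the new moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, grad, m, v, t, lr=1e-3, weight_decay=1e-2):
    # Coupled: the decay term is added to the gradient, so it is
    # rescaled by the moment estimates like any other gradient.
    step, m, v = adam_update(grad + weight_decay * w, m, v, t, lr)
    return w - step, m, v


def adamw_step(w, grad, m, v, t, lr=1e-3, weight_decay=1e-2):
    # Decoupled: shrink the weight directly, then take the Adam step
    # computed from the raw loss gradient only.
    w = w * (1 - lr * weight_decay)
    step, m, v = adam_update(grad, m, v, t, lr)
    return w - step, m, v


# Loss gradient is zero here, so any change in w comes from weight decay alone.
shrink = {}
for w0 in (10.0, 0.1):
    w_l2, _, _ = adam_l2_step(w0, 0.0, 0.0, 0.0, t=1)
    w_dec, _, _ = adamw_step(w0, 0.0, 0.0, 0.0, t=1)
    shrink[w0] = (w0 - w_l2, w0 - w_dec)
    print(f"w0={w0:>5}: coupled shrink={shrink[w0][0]:.2e}, decoupled shrink={shrink[w0][1]:.2e}")
```

In the coupled variant, the adaptive scaling normalizes the decay term, so a large weight and a small weight shrink by about the same amount. In the decoupled variant, the shrinkage stays proportional to the weight itself, which is the behavior weight decay is meant to have.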
