Unit 6.2 – Learning Rates and Learning Rate Schedulers
Parts 1, 2 & 4: 6.2-learning-rates/
What we covered in this video lecture
In this lecture, we introduced three kinds of learning rate schedulers: step schedulers, on-plateau schedulers, and cosine decay schedulers. What they all have in common is that they decay the learning rate over time to achieve better annealing, making the loss less jittery or jumpy toward the end of training.
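As a quick reference, here is a minimal PyTorch sketch of how these three scheduler types are set up; the model and the hyperparameter values (step sizes, decay factors, and so on) are placeholders, not the settings used in the lecture:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step scheduler: multiply the learning rate by gamma every step_size epochs
step_sched = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# On-plateau scheduler: reduce the learning rate when the monitored
# metric (e.g., the validation loss) stops improving for `patience` epochs
plateau_sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10
)

# Cosine decay scheduler: anneal the learning rate along a cosine curve
# over T_max epochs
cosine_sched = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
```

In practice, you would attach only one of these to an optimizer. `StepLR` and `CosineAnnealingLR` are advanced with `scheduler.step()` once per epoch, whereas `ReduceLROnPlateau` is advanced with the monitored metric, for example `scheduler.step(val_loss)`.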
In practice, I often recommend starting without a learning rate scheduler, then adding one and verifying that the predictive performance improves. If the predictive performance becomes worse than without a scheduler, that is usually an indicator that the scheduler's hyperparameters need to be adjusted.
Additional resources if you want to learn more
If you are interested in additional analyses of learning rate scheduling, you might like the classic Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates paper. The paper discusses a phenomenon called "super-convergence," in which neural networks can be trained much faster than with standard methods while also generalizing better. Super-convergence is achieved by training with one learning rate cycle and a large maximum learning rate, which regularizes training and requires reducing other forms of regularization. The authors also propose a simplified method for estimating the optimal learning rate. The experiments demonstrate the effectiveness of super-convergence on several datasets and architectures, especially when the amount of labeled training data is limited.
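PyTorch ships an implementation of this one-cycle policy as `OneCycleLR`. Below is a minimal sketch of how it is typically used; the `max_lr` value, the step count, and the toy data are illustrative assumptions, not settings from the paper:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One-cycle policy: ramp the learning rate up to max_lr, then anneal it
# back down over the course of total_steps optimization steps
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.5, total_steps=1000
)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # toy batch
for step in range(1000):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch, not per epoch
```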