
Unit 6.2 – Learning Rates and Learning Rate Schedulers

References

Code

Parts 1, 2 & 4: 6.2-learning-rates/

What we covered in this video lecture

In this lecture, we introduced three different kinds of learning rate schedulers: step schedulers, on-plateau schedulers, and cosine decay schedulers. They all have in common that they decay the learning rate over time, which anneals the optimization and makes the loss less jittery or jumpy toward the end of training.
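For reference, all three scheduler types map onto classes in torch.optim.lr_scheduler. The sketch below is a minimal illustration with placeholder hyperparameters (not the lecture code); in practice you would attach only one scheduler to a given optimizer.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step scheduler: multiply the learning rate by gamma every step_size epochs
step_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# On-plateau scheduler: shrink the learning rate by `factor` when a monitored
# metric (for example, the validation loss) has not improved for `patience` epochs
plateau_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

# Cosine decay scheduler: anneal the learning rate along a cosine curve
# over T_max scheduler steps
cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```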

In practice, I often recommend starting without a learning rate scheduler and then adding one while making sure that the predictive performance improves over the baseline. If the predictive performance becomes worse than without a scheduler, that is usually an indicator that the scheduler's hyperparameters need to be adjusted.
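If you use PyTorch Lightning, one place to wire in the scheduler for such a comparison is configure_optimizers. The sketch below is an illustration under assumptions (a toy LightningModule and a cosine scheduler with made-up hyperparameters), not the course's exact code; in older releases the import is pytorch_lightning instead of lightning.

```python
import torch
import lightning as L  # in older releases: import pytorch_lightning as L


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 2)  # toy model

    def training_step(self, batch, batch_idx):
        features, targets = batch
        loss = torch.nn.functional.cross_entropy(self.layer(features), targets)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)
        # Baseline run: return just the optimizer. Scheduler run: also return
        # a scheduler and compare the predictive performance of both runs.
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"},
        }
```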

Additional resources if you want to learn more

If you are interested in additional analyses about learning rate scheduling, you might like the classic Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates paper. The paper discusses a phenomenon called “super-convergence” where neural networks can be trained much faster than with standard methods, leading to better generalization. Super-convergence is achieved through training with one learning rate cycle and a large maximum learning rate, which regularizes the training and requires a reduction in other forms of regularization. The authors also propose a simplified method to estimate the optimal learning rate. The experiments demonstrate the effectiveness of super-convergence on several datasets and architectures, especially when the amount of labeled training data is limited.
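The one-cycle policy from that paper is available in PyTorch as torch.optim.lr_scheduler.OneCycleLR. The snippet below is only a rough sketch with placeholder numbers, not the paper's exact setup:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One cycle: ramp the learning rate up to max_lr, then anneal it back down
# over the course of the whole run (epochs * steps_per_epoch batches).
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=10, steps_per_epoch=100
)

for epoch in range(10):
    for batch_idx in range(100):
        # ... forward pass, loss.backward(), etc. would go here ...
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch, not per epoch
```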


Quiz: 6.2 Learning Rates and Learning Rate Schedulers - Part 1

If the learning rate is too small, the loss will

Correct. Usually, the loss will stay more or less constant if the learning rate is too small.

Incorrect. Usually, the loss will stay more or less constant if the learning rate is too small.


Quiz: 6.2 Learning Rates and Learning Rate Schedulers - Part 2

Suppose we want to create a model checkpoint based on the training set loss. What would the correct ModelCheckpoint code look like in this case?

Incorrect. The code shown above monitors the accuracy, not the loss.

Incorrect. The code shown above monitors the accuracy, not the loss.

Incorrect. The code would maximize the training set loss.

Correct. The code minimizes the training set loss.

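For reference, a checkpoint callback that monitors the training loss could look roughly like the sketch below. The metric name "train_loss" is an assumption and must match whatever the LightningModule logs via self.log; in older releases the import path is pytorch_lightning.callbacks.

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

# Save the best checkpoint according to the *training* loss; mode="min"
# because a lower loss is better.
checkpoint_callback = ModelCheckpoint(
    monitor="train_loss",  # assumed name logged via self.log("train_loss", ...)
    mode="min",
    save_top_k=1,
)

trainer = L.Trainer(max_epochs=10, callbacks=[checkpoint_callback])
```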

Quiz: 6.2 Learning Rates and Learning Rate Schedulers - Part 3

The automatic learning rate finding function will always find the optimal learning rate.

Incorrect. Automatic learning rate finders use a heuristic to get a good ballpark estimate, but they are not guaranteed to return the optimal learning rate.

Correct. Automatic learning rate finders use a heuristic to get a good ballpark estimate, but they are not guaranteed to return the optimal learning rate.

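As a concrete illustration, the sketch below runs Lightning's learning rate finder on a toy model and random data. The Tuner interface shown here is from recent Lightning releases (older versions exposed the finder via the trainer instead), and the suggested value is only a starting point.

```python
import torch
import lightning as L
from lightning.pytorch.tuner import Tuner
from torch.utils.data import DataLoader, TensorDataset


class LitModel(L.LightningModule):
    def __init__(self, learning_rate=0.001):
        super().__init__()
        self.learning_rate = learning_rate  # the finder varies this attribute
        self.layer = torch.nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        features, targets = batch
        return torch.nn.functional.cross_entropy(self.layer(features), targets)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.learning_rate)


train_loader = DataLoader(
    TensorDataset(torch.randn(320, 10), torch.randint(0, 2, (320,))), batch_size=32
)

model = LitModel()
trainer = L.Trainer(max_epochs=1)
lr_finder = Tuner(trainer).lr_find(model, train_dataloaders=train_loader)
print(lr_finder.suggestion())  # heuristic estimate, not guaranteed to be optimal
```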

Quiz: 6.2 Learning Rates and Learning Rate Schedulers - Part 4

Setting step_size=5 and gamma=0.3 in the step scheduler will

Incorrect. By default, step_size refers to epochs (although it is possible to reconfigure this).

Correct. We multiply the learning rate by 0.3 every 5 epochs.

Incorrect. By default, step_size refers to epochs (although it is possible to reconfigure this).

Incorrect. We decrease, not increase, the learning rate.

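To make the effect concrete, the short sketch below (placeholder model and optimizer) prints the learning rate over 15 epochs with step_size=5 and gamma=0.3:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)

for epoch in range(15):
    print(epoch, optimizer.param_groups[0]["lr"])  # learning rate used this epoch
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()  # by default, stepped once at the end of each epoch
# Prints lr = 0.1 for epochs 0-4, 0.03 for epochs 5-9, and 0.009 for epochs
# 10-14, i.e., the learning rate is multiplied by 0.3 every 5 epochs.
```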

Quiz: 6.2 Learning Rates and Learning Rate Schedulers - Part 5

The T_max argument of CosineAnnealingLR …

Correct. Once T_max is reached, the learning rate has decayed to its lowest point and will be reset.

Correct. Once T_max is reached, the learning rate has decayed to its lowest point and will be reset.

Incorrect. T_max refers to the step number and not a specific learning rate value.

Incorrect. T_max refers to the step number and not a specific learning rate value.

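To see the role of T_max concretely, the sketch below (placeholder model and optimizer) prints the learning rate over a full cosine period: it decays to eta_min at step T_max and then climbs back up along the cosine curve.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.0)

for step in range(20):
    print(step, round(optimizer.param_groups[0]["lr"], 4))
    optimizer.step()
    scheduler.step()
# The learning rate follows a cosine curve from 0.1 down to eta_min over the
# first T_max=10 steps and then rises back toward 0.1 over the next 10 steps.
```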