Is gradient clipping done before or after gradient accumulation?

Let’s say I set grad_clip to X with a 2-norm. If I’m training on 8 GPUs (or even with accumulate_gradients > 0), will the gradients be clipped to X before or after they are accumulated?
I’m asking because the OWL-ViT paper says it makes a big difference (https://arxiv.org/pdf/2205.06230.pdf, Appendix A1.9).

Thank you!

Hey @zlenyk

The Trainer does gradient clipping right before the optimizer step, so it happens after the gradients have been accumulated. Here is the relevant code: lightning/precision_plugin.py at 1d1f6009630d01f5347a7234dad97f6c75f93af0 · Lightning-AI/lightning · GitHub

This code gets called when the training loop calls precision_plugin.optimizer_step() etc.
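To make that concrete, with automatic optimization this is just the default Trainer behavior. A minimal sketch (the numbers are placeholders): the gradients from accumulate_grad_batches micro-batches are summed first, then clipped once right before optimizer.step().

```python
from lightning.pytorch import Trainer

# Placeholder values: gradients from 8 micro-batches are summed first,
# then clipped once by 2-norm, right before optimizer.step().
trainer = Trainer(
    accumulate_grad_batches=8,
    gradient_clip_val=1.0,           # the "X" from the question
    gradient_clip_algorithm="norm",  # clip by gradient norm
)
```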

You can get the other behavior (clipping each micro-batch’s gradients before they are accumulated) by enabling manual optimization and performing the accumulation yourself: Manual Optimization — PyTorch Lightning 2.1.0dev documentation
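Here is a minimal sketch of what that could look like, assuming a toy model and loss purely for illustration. The important part is that torch.nn.utils.clip_grad_norm_ runs on each micro-batch’s gradients before they are added to the running sum, and the optimizer only steps every `accumulate` batches:

```python
import torch
from torch import nn
import lightning.pytorch as pl


class ClipBeforeAccumulation(pl.LightningModule):
    def __init__(self, clip_val=1.0, accumulate=4):
        super().__init__()
        self.automatic_optimization = False  # enable manual optimization
        self.model = nn.Linear(16, 1)        # toy model for illustration
        self.clip_val = clip_val
        self.accumulate = accumulate
        self._grad_sum = None                # running sum of clipped micro-batch gradients

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        x, y = batch
        loss = nn.functional.mse_loss(self.model(x), y)
        self.manual_backward(loss)

        # clip THIS micro-batch's gradients in isolation, before accumulating them
        torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=self.clip_val, norm_type=2.0)

        if self._grad_sum is None:
            self._grad_sum = [torch.zeros_like(p) for p in self.parameters()]
        for buf, p in zip(self._grad_sum, self.parameters()):
            if p.grad is not None:
                buf += p.grad
        opt.zero_grad()  # so the next micro-batch is clipped on its own

        if (batch_idx + 1) % self.accumulate == 0:
            # copy the accumulated (already clipped) gradients back and take the step
            for buf, p in zip(self._grad_sum, self.parameters()):
                p.grad = buf / self.accumulate
            opt.step()
            opt.zero_grad()
            self._grad_sum = None
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```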

Alternatively, for even more control, there is Lightning Fabric, where you write the training loop completely yourself.
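With Fabric you own the loop, so where the clipping call goes relative to the accumulation is entirely up to you. A toy sketch (placeholder model, data, and hyperparameters) that mirrors the Trainer default of clipping the accumulated gradient:

```python
import torch
from torch import nn
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu", devices=1)
fabric.launch()

model = nn.Linear(16, 1)                                  # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

accumulate, clip_val = 4, 1.0                             # placeholder values
for step in range(16):
    x, y = torch.randn(8, 16), torch.randn(8, 1)          # toy data standing in for a dataloader
    loss = nn.functional.mse_loss(model(x), y)
    fabric.backward(loss / accumulate)                    # gradients keep summing in .grad
    if (step + 1) % accumulate == 0:
        # clipping here acts on the accumulated gradient (the Trainer default);
        # to clip per micro-batch instead, move the clipping before the accumulation,
        # as in the manual-optimization sketch above
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_val, norm_type=2.0)
        optimizer.step()
        optimizer.zero_grad()
```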


That’s great, thank you for the answer and the documentation!
