Is gradient clipping done before or after gradients accumulation?

zlenyk · April 5, 2023, 9:49am

Let’s say I set grad_clip to X with norm 2. If I’m training on 8 GPUs (or with accumulate_gradients > 0), are the gradients clipped to X before or after they are accumulated?
I’m asking because the OWL-ViT paper says it makes a big difference (https://arxiv.org/pdf/2205.06230.pdf, Appendix A1.9).

Thank you!

awaelchli · April 5, 2023, 10:38am

Hey @zlenyk

The Trainer clips gradients right before the optimizer step, which means after the gradients have been accumulated. Here is the relevant code: lightning/precision_plugin.py at commit 1d1f6009630d01f5347a7234dad97f6c75f93af0 in Lightning-AI/lightning on GitHub.

This code gets called when the training loop calls precision_plugin.optimizer_step() etc.
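To make the order concrete, here is a minimal, framework-free sketch of "accumulate, then clip once, then step". It uses plain Python lists instead of tensors so the arithmetic is visible. The helper names (`clip_by_norm`, `accumulate_then_clip`), the averaging of micro-batch gradients, and the numbers are assumptions made for illustration; this is not Lightning's actual code.

```python
import math


def clip_by_norm(grad, max_norm):
    """Scale grad so its L2 norm is at most max_norm (no-op if already smaller)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def accumulate_then_clip(micro_batch_grads, max_norm):
    """Average the micro-batch gradients first, then clip the result once."""
    n = len(micro_batch_grads)
    dim = len(micro_batch_grads[0])
    accumulated = [sum(g[i] for g in micro_batch_grads) / n for i in range(dim)]
    return clip_by_norm(accumulated, max_norm)


# Hypothetical gradients from two micro-batches: one outlier, one small.
grads = [[30.0, 40.0], [0.3, 0.4]]
update = accumulate_then_clip(grads, max_norm=1.0)
print(update)  # the vector an optimizer step would then consume
```

Because clipping happens only once on the accumulated gradient, a single large micro-batch can dominate the direction of the final update even though its norm is capped.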

You can get the other behavior as well (clipping each gradient before it is accumulated) by enabling manual optimization and performing the accumulation yourself; see "Manual Optimization" in the PyTorch Lightning 2.1.0dev documentation.
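As a rough illustration of what "clip before accumulating" means numerically, the sketch below clips every micro-batch gradient on its own and only then averages. It is framework-free Python with hypothetical helper names (`clip_by_norm`, `clip_then_accumulate`) and made-up numbers, not Lightning API. In a real manual-optimization loop, note that `backward()` adds into `.grad`, so you would likely need to clip each micro-batch's gradient before adding it to your own buffer rather than clipping the running sum.

```python
import math


def clip_by_norm(grad, max_norm):
    """Scale grad so its L2 norm is at most max_norm (no-op if already smaller)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def clip_then_accumulate(micro_batch_grads, max_norm):
    """Clip each micro-batch gradient separately, then average the clipped ones."""
    clipped = [clip_by_norm(g, max_norm) for g in micro_batch_grads]
    n = len(clipped)
    dim = len(clipped[0])
    return [sum(g[i] for g in clipped) / n for i in range(dim)]


# Hypothetical gradients from two micro-batches: one outlier, one small.
grads = [[30.0, 40.0], [0.3, 0.4]]
update = clip_then_accumulate(grads, max_norm=1.0)
print(update)  # the averaged, per-micro-batch-clipped gradient
```

With per-micro-batch clipping, the outlier is scaled down to norm 1.0 before averaging, so no single micro-batch can contribute more than `max_norm / n` to the accumulated gradient.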

Alternatively, for even more control, there is Lightning Fabric, where you write the training loop entirely yourself.

zlenyk · April 5, 2023, 10:57am

That’s great, thank you for the answer and the documentation!
