Let’s say I set grad_clip to X with the L2 norm. If I’m training on 8 GPUs (or even with accumulate_gradients > 0), will the gradients be clipped to X before or after they are accumulated?
I’m asking because the OWL-ViT paper says it makes a big difference (https://arxiv.org/pdf/2205.06230.pdf, Appendix A1.9).
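
To make clear what I mean by the two orderings, here is a minimal sketch in plain Python. It is not taken from any framework: the helper names (`l2_norm`, `clip_by_norm`, `mean`) and the gradient values are made up for illustration, and I am assuming that accumulation means averaging the per-device (or per-micro-batch) gradients.

```python
import math

def l2_norm(g):
    # Euclidean (L2) norm of a gradient given as a flat list of floats.
    return math.sqrt(sum(x * x for x in g))

def clip_by_norm(g, max_norm):
    # Rescale g so its L2 norm is at most max_norm; leave it unchanged otherwise.
    n = l2_norm(g)
    scale = min(1.0, max_norm / n) if n > 0 else 1.0
    return [x * scale for x in g]

def mean(grads):
    # Element-wise average of several gradients (assumed accumulation rule).
    k = len(grads)
    return [sum(col) / k for col in zip(*grads)]

# Hypothetical gradients from two devices (or two micro-batches).
grads = [[3.0, 4.0], [0.3, 0.4]]
X = 1.0  # clipping threshold

# Ordering A: clip each gradient first, then accumulate.
clip_then_average = mean([clip_by_norm(g, X) for g in grads])

# Ordering B: accumulate first, then clip the combined gradient.
average_then_clip = clip_by_norm(mean(grads), X)

print(l2_norm(clip_then_average))  # about 0.75
print(l2_norm(average_then_clip))  # about 1.0
```

In this toy case the two orderings give updates of different size, so I would like to know which one actually happens.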