So in the case of DDP one should keep the learning rate at lr (the value tuned for the per-GPU batch size), but in the case of DP it should be set to lr * N, since backward is done on a single GPU, right?
And is the same true for TPU training (8 cores) as for DDP, since it's basically DDP?
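For concreteness, here is a minimal sketch of the convention I'm asking about (the model, base_lr, and N are hypothetical placeholders; base_lr is assumed to be tuned for the per-GPU batch size on a single device):

```python
import torch

model = torch.nn.Linear(10, 1)  # toy model, just for illustration

base_lr = 1e-3  # assumed: lr tuned for the per-GPU batch size on 1 GPU
N = 8           # number of GPUs (or TPU cores)

# DDP (and, if it really behaves like DDP, 8-core TPU training):
# each process optimizes its own per-GPU batch and gradients are
# averaged across processes, so the per-GPU lr is kept as-is.
optimizer_ddp = torch.optim.SGD(model.parameters(), lr=base_lr)

# DP: one process holds the full batch (N x per-GPU batch) and runs
# backward on a single device, so the lr is scaled linearly by N.
optimizer_dp = torch.optim.SGD(model.parameters(), lr=base_lr * N)
```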