I'm still a bit confused. In DDP, the backward pass runs on every device and the gradients are synced afterwards, so each device uses the batch_size assigned in the dataloader, and learning_rate should be set for batch_size, not batch_size*N. In DP, however, the backward pass is done on batch_size*N on a single device, so should we set learning_rate = learning_rate*N there?
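
To make the arithmetic in the question concrete, here is a minimal sketch in plain Python. It assumes PyTorch-style semantics: under DP the dataloader batch is split across the N devices, while under DDP each process loads its own batch of batch_size, so the gradient averaged across processes covers batch_size*N samples. The function names are hypothetical, and the linear learning-rate scaling is a common heuristic rather than a fixed rule; a framework that handles batches differently would change these numbers.

```python
def effective_batch_size(mode: str, loader_batch_size: int, n_devices: int) -> int:
    """Samples contributing to one optimizer step (assumed PyTorch-style semantics)."""
    if mode == "dp":
        # Assumption: one dataloader; its batch is scattered across the devices.
        return loader_batch_size
    if mode == "ddp":
        # Assumption: one dataloader per process; gradients are averaged across processes.
        return loader_batch_size * n_devices
    raise ValueError(f"unknown mode: {mode!r}")


def linearly_scaled_lr(base_lr: float, base_batch_size: int, effective_batch: int) -> float:
    """Linear scaling heuristic: grow the learning rate in proportion to the batch."""
    return base_lr * effective_batch / base_batch_size


if __name__ == "__main__":
    batch_size, n = 32, 4
    for mode in ("dp", "ddp"):
        eff = effective_batch_size(mode, batch_size, n)
        lr = linearly_scaled_lr(0.1, batch_size, eff)
        print(f"{mode}: effective batch = {eff}, scaled lr = {lr}")
```

Under these assumptions, DDP is the case where the effective batch grows with N, so that is where scaling learning_rate would be worth considering. Whether to scale at all is still something to validate empirically.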