I see that the documentation recommends using ‘ddp2’ for contrastive learning, but since the DistributedSampler partitions the dataset under the hood, wouldn’t that mean that, given N nodes, the examples on a node never see (N-1)/N of all potential negative examples, in any epoch and throughout the entire training?
You also suggest ‘ddp’ over ‘dp’ as a “much faster” alternative even for single-node, multi-accelerator training. Again, in a contrastive setting, wouldn’t that severely shrink the set of negative examples the average data example is ever contrasted against over the course of training?
As an example, for a single node with 4 GPUs this would mean each example can be contrasted against at most 25% of the dataset, and no matter how many epochs I train for, it is the seeding of the DistributedSampler’s partitioning of the dataset across GPUs that largely determines the fate of the learned representations.
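To make the sharding I am describing concrete, here is a minimal sketch (toy sizes, not taken from the docs) that simulates 1 node × 4 GPUs by constructing a DistributedSampler per rank. Within one epoch each rank receives a disjoint ~25% of the indices, so the in-batch negatives for a given example can only come from its own rank’s shard:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(100))          # toy dataset of 100 examples

shards = []
for rank in range(4):                               # simulate 1 node x 4 GPUs
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=True)
    sampler.set_epoch(0)                            # same epoch on every rank
    shards.append(set(iter(sampler)))               # indices this rank would load

for rank, shard in enumerate(shards):
    print(f"rank {rank}: {len(shard)} examples")    # 25 each
print("pairwise disjoint:", all(
    shards[i].isdisjoint(shards[j]) for i in range(4) for j in range(i + 1, 4)
))                                                  # True: no shared negatives within the epoch
```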
Is my understanding above correct? If not, what am I missing? And if it is, would ‘dp’ be my only resort for a distributed contrastive learning setup that actually makes sense and does not harm the algorithm?