On Contrastive Learning, ddp and dataset partitioning

So I see that in the documentation you recommend using ‘ddp2’ for contrastive learning. But with the DistributedSampler partitioning the dataset under the hood, wouldn’t that mean that, given N nodes, the examples on any one node never see (N-1)/N of all potential negative examples, in every epoch and throughout the entire training run?
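To make sure I’m reading the partitioning right, this is the toy experiment I have in mind (plain torch.utils.data.DistributedSampler in a single process, one sampler instantiated per hypothetical rank, nothing Lightning-specific; the dataset size and N are made up):

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(1_000))
num_nodes = 8  # pretend N = 8

# One sampler per "rank", same seed everywhere, as in a real distributed run.
shards = []
for rank in range(num_nodes):
    sampler = DistributedSampler(
        dataset, num_replicas=num_nodes, rank=rank, shuffle=True, seed=0
    )
    sampler.set_epoch(0)
    shards.append(set(iter(sampler)))

# Each rank ends up with a disjoint ~1/N slice of the indices for this epoch,
# so its in-batch negatives can only come from that slice.
print([len(s) / len(dataset) for s in shards])  # ~0.125 each
print(set.intersection(*shards))                # set() -> no overlap
```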

Also, you suggest ‘ddp’ instead of ‘dp’ as a “much faster” alternative even for single-node, multi-accelerator training. Again, in a contrastive setting, wouldn’t that severely shrink the set of negative examples a given data example is ever contrasted against over the course of training?
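The loss-side consequence I’m picturing, as a rough SimCLR-style sketch (the batch size, GPU count and embedding dim are made up, and I’m just slicing a tensor to stand in for what each process would hold locally):

```python
import torch
import torch.nn.functional as F

global_batch, num_gpus, dim = 256, 4, 128
z = F.normalize(torch.randn(global_batch, dim), dim=1)  # embeddings of one global batch

# 'dp': the whole batch sits on one process, so the similarity matrix spans
# all 256 examples -> 255 potential in-batch negatives per anchor.
sim_dp = z @ z.t()
print(sim_dp.shape)   # torch.Size([256, 256])

# 'ddp': each process only holds its local 256/4 = 64 examples and computes
# its loss there -> only 63 potential in-batch negatives per anchor.
z_local = z[: global_batch // num_gpus]
sim_ddp = z_local @ z_local.t()
print(sim_ddp.shape)  # torch.Size([64, 64])
```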

As an example, on a single node with 4 GPUs this would mean each example can at most be contrasted against 25% of the dataset, and no matter how many epochs I train for, it’s the seeding of the DistributedSampler’s partitioning of the dataset across GPUs that largely determines the fate of the learned representations.
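Concretely, for the 4-GPU case, this is the check behind my 25% claim: assuming the sampler’s epoch/seed is never advanced, so the partition stays whatever the initial seeding produced (I may well be wrong about whether Lightning advances it under the hood), rank 0’s coverage of the dataset never grows past a quarter no matter how many epochs pass:

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(10_000))
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True, seed=42)

seen = set()
for epoch in range(100):
    # deliberately NOT calling sampler.set_epoch(epoch), so the same
    # permutation (and hence the same 1/4 shard) is drawn every epoch
    seen.update(iter(sampler))

print(len(seen) / len(dataset))  # stays at 0.25 regardless of epoch count
```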

Am I right in my understanding above? If not, what am I missing? And if I am, would ‘dp’ be my only resort for a distributed contrastive learning setup that actually makes sense and doesn’t harm the algorithm?