Hi, I have a question about how to get the indices used by the DataLoader in multi-GPU training (DDP).
I work on my private cluster and have successfully used DDP on 4 GPUs (on a single machine). However, my data has grown larger and I don't want all of my machines to store the full dataset. My rough solution is to create a thread that copies a chunk of the data from my main machine (e.g. copy 10k images, use them for training, then delete them and copy another 10k).
My questions are:
- How can I know which indices will be sampled by the DataLoader? (See the first sketch below for what I mean.)
- Does PyTorch Lightning sample the indices once and broadcast them to the other subprocesses, or does each subprocess sample its own indices?
- I believe DDP syncs the gradients and updates at the end of each step. Given that, would it be possible to pause training (once the copied data have all been covered), copy a new chunk of data, and then continue training? My rough idea is to do something like @rank_zero_only in on_train_batch_end(), make the copy, and continue training once the copy is done (see the second sketch below).
(The important thing is to make sure I know the indices beforehand, so I copy the right samples rather than duplicates.)
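To make the first question concrete, this is roughly how I imagine inspecting the indices ahead of time: re-creating a DistributedSampler for each rank with what I assume are the same arguments (dataset length, world size, shuffle, seed, epoch) that are used during training. `DummyDataset` is just a placeholder for my real image dataset, and the seed/epoch values are guesses, not something I have confirmed Lightning uses internally.

```python
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class DummyDataset(Dataset):
    # Stand-in for my real image dataset; only __len__ matters to the sampler.
    def __init__(self, length: int):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return idx


dataset = DummyDataset(length=100_000)
world_size = 4  # my 4 GPUs
epoch = 0

for rank in range(world_size):
    # Explicit num_replicas/rank so this runs without an initialized process group.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=42
    )
    sampler.set_epoch(epoch)  # assuming the epoch is set the same way during training
    indices = list(iter(sampler))
    print(f"rank {rank}: first 5 indices {indices[:5]}, total {len(indices)}")
```

If each rank really does get a disjoint slice of one shared permutation like this, I could pre-compute exactly which files to copy for the next chunk.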
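And for the last question, here is a rough sketch of the pause-and-copy idea as a Callback. `ChunkSwapCallback` and `_copy_next_chunk` are my own placeholders (the actual rsync/delete logic isn't shown), and I'm not 100% sure the on_train_batch_end signature is identical across Lightning versions, hence the `*args`.

```python
import torch.distributed as dist
from pytorch_lightning import Callback
from pytorch_lightning.utilities import rank_zero_only


class ChunkSwapCallback(Callback):
    """Every N batches: rank 0 copies the next chunk while the other ranks wait."""

    def __init__(self, batches_per_chunk: int):
        self.batches_per_chunk = batches_per_chunk
        self.batches_seen = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, *args):
        # *args absorbs any extra positional arguments passed by other versions.
        self.batches_seen += 1
        if self.batches_seen >= self.batches_per_chunk:
            self._copy_next_chunk()  # body runs on rank 0 only
            if dist.is_available() and dist.is_initialized():
                dist.barrier()  # all ranks wait here until the copy is finished
            self.batches_seen = 0

    @rank_zero_only
    def _copy_next_chunk(self):
        # Placeholder for my own logic: copy the next 10k images from the
        # main machine and delete the previous chunk.
        print("rank 0: copying next chunk of data ...")
```

Does blocking inside on_train_batch_end like this (rank 0 copying, the rest waiting on a barrier) play nicely with DDP's gradient sync, or is there a better hook for it?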
Environment: pytorch_lightning 1.6.0 and pytorch 1.11.0.
Thank you, I appreciate any help!