Hi, I have a question about how to get the indices used by the DataLoader in multi-GPU training (DDP).
I work on my private cluster and have successfully used DDP on 4 GPUs (on a single machine). However, my data has grown larger and I don't want all of my machines to store the full dataset. My rough solution is to create a thread that copies a chunk of the data from my main machine (e.g. copy 10k images, use them for training, then delete them and copy another 10k).
My questions are:
- How can I know which indices will be sampled by the DataLoader? (See the first sketch below for what I mean.)
- Does PyTorch Lightning sample the indices once and broadcast them to the other subprocesses, or does each subprocess sample its own indices?
- I believe DDP syncs the gradients and updates at the end of each step. Given that, would it be possible to pause training (once the copied data have all been covered), copy a new chunk of data, and then continue training? My rough idea is to do something like @rank_zero_only in on_train_batch_end(), make the copy, and continue training once the copy is done (see the second sketch below).
(The important thing is to make sure I know the indices beforehand, so I copy the right samples rather than duplicates.)
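To make the first question concrete, this is roughly how I imagine inspecting the indices ahead of time: re-creating a DistributedSampler for each rank with what I assume are the same arguments (dataset length, world size, shuffle, seed, epoch) that are used during training. `DummyDataset` is just a placeholder for my real image dataset, and the seed/epoch values are guesses, not something I have confirmed Lightning uses internally.

```python
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class DummyDataset(Dataset):
    # Stand-in for my real image dataset; only __len__ matters to the sampler.
    def __init__(self, length: int):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return idx


dataset = DummyDataset(length=100_000)
world_size = 4  # my 4 GPUs
epoch = 0

for rank in range(world_size):
    # Explicit num_replicas/rank so this runs without an initialized process group.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=42
    )
    sampler.set_epoch(epoch)  # assuming the epoch is set the same way during training
    indices = list(iter(sampler))
    print(f"rank {rank}: first 5 indices {indices[:5]}, total {len(indices)}")
```

If each rank really does get a disjoint slice of one shared permutation like this, I could pre-compute exactly which files to copy for the next chunk.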
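And for the last question, here is a rough sketch of the pause-and-copy idea as a Callback. `ChunkSwapCallback` and `_copy_next_chunk` are my own placeholders (the actual rsync/delete logic isn't shown), and I'm not 100% sure the on_train_batch_end signature is identical across Lightning versions, hence the `*args`.

```python
import torch.distributed as dist
from pytorch_lightning import Callback
from pytorch_lightning.utilities import rank_zero_only


class ChunkSwapCallback(Callback):
    """Every N batches: rank 0 copies the next chunk while the other ranks wait."""

    def __init__(self, batches_per_chunk: int):
        self.batches_per_chunk = batches_per_chunk
        self.batches_seen = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, *args):
        # *args absorbs any extra positional arguments passed by other versions.
        self.batches_seen += 1
        if self.batches_seen >= self.batches_per_chunk:
            self._copy_next_chunk()  # body runs on rank 0 only
            if dist.is_available() and dist.is_initialized():
                dist.barrier()  # all ranks wait here until the copy is finished
            self.batches_seen = 0

    @rank_zero_only
    def _copy_next_chunk(self):
        # Placeholder for my own logic: copy the next 10k images from the
        # main machine and delete the previous chunk.
        print("rank 0: copying next chunk of data ...")
```

Does blocking inside on_train_batch_end like this (rank 0 copying, the rest waiting on a barrier) play nicely with DDP's gradient sync, or is there a better hook for it?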
Environment: pytorch_lightning 1.6.0 and pytorch 1.11.0.
Thank you, I appreciate any help!