Question regarding prepare_data and setup in DataModule

ankitvad · February 21, 2021, 1:49pm

The LightningDataModule documentation mentions:

    def prepare_data(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage=None):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        ...

I’m a bit confused here. I don’t have any data to download; I load multiple torch.utils.data.Dataset objects for train/val/test, which are later passed to the dataloader functions (train_dataloader, etc.). I assume I should load all of these datasets in setup() rather than in prepare_data(). Does this mean the whole dataset will be loaded on every GPU I’m planning to use for training?

Also, how does Lightning pass batches for multi-GPU training? I thought the dataset stays on the CPU and Lightning takes a batch from the dataloader and passes it to a GPU, so within an epoch the same batch won’t be sent to multiple GPUs. But if I load my dataset in prepare_data, will multiple copies be made? And if there are no transformations or calculations to perform, would everything happen on the CPU once?
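
For reference on the batching question: under DDP, each process typically builds its own dataloader, and a distributed sampler (torch's DistributedSampler) hands each rank a different slice of the dataset indices, so ranks do not train on the same samples within an epoch. Below is a minimal pure-Python sketch of that strided split, ignoring shuffling. `shard_indices` is a hypothetical helper written for illustration, not a Lightning or PyTorch function.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would see in an epoch (illustrative, no shuffling)."""
    if dataset_len == 0:
        return []
    # Pad by wrapping around so every rank gets the same number of samples.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    repeats = math.ceil(total / dataset_len)
    indices = (list(range(dataset_len)) * repeats)[:total]
    # Strided split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


shards = [shard_indices(10, world_size=4, rank=r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Each rank holds a reference to the full dataset object in its own process, but only iterates over its own shard. The wrap-around padding means a few samples can appear on two ranks when the dataset size is not divisible by the number of processes.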

ankitvad · February 21, 2021, 5:08pm

I think this is key:

WARNING
prepare_data is called from a single GPU. Do not use it to assign state (self.x = y).

So, am I correct in assuming that if we run on a single machine with multiple GPUs, self.x = y would work in prepare_data as well?
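
A toy simulation of why the warning would still apply on a single machine, assuming DDP starts one process per GPU and calls prepare_data in only one of them. `ToyDataModule` and `simulate_ddp` are hypothetical stand-ins, not Lightning classes; each object here plays the role of the DataModule copy living in one process.

```python
class ToyDataModule:
    """Stand-in for a LightningDataModule; not the real class."""

    def prepare_data(self):
        # State assigned here exists only in the process that ran this hook.
        self.x = "set in prepare_data"

    def setup(self, stage=None):
        # Runs in every process, so every copy gets this attribute.
        self.train_data = list(range(8))


def simulate_ddp(world_size):
    # One independent object per rank, standing in for one process per GPU.
    modules = [ToyDataModule() for _ in range(world_size)]
    modules[0].prepare_data()  # assumed: only rank 0 runs this hook
    for dm in modules:
        dm.setup()  # every rank runs this hook
    return modules


modules = simulate_ddp(world_size=2)
for rank, dm in enumerate(modules):
    print(rank, hasattr(dm, "x"), hasattr(dm, "train_data"))
```

Under that assumption, rank 1 never sees self.x even though both GPUs sit in the same machine, which is why assignments belong in setup.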
