Following up on "How many times does the pl.LightningDataModule.setup() runs in DDP" (Issue #11642, Lightning-AI/lightning, GitHub).
I'm currently updating code that runs fine in Lightning 1.1.5 to 1.9.4 to take advantage of some of the new features, and I'm running into issues with the new data parallelization logic for the data module under DDP in multi-node, multi-GPU training. These issues seem to make my original workflow impossible.
Originally, our 1.1.5 code instantiated the pl.LightningDataModule and called its setup method before creating the trainer, in order to build the vocab (and cache files). We then passed the vocab into the model constructor (molformer/train_pubchem_light.py at main · IBM/molformer · GitHub), created the trainer, and fit ran perfectly.
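For reference, here is a minimal sketch of how a setup method can be made safe to call more than once, so that a manual call before the trainer and a later call during fit do not repeat the expensive work. This is a plain-Python stand-in, not a real LightningDataModule; the class name, the `setup_calls` counter, and the toy whitespace tokenizer are all hypothetical and only there for illustration.

```python
class VocabDataModuleSketch:
    """Plain-Python stand-in for a data module (hypothetical, not Lightning)."""

    def __init__(self, corpus):
        self.corpus = corpus
        self.vocab = None
        self.setup_calls = 0  # counts how often the expensive path actually ran

    def setup(self, stage=None):
        # Skip the expensive work if an earlier call already built the vocab.
        if self.vocab is not None:
            return
        self.setup_calls += 1
        # Toy tokenizer: split on whitespace and assign ids in sorted order.
        tokens = sorted({tok for line in self.corpus for tok in line.split()})
        self.vocab = {tok: idx for idx, tok in enumerate(tokens)}


dm = VocabDataModuleSketch(["C C O", "N C"])
dm.setup()                  # manual call, before the model is constructed
vocab_size = len(dm.vocab)  # value that would be passed to the model
dm.setup("fit")             # a later call (e.g. from a trainer) is a no-op
```

The guard only avoids repeated work inside a single process. Under DDP each process has its own copy of the object, so each process would still run the expensive path once.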
When we moved to 1.9.1 (and now 1.9.4), we see the setup method called repeatedly (Num Nodes + Total Num GPUs times), and all the GPUs seem to be seeing the same data. We tried moving the processing into the prepare_data function and tried setting prepare_data_per_node to both True and False, and we see the same issue.
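To make "all the GPUs seeing the same data" concrete, here is a small sketch of what a per-rank split is expected to look like. It mimics the strided split that torch's DistributedSampler applies to a map-style dataset (without shuffling); it is not Lightning's or torch's actual code, and the `shard_indices` helper is a hypothetical name. Logging the indices each rank reads and comparing them against something like this is one way to check whether the data is really being sharded.

```python
import math


def shard_indices(dataset_len, rank, world_size):
    """Return the indices one rank would read under a strided split."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by wrapping around to the start so every rank gets the same count.
    indices = [i % dataset_len for i in range(total)]
    return indices[rank:total:world_size]


world_size = 4  # e.g. 2 nodes x 2 GPUs
shards = [shard_indices(10, rank, world_size) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

If every rank logs the same indices instead of disjoint slices like these, the sampler is probably not being applied. One case where that can happen is an iterable-style dataset, which a DistributedSampler cannot shard, so the split has to be done inside the dataset itself.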