DDP and pl.LightningDataModule parallelization issues

Following up on How many times does the pl.LightningDataModule.setup() runs in DDP · Issue #11642 · Lightning-AI/lightning · GitHub

I’m currently updating code that runs fine in Lightning 1.1.5 to 1.9.4 to take advantage of some of the new features, and I’m running into issues with the new data parallelization logic for the data module under DDP with multi-node, multi-GPU training, which seems to make my original workflow impossible.

Originally, our 1.1.5 code instantiated the pl.LightningDataModule and called its setup() method before creating the trainer in order to build the vocab (and cache files). We then passed the vocab into the model constructor (molformer/train_pubchem_light.py at main · IBM/molformer · GitHub), created the trainer, and trainer.fit() worked perfectly.

When we moved to 1.9.1 (and now 1.9.4), we see the setup() method called repeatedly (number of nodes + total number of GPUs times), and all the GPUs seem to be seeing the same data. We tried moving the preprocessing into prepare_data() and tried setting prepare_data_per_node to both True and False, but we see the same issue either way.

Hey @bbelgodere
1.1.5 is about two years old, which is an eternity. Yes, I remember that in old versions of Lightning, setup() was incorrectly called only once, before spawning the processes. In summary:

setup(): gets called once on every process/GPU (per fit/validate/test call).
prepare_data(): gets called on every local-rank-0 process, in other words once per node (with the default prepare_data_per_node=True).

This is the expected behavior, and it has worked this way since roughly 1.5.0 (I don’t remember the exact version).
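
If you want to verify it, a minimal datamodule that just logs which process runs each hook makes the call pattern visible (a quick sketch, not taken from your code; the environment variables are the ones set by Lightning's DDP launcher):

```python
import os

import pytorch_lightning as pl


class HookDemoDataModule(pl.LightningDataModule):
    """Prints which process runs each hook under DDP."""

    def prepare_data(self):
        # By default this runs on the local-rank-0 process of every node.
        print(f"prepare_data: node_rank={os.environ.get('NODE_RANK', '0')}")

    def setup(self, stage=None):
        # This runs on every process/GPU, for each fit/validate/test call.
        print(
            f"setup(stage={stage}): "
            f"node_rank={os.environ.get('NODE_RANK', '0')}, "
            f"local_rank={os.environ.get('LOCAL_RANK', '0')}"
        )
```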

Based on your description, you can split your workflow like so:
On rank 0 only (i.e., in prepare_data()), preprocess the data and save the cache files to disk. Then, in setup(), load the vocab from the cache so that every process has it available. WDYT?
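
Something along these lines (a minimal sketch; build_vocab() and the JSON cache path are placeholders for your actual preprocessing, not the molformer code):

```python
import json
import os

import pytorch_lightning as pl


def build_vocab(data_dir: str) -> dict:
    # Placeholder for the real preprocessing; returns a tiny dummy vocab.
    return {"<pad>": 0, "<unk>": 1}


class CachedVocabDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "data/", cache_path: str = "data/vocab.json"):
        super().__init__()
        self.data_dir = data_dir
        self.cache_path = cache_path
        self.vocab = None

    def prepare_data(self):
        # Expensive, write-to-disk work goes here. With the default
        # prepare_data_per_node=True this runs on local rank 0 of each node;
        # set self.prepare_data_per_node = False in __init__ to run it only
        # on global rank 0 (e.g. when all nodes share a filesystem).
        if not os.path.exists(self.cache_path):
            with open(self.cache_path, "w") as f:
                json.dump(build_vocab(self.data_dir), f)

    def setup(self, stage=None):
        # Runs on every process/GPU: only load the cached artifact here so
        # each rank ends up with its own copy of the vocab.
        with open(self.cache_path) as f:
            self.vocab = json.load(f)
```

The important bit is that prepare_data() should not assign state you expect other processes to see (it only runs on some ranks), while setup() runs everywhere and is the right place to set attributes like self.vocab.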