I’m working in an environment with regular HDDs that are shared among many users. I/O performance is too poor to read and parse the data on the fly, so I have to load my data into memory.
I have a single node with 4 GPUs (node resources are not shared; the underlying storage is). When training in DDP mode, each process loads the entire dataset into memory. This works for my current dataset but won’t for larger ones, and I’d like to avoid it anyway, since each process only uses a subset of the data and the rest is redundant. The dataset preparation is done directly in the LightningModule (without explicitly using a LightningDataModule).
From my understanding, the following should solve the problem:
- Disable the automatic addition of the DistributedSampler in the Trainer by passing replace_sampler_ddp=False.
- Pass the process's rank to the Dataset and have it load only the corresponding shard (see the sketch after this list).
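To make the second point concrete, this is a rough sketch of what I have in mind. ShardedInMemoryDataset and its _load helper are placeholder names of mine, not Lightning APIs:

```python
import torch
from torch.utils.data import Dataset


class ShardedInMemoryDataset(Dataset):
    """Loads only the shard of samples assigned to this process into memory."""

    def __init__(self, sample_paths, rank, world_size):
        # each process keeps a disjoint, interleaved subset of the files:
        # rank 0 gets indices 0, world_size, 2*world_size, ...,
        # rank 1 gets 1, world_size + 1, ..., and so on
        shard_paths = sample_paths[rank::world_size]
        self.samples = [self._load(path) for path in shard_paths]

    def _load(self, path):
        # placeholder for the real parsing logic; in my case this is the
        # expensive read that I want to do exactly once per sample
        return torch.load(path)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```

With the data already split per rank like this, the DistributedSampler would shard it a second time, which is why I want replace_sampler_ddp=False.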
So my questions are:
- How do I achieve the above, i.e. get the process's rank inside the LightningModule and pass it on to my dataset object?
- Is there a better way to do this using existing pytorch-lightning components?
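For context, this is roughly how I imagine wiring it up inside the LightningModule, using the ShardedInMemoryDataset sketched above. I fall back to torch.distributed for the rank/world-size lookup because I'm not sure which Lightning attributes (self.global_rank? self.trainer.world_size?) are safe to use at this point; MyModel, data_paths, the batch size, and the Trainer flags are just placeholders for my setup:

```python
import pytorch_lightning as pl
import torch.distributed as dist
from torch.utils.data import DataLoader


class MyModel(pl.LightningModule):
    def __init__(self, data_paths):
        super().__init__()
        self.data_paths = data_paths
        # ... model definition, training_step, configure_optimizers, etc. elided

    def train_dataloader(self):
        # is there a Lightning-native way to get these here, or is raw
        # torch.distributed the intended approach?
        if dist.is_available() and dist.is_initialized():
            rank, world_size = dist.get_rank(), dist.get_world_size()
        else:
            rank, world_size = 0, 1  # single-process fallback

        dataset = ShardedInMemoryDataset(self.data_paths, rank, world_size)
        # no DistributedSampler here, since the data is already sharded per rank
        return DataLoader(dataset, batch_size=32, shuffle=True)


# assuming a Lightning version where these Trainer flags exist
# (they have been renamed in newer releases)
trainer = pl.Trainer(gpus=4, accelerator="ddp", replace_sampler_ddp=False)
```

In particular, I'm not sure whether the process group / rank information is already set up by the time train_dataloader is called.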