Implement DDP sampling strategy which requires rank?

I’m training on multiple GPUs on a dataset of sequences whose lengths vary widely. To deal with this efficiently, I sort the sequences into buckets by length: on GPU 0, I train on the shortest sequences with a big batch size; on GPU 1, medium sequences with a medium batch size, etc. This trade-off gives the best average GPU memory and compute utilization. In plain PyTorch, I allocate the batches to each GPU via DDP and mp.spawn, using the rank in the batch sampler.

This is complicated, and I’m trying to move this to Lightning. I’m wondering how to proceed, as it doesn’t seem I can have access to the rank of the worker in my dataloader. Is this possible? I’m also open to using Fabric instead.

Hey @patrickmineault
Thank you for the interest!

You can access the process rank from the trainer (in your case that would simply be the id of the process associated with each GPU, 0 and 1) and pass it to the dataset/dataloader, for example like so:

trainer = Trainer(accelerator="cuda", devices=2)

# Let's create my dataset
dataset = MyDataset(..., rank=trainer.global_rank)
dataloader = DataLoader(dataset, ...)

# Train (`model` is your LightningModule)
trainer.fit(model, dataloader)

In Fabric, it would be similar:

fabric = Fabric(accelerator="cuda", devices=2)

# Let's create my dataset
dataset = MyDataset(..., rank=fabric.global_rank)
dataloader = DataLoader(dataset, ...)

# What follows is the training loop (implemented by you using Fabric)

This should answer the question “how to access the rank in the dataset”.

Fabric or Trainer?
Use Fabric if you prefer to write the training loop yourself, or if you have a pre-existing PyTorch code base and want to refactor as little as possible. Use Trainer if you want a fully managed, battle-tested loop and the convenient abstractions of LightningModule etc.

An important thing to look out for (you may already be well aware of it): Running vastly different sequence lengths on each GPU can lead to load imbalance and make the DDP approach less efficient, as there will be times when one GPU sits idle, waiting for the other to finish. Your approach to counter this seems to be to use larger batch sizes on the GPU that processes shorter sequences. This means you will need a more complicated way of partitioning your samples (the torch DistributedSampler won’t be useful here). For DDP training, it is important that each GPU/process runs the same number of batches, i.e., you will run into synchronization issues if your dataloader returns fewer batches on one GPU than on the other. You will have to ensure yourself that your sampling yields a dataset/dataloader that returns an equal number of batches on each process. Let me know if you need more details on this.
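To make the "equal number of batches on each process" point concrete, here is a minimal sketch of one way to do it; it is not the only way and not something Lightning provides. The class name `RankBucketBatchSampler` and its arguments (`lengths`, `batch_sizes`, `seed`) are hypothetical. The assumption is that bucket boundaries are derived from the per-rank batch sizes rather than from fixed length thresholds: each rank gets a contiguous slice of the length-sorted indices, sized so that every rank ends up with exactly the same batch count.

```python
import random
from typing import Iterator, List, Sequence


class RankBucketBatchSampler:
    """Hypothetical batch sampler: rank r draws batches from the r-th length bucket.

    Rank 0 gets the shortest sequences, the last rank the longest.
    Bucket sizes are proportional to the per-rank batch sizes, so every
    rank yields exactly the same number of batches per epoch.
    """

    def __init__(
        self,
        lengths: Sequence[int],      # sequence length of every sample in the dataset
        rank: int,                   # e.g. trainer.global_rank or fabric.global_rank
        batch_sizes: Sequence[int],  # one batch size per rank, e.g. [64, 16]
        seed: int = 0,
    ) -> None:
        world_size = len(batch_sizes)
        if not 0 <= rank < world_size:
            raise ValueError(f"rank must be in [0, {world_size}), got {rank}")

        self.batch_size = batch_sizes[rank]
        # One "round" consumes sum(batch_sizes) samples across all ranks.
        # The leftover samples (the longest ones) are dropped.
        self.num_batches = len(lengths) // sum(batch_sizes)

        # Sort sample indices by length, shortest first.
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])

        # Contiguous slice of the sorted order that belongs to this rank.
        start = self.num_batches * sum(batch_sizes[:rank])
        stop = start + self.num_batches * self.batch_size
        self.indices = order[start:stop]

        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        """Change the shuffling order between epochs."""
        self.epoch = epoch

    def __len__(self) -> int:
        return self.num_batches

    def __iter__(self) -> Iterator[List[int]]:
        # Shuffle within the bucket only; the bucket membership stays fixed.
        shuffled = list(self.indices)
        random.Random(self.seed + self.epoch).shuffle(shuffled)
        for b in range(self.num_batches):
            yield shuffled[b * self.batch_size:(b + 1) * self.batch_size]
```

An instance can be passed to `DataLoader(dataset, batch_sampler=sampler, collate_fn=...)`, since `batch_sampler` accepts any iterable of index lists. Two caveats with this sketch: the dropped leftovers are always the longest sequences, and you will likely need to turn off Lightning's automatic sampler replacement (in recent versions, `use_distributed_sampler=False`; check the docs for your version).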
