Hello,
I’m training a model using an IterableDataset on multiple GPUs. If the number of batches is uneven across workers, training hangs. After some research it looks like in vanilla PyTorch one can use the Join context manager to solve this, but it isn’t supported yet in Lightning (the issue is discussed here but is still open).
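For reference, this is roughly the vanilla PyTorch pattern I mean, as a minimal sketch (the dataset, model, and batch counts below are just placeholders, not my actual setup):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import IterableDataset, DataLoader


class UnevenStream(IterableDataset):
    """Toy stream that deliberately yields a different number of samples per rank."""
    def __iter__(self):
        for _ in range(100 + dist.get_rank() * 10):
            yield torch.randn(16)


def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(16, 1).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loader = DataLoader(UnevenStream(), batch_size=8)

    # Join lets ranks that exhaust their iterator early "shadow" the remaining
    # collective ops, so the ranks that still have batches don't block forever.
    with Join([model]):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(batch.cuda(rank)).sum()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=<num_gpus> this_script.py
```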
I would think Lightning + multi-GPU + IterableDataset is quite a common setup for large datasets that need to be streamed, so I’m a bit surprised that I couldn’t find any workarounds or suggestions for dealing with this issue. Interested to hear how people are handling it, or whether I’m missing something.