I'm running my model on a cluster with multiple GPUs (2). My problem is that I would like to access all the datapoints in the batch. Because I'm using more than one GPU, my batch is divided between the two devices for parallelisation purposes, which means that when I access the data in the batch during eval/training, I'm getting just half the batch.
How could I obtain the complete batch and the predictions of the model that are divided among the different devices/GPUs? I tried to set the flag accelerator="ddp" but the problem persists.
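For reference, my Trainer setup looks roughly like this (a sketch only; the exact argument names follow the older Lightning API where accelerator="ddp" selected the DDP strategy, and the rest of my script is omitted):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        gpus=2,             # two GPUs, so DDP splits every batch in half
        accelerator="ddp",  # distributed data parallel across the GPUs
    )
    # trainer.fit(model)   # each GPU process then only sees its shard of the batch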
If you need just the predictions, you can use self.all_gather within your LightningModule.
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def some_hook(...):
        preds = ...
        # all_gather is a collective call, so run it on every rank
        preds = self.all_gather(preds)
        if self.trainer.is_global_zero:
            # do whatever with the gathered predictions
            ...
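If you also need the complete batch (not just the predictions), the same pattern can be applied to the inputs/targets inside one of the step hooks. A minimal sketch, assuming a standard (x, y) batch, a validation_step hook, and that forward produces the predictions (names here are illustrative, not from your post):

    from pytorch_lightning import LightningModule

    class LitModel(LightningModule):
        def validation_step(self, batch, batch_idx):
            x, y = batch                        # each GPU process only sees its shard here
            preds = self(x)
            # all_gather stacks the tensors from every GPU along a new leading dim
            all_preds = self.all_gather(preds)
            all_targets = self.all_gather(y)
            if self.trainer.is_global_zero:
                # fold the world-size dim back into the batch dim to recover the full batch
                all_preds = all_preds.reshape(-1, *all_preds.shape[2:])
                all_targets = all_targets.reshape(-1, *all_targets.shape[2:])
                # ... do whatever with the complete predictions/targets here ...
            return preds

Note that the all_gather call itself has to run on every rank; only the rank-zero-specific processing should sit behind the is_global_zero check.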
Also, we have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only soon.
I'm trying your suggested solution in the LightningModule's forward, and I'm afraid it is still returning just half the data, i.e. only the data on one of the two GPUs I'm using. Any idea what's going on and how I can solve it?
Thanks!
P.S.: I will move the question to GitHub Discussions.