Combining losses and predictions across multiple GPUs

Hi, I’m facing an issue gathering all the losses and predictions in a multi-GPU scenario. I’m using PyTorch Lightning 2.0.4 and DeepSpeed, with the distributed strategy deepspeed_stage_2.

I’m adding my skeleton code here for reference.

    def __init__(self):
        super().__init__()
        # Buffers to collect per-batch outputs across the epoch
        self.batch_train_preds = []
        self.batch_train_losses = []


    def training_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        train_labels = batch['labels']

        # Model step
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=train_labels)

        train_preds = torch.argmax(outputs.logits, dim=-1)

        return {'loss': outputs.loss,
                'train_preds': train_preds}

    def on_train_batch_end(self, outputs, batch, batch_idx):
        # aggregate metrics or outputs at batch level
        train_batch_loss = outputs["loss"].mean()
        # outputs["train_preds"] is already a single tensor for this batch
        train_batch_preds = outputs["train_preds"]

        self.batch_train_preds.append(train_batch_preds)
        self.batch_train_losses.append(train_batch_loss.item())

        return {'train_batch_loss': train_batch_loss,
                'train_batch_preds': train_batch_preds
                }

    def on_train_epoch_end(self) -> None:
        # Aggregate epoch level training metrics

        epoch_train_preds = torch.cat(self.batch_train_preds)
        epoch_train_loss = np.mean(self.batch_train_losses)

        self.logger.log_metrics({"epoch_train_loss": epoch_train_loss})

In the above code block, I’m trying to combine all the predictions into a single tensor at the end of the epoch by tracking each batch in a global list (defined in __init__). But in multi-GPU training I hit an error during concatenation, because each GPU handles its batches on its own device and I couldn’t combine the results into a single global list.

What should I be doing in on_train_batch_end, on_train_epoch_end, or training_step in order to combine the results across all the GPUs?

Hey

Your observations are correct: each process/GPU runs its own on_train_epoch_end (and every other hook) with only the results from that GPU. But there is a method you can use to gather all results onto all GPUs, like this:

    all_losses = self.all_gather(self.batch_train_losses)
    print(all_losses[0].shape)

In your case the losses are a Python list, so you will get a list of tensors as output, where each tensor has a leading dimension N equal to the number of GPUs. Now you can, for example, cat or stack your losses and do whatever you want with them:

    mean_loss = torch.cat(all_losses).mean()

You can do a similar thing with your predictions, for example:

    all_predictions = self.all_gather(self.batch_train_preds)
    all_predictions = torch.cat(all_predictions).view(-1)

The code above is brain-compiled, apologies for any typos, but I hope it gets the idea across.
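
To make the placement concrete, here is a rough sketch (also not tested) of how your on_train_epoch_end could look with all_gather. The rank-zero guard and the buffer reset at the end are optional extras I’ve added, not something all_gather itself requires:

    def on_train_epoch_end(self) -> None:
        # Gather the per-batch buffers from every process; each list entry
        # comes back with a leading dimension of size world_size.
        all_losses = self.all_gather(self.batch_train_losses)
        all_preds = self.all_gather(self.batch_train_preds)

        epoch_train_loss = torch.stack([l.float().mean() for l in all_losses]).mean()
        epoch_train_preds = torch.cat([p.reshape(-1) for p in all_preds])  # use for epoch-level metrics

        # Log once from the main process to avoid duplicate metric entries.
        if self.trainer.is_global_zero:
            self.logger.log_metrics({"epoch_train_loss": epoch_train_loss.item()})

        # Reset the buffers for the next epoch.
        self.batch_train_preds.clear()
        self.batch_train_losses.clear()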

References:
self.all_gather docs

Hi, thank you very much for your support, this makes sense. I have come across the torchmetrics library. Would you please guide me on using it in this multi-GPU setting? Can I actually just use torchmetrics and it will take care of gathering the values across the GPUs and calculating the final metric? If yes, what code configuration is required to enable multi-GPU metric gathering with torchmetrics?

Yes, that is exactly what TorchMetrics was designed for. It works out of the box without you having to configure anything. I could write down a full guide here, but I think it is better if you start with the TorchMetrics docs, and for using it with Lightning you can read this section with examples: TorchMetrics in PyTorch Lightning — PyTorch-Metrics 1.0.0 documentation
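
For a concrete starting point, here is a minimal sketch assuming a classification-style setup like yours; the metric choice (MulticlassAccuracy), the class name MyLitModule, and num_classes=10 are placeholders you would replace with whatever fits your task:

    import torch
    import pytorch_lightning as pl
    import torchmetrics


    class MyLitModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # self.model = ...  # your Hugging Face model, as in your original post
            # Metrics registered as attributes are moved to the right device by
            # Lightning, and TorchMetrics handles the cross-GPU synchronization.
            # Add ignore_index=-100 if your labels use that padding convention.
            self.train_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=10)

        def training_step(self, batch, batch_idx):
            outputs = self.model(input_ids=batch['input_ids'],
                                 attention_mask=batch['attention_mask'],
                                 labels=batch['labels'])
            preds = torch.argmax(outputs.logits, dim=-1)

            # Accumulate this GPU's batch; the epoch-level value is reduced
            # across all processes when it is computed.
            self.train_acc.update(preds, batch['labels'])
            self.log("train_acc", self.train_acc, on_step=False, on_epoch=True)

            # Plain tensors such as the loss still need sync_dist=True to be
            # averaged across GPUs when logged on_epoch.
            self.log("train_loss", outputs.loss, on_step=True, on_epoch=True, sync_dist=True)

            return outputs.loss

The metric object owns its own state, so you no longer need the manual lists from your original module for anything TorchMetrics already computes.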
If you then have any further questions, I’m happy to help.