Hi, I’m facing an issue gathering all the losses and predictions in a multi-GPU scenario. I’m using PyTorch Lightning 2.0.4 with DeepSpeed (distributed strategy: deepspeed_stage_2).
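For context, the trainer is set up roughly like this (the GPU count below is just a placeholder):

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,  # placeholder; I train on multiple GPUs
    strategy="deepspeed_stage_2",
)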
I’m adding my skeleton code here for reference.
def __init__(self):
    super().__init__()
    # per-rank buffers for batch-level outputs
    self.batch_train_preds = []
    self.batch_train_losses = []
def training_step(self, batch, batch_idx):
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    train_labels = batch['labels']  # assuming the batch provides labels under 'labels'
    # Model step
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=train_labels)
    train_preds = torch.argmax(outputs.logits, dim=-1)
    return {'loss': outputs.loss,
            'train_preds': train_preds}
def on_train_batch_end(self, outputs, batch, batch_idx):
    # aggregate metrics/outputs at the batch level (per rank)
    train_batch_loss = outputs["loss"].mean()
    train_batch_preds = torch.cat(outputs["train_preds"])
    self.batch_train_preds.append(train_batch_preds)
    self.batch_train_losses.append(train_batch_loss.item())
    return {'train_batch_loss': train_batch_loss,
            'train_batch_preds': train_batch_preds}
def on_train_epoch_end(self) -> None:
    # aggregate epoch-level training metrics (currently per rank only)
    epoch_train_preds = torch.cat(self.batch_train_preds)
    epoch_train_loss = np.mean(self.batch_train_losses)
    self.logger.log_metrics({"epoch_train_loss": epoch_train_loss})
In the code above, I’m trying to combine all the predictions into a single tensor at the end of the epoch by tracking each batch in a global list (defined in __init__). But in multi-GPU training I hit an error during concatenation: each GPU processes its own batches on its own device, so I couldn’t combine the results into a single global list.
What should I be doing in training_step, on_train_batch_end, or on_train_epoch_end to combine the results across all the GPUs?
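Is something along these lines the right direction? This is just a sketch of what I think I need, using self.all_gather (I haven’t verified it behaves correctly under deepspeed_stage_2, and it assumes every rank ends up with prediction tensors of the same shape):

def on_train_epoch_end(self) -> None:
    # per-rank aggregates
    local_preds = torch.cat(self.batch_train_preds)
    local_loss = torch.tensor(self.batch_train_losses, device=self.device).mean()

    # self.all_gather adds a leading world_size dimension to the gathered tensors
    all_preds = self.all_gather(local_preds)          # [world_size, ...]
    all_loss = self.all_gather(local_loss).mean()

    # log once from rank 0 to avoid duplicate entries
    if self.trainer.is_global_zero:
        self.logger.log_metrics({"epoch_train_loss": all_loss.item()})

    # reset the per-rank buffers for the next epoch
    self.batch_train_preds.clear()
    self.batch_train_losses.clear()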