As noted in the picture above, validation and test logging can use sync_dist=True.
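For reference, here is a minimal sketch of what that pattern looks like in a validation step; the metric name val_loss is a placeholder, and reusing the same input helper as in training is an assumption, not something from the original code:

def validation_step(self, batch, batch_idx):
    inputs = self.train_inputs(batch)  # assumption: same input helper as in training_step
    loss, _ = self(**inputs)
    # sync_dist=True reduces (averages) the logged value across all processes before logging
    self.log('val_loss', loss, on_epoch=True, prog_bar=True, sync_dist=True)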
I wonder whether there is a way to synchronize during training as well. For example, in the code below, which I run on 8 GPUs, I want train_loss and train_acc to be averaged across the 8 GPUs:
def training_step(self, batch, batch_idx):
    inputs = self.train_inputs(batch)
    loss, logits = self(**inputs)
    # exclude positions whose label is 5 from the accuracy computation
    mask = (batch['labels'] != 5).long()
    ntotal = mask.sum()
    ncorrect = ((logits.argmax(dim=-1) == batch['labels']).long() * mask).sum()
    acc = ncorrect / ntotal
    self.log('train_loss', loss, on_step=True, prog_bar=True, sync_dist=True)
    self.log("train_acc", acc, on_step=True, prog_bar=True, sync_dist=True)
    return loss