I just created a simple model for the CIFAR-10 image classification task using DDP and 4 GPUs. After one epoch, the training time is logged separately for each of the 4 GPUs as follows:
2023-01-16 13:45:05,056 - Training epoch ends, training time: 189.20574188232422
2023-01-16 13:45:05,056 - Training epoch ends, training time: 189.20563769340515
2023-01-16 13:45:05,057 - Training epoch ends, training time: 189.20640778541565
2023-01-16 13:45:05,060 - Training epoch ends, training time: 189.20853972434998
I have 2 questions:
How can I get all values of 4 GPUs and compute the average time in the CustomCallBack class?
Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.
How can I get all values of 4 GPUs and compute the average time in the CustomCallBack class?
You could do trainer.strategy.reduce(self.time, reduce_op="mean").
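For example, here is a minimal sketch of a timing callback that does this (the class and attribute names are placeholders, not your actual CustomCallBack):

import time

import torch
from pytorch_lightning.callbacks import Callback


class EpochTimeCallback(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        epoch_time = time.time() - self._epoch_start
        # reduce expects a tensor on the local device; plain Python floats are not synced
        epoch_time = torch.tensor(epoch_time, device=pl_module.device)
        avg_time = trainer.strategy.reduce(epoch_time, reduce_op="mean")
        if trainer.global_rank == 0:
            print(f"average epoch time across GPUs: {avg_time.item():.2f} s")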
Note: this adds additional communication overhead. In the end it will just average four nearly identical numbers, so is it worth it? Your call, but an alternative would be to simply print the values only on rank 0:
if trainer.global_rank == 0:
    print(self.time)  # only the main process reports its (local) time
Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.
To get an estimate, you can simply multiply the max_epochs value by your measured per-epoch time:
print("estimated time to finish training:", self.time * trainer.max_epochs)
or log it with prog_bar=True to send it to the progress bar. That might be a good solution for you.
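If it helps, here is a rough sketch of that idea in a callback (again with made-up names; estimated_total_time is just an example metric key):

import time

from pytorch_lightning.callbacks import Callback


class TrainingTimeEstimate(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        epoch_time = time.time() - self._epoch_start
        # prog_bar=True surfaces the logged value in the progress bar
        pl_module.log("estimated_total_time", epoch_time * trainer.max_epochs, prog_bar=True)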
So I don’t know if they are different or not.
To know whether they are different, you would have to print them and compare the output in the terminal. But they won't be significantly different, because the training steps in DDP are synchronized.
SkafteNicki stated that it is not recommended to use sync_dist=True for logging in the validation step. Since our new code includes torchmetrics, would this be a problem?
Yes, he is right in the sense that logging frequently (on every step) with sync_dist=True is not recommended, because it adds an expensive synchronization that slows down your loop. You should therefore only add it when it is actually needed and the slowdown is acceptable. In most cases it is not necessary to sync on every step. For example, to compute your average time, you could also just log once at the end of the epoch instead of on every step, as in the sketch below.
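As a sketch of that last point (names are again made up), a callback could time the whole epoch and synchronize only once when it ends:

import time

from pytorch_lightning.callbacks import Callback


class SyncedEpochTime(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        epoch_time = time.time() - self._epoch_start
        # a single synchronization per epoch is cheap compared to sync_dist=True on every step
        pl_module.log("epoch_time", epoch_time, sync_dist=True)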