How to keep track of training time in DDP setting?

Hi, currently I am trying to keep track of the training time by implementing the following CallBack:

import time

import pytorch_lightning as pl

class CustomCallback(pl.Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        # record the wall-clock time at the start of each epoch
        self.time = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        train_time = time.time() - self.time
        # callbacks have no self.log; log through the LightningModule instead
        pl_module.log("training_time", train_time)
        self.time = 0

I created a simple model for the CIFAR-10 image classification task and trained it with DDP on 4 GPUs. After one epoch, the training time is logged separately for each of the 4 GPUs as follows:

2023-01-16 13:45:05,056 - Training epoch ends, training time: 189.20574188232422
2023-01-16 13:45:05,056 - Training epoch ends, training time: 189.20563769340515
2023-01-16 13:45:05,057 - Training epoch ends, training time: 189.20640778541565
2023-01-16 13:45:05,060 - Training epoch ends, training time: 189.20853972434998

I have 2 questions:

  1. How can I get the values from all 4 GPUs and compute the average time in the CustomCallback class?
  2. Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.

Thank you.

Hi @WKuro

How can I get the values from all 4 GPUs and compute the average time in the CustomCallback class?

You could do trainer.strategy.reduce(self.time, reduce_op="mean").
Note: this adds extra communication overhead. In the end it will average numbers that are already almost identical, so whether that is worth it is your call. An alternative would be to simply print the value on rank 0 only:

if trainer.global_rank == 0:
    print(train_time)  # the per-epoch time measured above
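
If you do want the averaged number, here is a minimal sketch of the reduce approach inside the callback (assuming a recent Lightning version where trainer.strategy is available, and converting the value to a tensor so the mean reduction actually runs across ranks):

import time

import torch
import pytorch_lightning as pl

class CustomCallback(pl.Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self.time = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        train_time = time.time() - self.time
        # average the per-rank epoch time over all processes (mean reduction)
        avg_time = trainer.strategy.reduce(
            torch.tensor(train_time, device=pl_module.device), reduce_op="mean"
        )
        if trainer.global_rank == 0:
            print(f"average epoch time across GPUs: {avg_time.item():.2f} s")

Note that the reduce call is a collective, so it must run on all ranks; only the print is restricted to rank 0.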

Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.

To get an estimate, you can just multiply the max_epochs value by your measured per-epoch time:

print("estimated time to finish training:", self.time * trainer.max_epochs)

Hope this helps


Thank you for the answer! Just a question: does printing only on rank 0 via

trainer.global_rank

give me the value after it has been reduced with the mean across the 4 GPUs? Currently I use this:

self.log("training_time", train_time, sync_dist=True)

So I don’t know if they are different or not.

Yes, self.log computes the reduced value, but it will send it to the logger, not the terminal standard output. You could add

self.log("training_time", train_time, sync_dist=True, prog_bar=True)

to send it to the progress bar. That might be a good solution for you.

So I don’t know if they are different or not.

To know whether they are different, you would have to call print() and see in the terminal. But they won’t be significantly different, because the training steps in DDP are synchronized.
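
For example, here is a quick sketch of the callback hook that prints the raw per-rank value next to the value that gets reduced and sent to the logger (assuming the same CustomCallback as above, with import time at the top):

def on_train_epoch_end(self, trainer, pl_module):
    train_time = time.time() - self.time
    # raw, per-process value: each of the 4 ranks prints its own number
    print(f"rank {trainer.global_rank}: epoch time {train_time:.4f} s")
    # reduced (mean) value: this is what ends up in the logger / progress bar
    pl_module.log("training_time", train_time, sync_dist=True, prog_bar=True)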


Hi,

Thank you for the previous answers. I would like to ask another question. I saw this discussion on GitHub: Proper way to log things when using DDP · Discussion #6501 · Lightning-AI/lightning · GitHub.

SkafteNicki stated that it is not recommended to use sync_dist=True for logging in the validation step. Since our new code includes torchmetrics, would this be a problem?

def validation_step(self, batch, idx):
    ....
    self.log("val_acc", val_acc, sync_dist=True)

Yes, he is right in the sense that logging frequently (on every step) with sync_dist=True is not recommended, because it adds an expensive synchronization that slows down your loop. You should therefore only add it when necessary and when the slowdown is acceptable; in most cases it is not necessary to sync on every step. For example, to compute your average time, you could just log at the end of the epoch instead of on every step.
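
With torchmetrics specifically, you usually don't need sync_dist=True at all: if you log the metric object itself, torchmetrics takes care of the cross-GPU synchronization when the epoch-level value is computed. Here is a rough sketch (assuming a recent torchmetrics version, a 10-class task like CIFAR-10, and that self(x) returns logits; only the validation piece is shown):

import torchmetrics
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=10)

    def validation_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        # update the metric state; no cross-GPU sync happens here
        self.val_acc.update(logits, y)
        # logging the metric object lets torchmetrics reduce it correctly
        # across ranks when the epoch-level value is computed
        self.log("val_acc", self.val_acc, on_step=False, on_epoch=True)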


Hey! Can you share the GitHub repo for this? I am trying to do the same 4-GPU setup with DDP. It's the ddp_train function in this file in my repo, and I would like to implement this timing functionality too! I am not sure whether I have set up DDP correctly: DNN-Training-Acceleration/train.py at main · ChaosAdmStudent/DNN-Training-Acceleration · GitHub