I just created a simple model for the CIFAR-10 image classification task using DDP and 4 GPUs. After one epoch, the training time is logged separately for each of the 4 GPUs as follows:
2023-01-16 13:45:05,056 - Training epoch ends, training time: 189.20574188232422
2023-01-16 13:45:05,056 - Training epoch ends, training time: 189.20563769340515
2023-01-16 13:45:05,057 - Training epoch ends, training time: 189.20640778541565
2023-01-16 13:45:05,060 - Training epoch ends, training time: 189.20853972434998
I have 2 questions:
How can I get all values of 4 GPUs and compute the average time in the CustomCallBack class?
Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.
How can I get all values of 4 GPUs and compute the average time in the CustomCallBack class?
You could do trainer.strategy.reduce(self.time, reduce_op="mean").
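For example, here is a minimal sketch of a timing callback that does this (the class and attribute names are placeholders, not your actual CustomCallBack):

import time

import torch
from pytorch_lightning.callbacks import Callback


class EpochTimeCallback(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        epoch_time = time.time() - self._epoch_start
        # reduce expects a tensor on the local device; plain Python floats are not synced
        epoch_time = torch.tensor(epoch_time, device=pl_module.device)
        avg_time = trainer.strategy.reduce(epoch_time, reduce_op="mean")
        if trainer.global_rank == 0:
            print(f"average epoch time across GPUs: {avg_time.item():.2f} s")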
Note: this adds additional communication overhead. In the end it will just average four nearly identical numbers, so is it worth it? Your call, but an alternative would be to simply print the values only on rank 0:
if trainer.global_rank == 0:
    print(self.time)  # only the main process reports its (local) time
Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.
To get an estimate, you can simply multiply the max_epochs value by your measured per-epoch time:
print("estimated time to finish training:", self.time * trainer.max_epochs)
or log it with prog_bar=True to send it to the progress bar. That might be a good solution for you.
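If it helps, here is a rough sketch of that idea in a callback (again with made-up names; estimated_total_time is just an example metric key):

import time

from pytorch_lightning.callbacks import Callback


class TrainingTimeEstimate(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        epoch_time = time.time() - self._epoch_start
        # prog_bar=True surfaces the logged value in the progress bar
        pl_module.log("estimated_total_time", epoch_time * trainer.max_epochs, prog_bar=True)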
So I don’t know if they are different or not.
To know whether they are different, you would have to print them and compare the output in the terminal. But they won't be significantly different, because the training steps in DDP are synchronized.
SkafteNicki stated that it is not recommended to use sync_dist=True for logging in the validation step. Since our new code includes torchmetrics, would this be a problem?
Yes, he is right in the sense that logging frequently (on every step) with sync_dist=True is not recommended, because it adds an expensive synchronization that slows down your loop. You should therefore only add it when it is actually needed and the slowdown is acceptable. In most cases it is not necessary to sync on every step. For example, to compute your average time, you could also just log once at the end of the epoch instead of on every step, as in the sketch below.
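As a sketch of that last point (names are again made up), a callback could time the whole epoch and synchronize only once when it ends:

import time

from pytorch_lightning.callbacks import Callback


class SyncedEpochTime(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        epoch_time = time.time() - self._epoch_start
        # a single synchronization per epoch is cheap compared to sync_dist=True on every step
        pl_module.log("epoch_time", epoch_time, sync_dist=True)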