How to keep track of training time in DDP setting?

Hi @WKuro

How can I get all values of 4 GPUs and compute the average time in the CustomCallBack class?

You could do trainer.strategy.reduce(self.time, reduce_op="mean").
Note: this adds extra communication overhead. In the end, it will average numbers that are already very similar across ranks, so whether that's worth it is your call. An alternative is to simply print the value only on rank 0:

if trainer.global_rank == 0:
    print(self.time)
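For illustration, here is a minimal sketch of the timing logic such a callback could use. The class name, hook names, and sleep duration are assumptions for the example; in a real Lightning callback you would subclass `lightning.pytorch.Callback` and receive `trainer` and `pl_module` as hook arguments (omitted here so the snippet is self-contained):

```python
import time

class TimingSketch:
    """Sketch of per-epoch wall-clock timing. A real Lightning
    CustomCallback would subclass Callback and use the same hooks."""

    def on_train_epoch_start(self):
        # Record the epoch start time on this rank.
        self._start = time.monotonic()

    def on_train_epoch_end(self):
        # Per-rank epoch duration in seconds. If you want the mean
        # across ranks, this is the value you would pass to
        # trainer.strategy.reduce(self.time, reduce_op="mean").
        self.time = time.monotonic() - self._start

cb = TimingSketch()
cb.on_train_epoch_start()
time.sleep(0.01)  # stand-in for one training epoch
cb.on_train_epoch_end()
print(f"epoch took {cb.time:.3f}s")
```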

Is it possible to keep track of the training time given the max_epochs value? For example, the training time after 50 epochs.

To get an estimate, you can simply multiply the max_epochs value by your measured per-epoch time:

# assumes self.time holds the measured time of a single epoch
print("estimated time to finish training:", self.time * trainer.max_epochs)
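To make the arithmetic concrete, a quick worked example with hypothetical numbers (a 42-second epoch and max_epochs=50, standing in for `self.time` and `trainer.max_epochs`):

```python
epoch_time = 42.0   # measured seconds per epoch (hypothetical)
max_epochs = 50     # trainer.max_epochs

estimate = epoch_time * max_epochs
print("estimated time to finish training:", estimate, "seconds")
print("that is about", estimate / 60, "minutes")
```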

Hope this helps
