Correct approach to calculate metrics in DDP setting

Abhisek_Maiti · April 4, 2022, 7:48am

In the case of DDP:

Should the metrics be calculated in validation_step, or in validation_step_end after gathering the output tensors returned by validation_step?
- If the metrics are calculated in validation_step, would it be correct to take the mean of the corresponding metrics in validation_step_end, considering that the batch partitions for each device can be uneven?
- Does calling all_gather on the output tensors inside validation_step_end add an extra dimension before the batch dimension? For example, if my original batch tensor has shape N x C x H x W and 2 GPUs are in use, will the tensor after all_gather have shape 2 x M x C x H x W (where 2M = N)? What happens if the batch size (N) is an odd number?
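
To make the uneven-partition concern in the first bullet concrete, here is a minimal sketch in plain Python. It involves no DDP or Lightning calls, and the shard sizes and correct-prediction counts are made-up numbers for illustration only. It compares averaging per-device accuracies against computing accuracy over the pooled samples.

```python
# Hypothetical per-device results: (correct predictions, samples on that device).
# A batch of N = 5 split across 2 devices gives uneven shards of 3 and 2.
shards = [(3, 3), (1, 2)]

# Mean of per-device accuracies: every device counts equally,
# regardless of how many samples it processed.
per_device_acc = [correct / n for correct, n in shards]
mean_of_means = sum(per_device_acc) / len(per_device_acc)

# Accuracy over the pooled samples: every sample counts equally.
total_correct = sum(correct for correct, _ in shards)
total_n = sum(n for _, n in shards)
pooled_acc = total_correct / total_n

print(f"mean of per-device accuracies: {mean_of_means:.2f}")
print(f"accuracy over pooled samples:  {pooled_acc:.2f}")
```

With these made-up numbers the two values differ, which suggests that a plain mean over devices only matches the pooled metric when the shards are equal in size, or when the per-device values are weighted by their sample counts.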

goku · April 4, 2022, 2:06pm
