In the case of DDP:
- Should the metrics be calculated in `validation_step`, or should they be calculated at `validation_step_end` after gathering the output tensors returned by `validation_step`?
- If the metrics are calculated in `validation_step`, would it be correct to take the mean of the corresponding metrics in `validation_step_end`, considering that the batch partitions for each device can be uneven?
- Does calling `all_gather` on the output tensors inside `validation_step_end` add an extra dimension before the batch dimension? For example, if my original batch tensor is of shape `N x C x H x W` and 2 GPUs are in use, will the tensor after `all_gather` be of shape `2 x M x C x H x W` (where `2M = N`)? What happens if the batch size (`N`) is an odd number?
- If the metrics are calculated in