Ran into something similar (but for the mAP metric) and solved it by keeping the metric on the CPU and providing a custom dist_sync_fn
that moves the tensor to the CUDA device, performs the synchronization, and then moves the result back to the CPU:
import typing as T

import torch
from torchmetrics.utilities.distributed import gather_all_tensors

def all_gather_on_cuda(tensor: torch.Tensor, *args: T.Any, **kwargs: T.Any) -> T.List[torch.Tensor]:
    # Gather on the CUDA device, then move the gathered tensors back to the original (CPU) device.
    original_device = tensor.device
    return [
        _tensor.to(original_device)
        for _tensor in gather_all_tensors(tensor.to("cuda"), *args, **kwargs)
    ]
metric.dist_sync_fn = all_gather_on_cuda  # alternatively, pass this as a keyword argument to the metric's constructor, as sketched below
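
For completeness, the constructor route looks roughly like this (a minimal sketch assuming torchmetrics' MeanAveragePrecision; substitute whichever metric you are actually using):

from torchmetrics.detection import MeanAveragePrecision

# Keep the metric's state on the CPU; only the cross-process synchronization
# round-trips through CUDA via the custom dist_sync_fn defined above.
metric = MeanAveragePrecision(dist_sync_fn=all_gather_on_cuda)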