torch.cuda.OutOfMemoryError: CUDA out of memory with mixed precision

subhashnerella · February 24, 2024, 5:05pm

I am trying to run the Stable Diffusion code on three A100 GPUs using DDP. My first run used the float32 datatype and completed without any issues. I then switched to fp16 mixed precision by passing precision: 16-mixed to the trainer. That run was interrupted 6 batches into the second epoch with this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB (GPU 0; 79.15 GiB total capacity; 66.54 GiB already allocated; 3.43 GiB free; 74.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
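
The hint at the end of the message can be checked with quick arithmetic on the reported figures. The sketch below does that, then shows one way to set the allocator option the message mentions. The value 512 is an arbitrary illustration, not a recommendation from this thread, and the variable has to be set before PyTorch makes its first CUDA allocation.

```python
import os

# Figures copied from the error message above (GiB).
requested = 5.00
allocated = 66.54
reserved = 74.69

# Memory held by PyTorch's caching allocator but not backing any live tensor.
cached_unused = round(reserved - allocated, 2)
print(f"reserved but unallocated: {cached_unused} GiB")

# If the cache holds more than the failed request, fragmentation may be
# involved: enough memory in total, but possibly not in one contiguous block.
fragmentation_suspected = cached_unused > requested
print(f"fragmentation suspected: {fragmentation_suspected}")

# Assumption: 512 MiB is an illustrative value only. Set this at the very
# top of the training script, before any CUDA allocation happens.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```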

lightning version: 2.2.0.post0
torch: 2.0.1+cu117
I am using 6 workers

What could have caused this and how can I fix this?

awaelchli · February 24, 2024, 5:22pm

The objective of automatic mixed precision training is not to save memory but to speed up computation by running it in a lower-precision format. In fact, 16-mixed precision should lead to a slight increase in memory usage because of the extra copy of the parameters that needs to be kept. This is likely the reason for your crash: your training was already peaking close to the memory limit before you switched to mixed precision.
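
A rough back-of-the-envelope sketch of the point above, counting only per-parameter state and ignoring activations. The byte counts assume an Adam-style optimizer with two fp32 states per parameter and that, under 16-mixed, the fp32 weights are kept and a half-precision copy is added on top. The 1-billion parameter count is a hypothetical figure, not the size of the model in this thread.

```python
def param_state_gib(n_params: int, mixed: bool) -> float:
    """Estimate memory for weights, gradients and Adam states, in GiB."""
    # fp32 weights (4 B) + fp32 gradients (4 B) + two fp32 Adam states (8 B)
    bytes_per_param = 4 + 4 + 8
    if mixed:
        # Assumed extra half-precision copy of the weights (2 B per parameter).
        bytes_per_param += 2
    return n_params * bytes_per_param / 2**30


n_params = 1_000_000_000  # hypothetical model size
fp32 = param_state_gib(n_params, mixed=False)
amp = param_state_gib(n_params, mixed=True)
print(f"float32:  {fp32:.1f} GiB")
print(f"16-mixed: {amp:.1f} GiB (+{amp - fp32:.1f} GiB)")
```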

subhashnerella · February 24, 2024, 5:38pm

Thank you for your reply, it makes sense. I will reduce the batch size and try running again. Out of curiosity: it failed in the second epoch, so why did it not fail during the first?

qianruntong · May 9, 2024, 11:56am

Hi Awaelchli, thanks for your explanation. However, when I trained a GCN, using 16-mixed did not speed up training, but it did lower CUDA memory usage to about 30% of what float32 needed. This seems to be the opposite of what you describe. Do you have any idea why?
