I am trying to run the stable diffusion code on three A100 GPUs using DDP. In the first run I used the float32 datatype and training completed without any issues. I then switched to fp16 mixed precision by passing precision: 16-mixed to the trainer. That run got interrupted six batches into the second epoch with the error below.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB (GPU 0; 79.15 GiB total capacity; 66.54 GiB already allocated; 3.43 GiB free; 74.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
lightning version: 2.2.0.post0
torch: 2.0.1+cu117
I am using 6 dataloader workers.
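For reference, this is roughly how the Trainer is configured; the LightningModule and DataModule are the same ones from the float32 run and are omitted here:

```python
import lightning as L

# Only the precision argument changed compared to the first run,
# where it was the default "32-true".
trainer = L.Trainer(
    accelerator="gpu",
    devices=3,            # three A100s
    strategy="ddp",
    precision="16-mixed",
    max_epochs=10,        # placeholder value
)
# trainer.fit(model, datamodule=datamodule)  # same model/datamodule as the float32 run
```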
What could have caused this and how can I fix this?
The objective of automatic mixed precision training is not to save memory, but to speed up computation by running it in a lower-precision format. In fact, 16-mixed precision should lead to a slight increase in memory usage, because an extra copy of the parameters has to be kept. This is likely the reason for your crash: your training was already peaking close to the memory limit before you switched to mixed precision.
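If you want to see the difference directly, a small callback along these lines (just a sketch, the class name is made up) prints the peak allocated memory per epoch on rank 0, so you can compare the 32-true and 16-mixed runs:

```python
import torch
from lightning.pytorch.callbacks import Callback


class PeakMemoryLogger(Callback):
    """Print the peak CUDA memory allocated during each training epoch."""

    def on_train_epoch_start(self, trainer, pl_module):
        # Reset the counter so each epoch reports its own peak.
        torch.cuda.reset_peak_memory_stats(pl_module.device)

    def on_train_epoch_end(self, trainer, pl_module):
        peak_gib = torch.cuda.max_memory_allocated(pl_module.device) / 2**30
        if trainer.is_global_zero:
            print(f"epoch {trainer.current_epoch}: peak allocated {peak_gib:.2f} GiB")
```

Pass it via Trainer(callbacks=[PeakMemoryLogger()]).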
Thank you for your reply, that makes sense. I will reduce the batch size and try running again. Out of curiosity: if memory usage is higher with mixed precision, why did the run fail in the second epoch rather than somewhere in the first?
Hi Awaelchli, thanks for your explanation. However, in my experience training a GCN, 16-mixed did not speed up training, but it did lower CUDA memory usage to about 30% of the float32 run. That seems to be the opposite of what you describe. Do you have any idea why?