I am trying to run stable diffusion training code on three A100 GPUs using DDP. In the first run, I used the float32 datatype and training completed without any issues. I then enabled fp16 mixed precision by passing precision: 16-mixed to the Trainer. This run got interrupted 6 batches into the second epoch with:
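For reference, the Trainer is configured roughly like this (my LightningModule and data setup are omitted; only the arguments shown here changed between the two runs):

```python
import lightning.pytorch as pl

# DDP across the three A100s with fp16 mixed precision.
# In the first (successful) run, precision was left at the
# default "32-true"; everything else was identical.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=3,
    strategy="ddp",
    precision="16-mixed",
)
# trainer.fit(model, datamodule=dm)  # model/dm are my own modules
```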
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB (GPU 0; 79.15 GiB total capacity; 66.54 GiB already allocated; 3.43 GiB free; 74.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
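Since reserved (74.69 GiB) is well above allocated (66.54 GiB), fragmentation looks plausible. I have not yet tried the max_split_size_mb suggestion from the error message; if I understand the docs correctly, it would be set before the first CUDA allocation, something like this (the 512 value is just a first guess, not a recommendation):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before torch initializes CUDA,
# i.e. at the very top of the training script or exported in the shell
# before launching. The 512 MiB split size is an arbitrary starting point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```

Is this the right first thing to try here, or is the fragmentation hint a red herring in the mixed-precision case?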
lightning version: 2.2.0.post0
torch: 2.0.1+cu117
I am using 6 dataloader workers.
What could have caused this and how can I fix this?