Error with DDP when updating pytorch-lightning from 1.6.5 to 2.0.9

Hello all, I am updating pytorch-lightning from 1.6.5 to 2.0.9, and I ran into the following two errors:

  • I use the default ddp strategy in my trainer (on 8 NVIDIA L4 GPUs with 24 GB of memory each). With the same code, after upgrading to version 2.0.9, I get the following error:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with strategy=DDPStrategy(find_unused_parameters=True).
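
For reference, my understanding of the two options the error message points at is roughly the following (a minimal sketch; the accelerator/devices arguments are placeholders, not my full trainer setup):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Option 1: the string shorthand suggested by the error message
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp_find_unused_parameters_true",
)

# Option 2: the equivalent explicit strategy object
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy=DDPStrategy(find_unused_parameters=True),
)
```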

  • Meanwhile, when I change the strategy to strategy='ddp_find_unused_parameters_true', I find that I can only run the code with a very small batch size. Previously, a batch size of 2048 worked well and used about 20% of GPU memory, but with the new version 2.0.9 I can ONLY use a batch size of 64; otherwise I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 164.74 GiB (GPU 5; 21.96 GiB total capacity; 16.63 GiB already allocated; 4.34 GiB free; 17.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
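
For completeness, my reading of the allocator hint at the end of that message is that it would be set roughly like this before training starts (the 128 MB value below is only an illustrative example, not something from my current setup):

```python
import os

# Allocator hint from the OOM message: cap the block split size to reduce
# fragmentation. Must be set before the first CUDA allocation
# (or exported as an environment variable in the launching shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```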

Many thanks