Error with DDP when updating pytorch-lightning from 1.6.5 to 2.0.9

Hello all, I am updating pytorch-lightning from 1.6.5 to 2.0.9, and I ran into the following two errors:

  • I use the default ddp strategy in my trainer (on 8 NVIDIA L4 GPUs with 24 GB of memory each). With the same code, after upgrading to version 2.0.9, I get the following error:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with strategy=DDPStrategy(find_unused_parameters=True).
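
For reference, my understanding of the two options the error message points at is roughly the following (a minimal sketch; the accelerator/devices arguments are placeholders, not my full trainer setup):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Option 1: the string shorthand suggested by the error message
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp_find_unused_parameters_true",
)

# Option 2: the equivalent explicit strategy object
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy=DDPStrategy(find_unused_parameters=True),
)
```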

  • Meanwhile, when I change the strategy to strategy='ddp_find_unused_parameters_true', I find that I can only run the code with a very small batch size. Previously, a batch size of 2048 worked well and used about 20% of GPU memory, but with the new version 2.0.9 I can ONLY use a batch size of 64; otherwise I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 164.74 GiB (GPU 5; 21.96 GiB total capacity; 16.63 GiB already allocated; 4.34 GiB free; 17.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
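
For completeness, my reading of the allocator hint at the end of that message is that it would be set roughly like this before training starts (the 128 MB value below is only an illustrative example, not something from my current setup):

```python
import os

# Allocator hint from the OOM message: cap the block split size to reduce
# fragmentation. Must be set before the first CUDA allocation
# (or exported as an environment variable in the launching shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```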

Many thanks