Correct usage of DDP and find_unused_parameters

Fred_Derf · September 16, 2022, 6:36pm

Hi there,

I usually use the DDP strategy with ‘find_unused_parameters=False’, because I’m sure that all of my model’s parameters are used in computing the loss.

Now I’m training a neural network that is composed of several modules (take two as an example).
In this case, the second module takes part of its input from the first module.

What I’m trying to do is:

Train the first module for x epochs, while keeping the second module frozen.
Then freeze the first module and train the second one for x epochs.
Start again.

Each module has its own optimizer/scheduler.
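
To pin down the schedule above, here is a small sketch of just the bookkeeping. The names `active_module` and `epochs_per_phase` are made up for illustration (the latter stands for the "x" above); the actual freezing and optimizer stepping are left out.

```python
def active_module(epoch: int, epochs_per_phase: int) -> int:
    """Return 0 while the first module trains, 1 while the second trains."""
    if epochs_per_phase <= 0:
        raise ValueError("epochs_per_phase must be positive")
    # Integer division gives the phase index; phases alternate 0, 1, 0, 1, ...
    return (epoch // epochs_per_phase) % 2


# With x = 3: epochs 0-2 train module 0, epochs 3-5 train module 1, then repeat.
schedule = [active_module(epoch, 3) for epoch in range(8)]
```

In each epoch, only the optimizer/scheduler of the module returned by `active_module` would be stepped.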

The problem is that with DDP, this gives me the error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.

One solution is of course to set “find_unused_parameters” to True, but this slows down training a lot.

I have tried to set “requires_grad=False” for all parameters of the second module, and also to set “module.training=False”, but this does not seem to help.

Do you have any suggestions on the best way to proceed?
Can you explain in more detail what happens when I set “find_unused_parameters” to False/True?

albudria · June 9, 2023, 7:04pm

Posting just to say that I’m currently facing the same problem, and tried the same things as the OP to no avail…

awaelchli · June 10, 2023, 12:35pm

Hi

One solution is of course to set “find_unused_parameters” to True, but this slows down training a lot.

There is no other way. This is a fundamental limitation of DDP when wrapping a single model (here a LightningModule). DDP (PyTorch) was designed to be fast under the assumption that every parameter of the model it wraps takes part in the forward pass and receives a gradient in the backward pass. In Lightning, there is always one model, the LightningModule, and only that one can be wrapped.
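
To make the mechanism concrete, here is a toy model of the bookkeeping. It is not PyTorch's actual reducer, and `ToyReducer` and `run_two_iterations` are hypothetical names. The idea it mimics: DDP waits for a gradient from every parameter it registered, and a new forward pass fails if the previous reduction never finished. With find_unused_parameters=True, the parameters that did not contribute to the output are found after each forward pass and marked as ready up front, which is the extra per-iteration work that slows training down.

```python
class ToyReducer:
    """Toy stand-in for DDP's gradient bookkeeping (illustration only)."""

    def __init__(self, param_names, find_unused_parameters=False):
        self.params = set(param_names)
        self.find_unused_parameters = find_unused_parameters
        self.pending = set()  # parameters whose gradients are still awaited

    def forward(self, used_params):
        # A new iteration may only start once the previous reduction finished.
        if self.pending:
            raise RuntimeError(
                "Expected to have finished reduction in the prior iteration; "
                f"still waiting for {sorted(self.pending)}"
            )
        used = set(used_params)
        self.pending = set(self.params)
        if self.find_unused_parameters:
            # Extra work every iteration: only wait for parameters that were used.
            self.pending &= used
        return used

    def backward(self, used):
        # Each used parameter produces a gradient and is marked as ready.
        self.pending -= used


def run_two_iterations(find_unused_parameters):
    reducer = ToyReducer(["module1.weight", "module2.weight"], find_unused_parameters)
    for _ in range(2):
        # Only the first module contributes to the loss; the second is frozen.
        used = reducer.forward(["module1.weight"])
        reducer.backward(used)
    return "finished"
```

In this toy, `run_two_iterations(True)` completes, while `run_two_iterations(False)` raises on the second iteration because `module2.weight` never reported a gradient.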

If you have two or more models that don’t share parameters and that you optimize independently, then you will have to set that flag to True. There is no other option.
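
For reference, a minimal sketch of setting the flag on the Trainer, assuming the Lightning 2.x import paths of the `lightning` package; accelerator and device settings are omitted here and would need to match your setup.

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# The keyword argument is forwarded to torch.nn.parallel.DistributedDataParallel.
trainer = Trainer(strategy=DDPStrategy(find_unused_parameters=True))

# Lightning 2.x also accepts a string shorthand for the same setting:
# trainer = Trainer(strategy="ddp_find_unused_parameters_true")
```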

In Lightning Fabric, it is a bit different. There you can use as many independent modules as you want and wrap each one of them if you want, like this:

from lightning import Fabric

fabric = Fabric()
# model1/model2 are separate nn.Modules, each with its own optimizer
model1, opt1 = fabric.setup(model1, optimizer1)
model2, opt2 = fabric.setup(model2, optimizer2)
...

and now each one is a DDP-wrapped model. You will be able to run this with find_unused_parameters=False (which is the default).
