Correct usage of DDP and find_unused_parameters

Hi there,

I usually use the DDP strategy with ‘find_unused_parameters=False’, because I know that all of my model’s parameters are used in computing the loss.

Now, I’m training a neural network that is composed of several modules (take two as an example).
In this case, the second module takes part of its input from the output of the first module.

What I’m trying to do is:

  1. Train the first module for x epochs, while keeping the second module frozen.
  2. Then freeze the first module and train the second one for x epochs.
  3. Start again.

Each module has its own optimizer/scheduler.
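For reference, the alternating schedule described above can be sketched in plain PyTorch as follows. This is only a single-process illustration without DDP or Lightning; the two `nn.Linear` modules, the random data, the helper names `set_trainable` and `active_index`, and the value of `epochs_per_phase` are all placeholders I made up, not the actual model.

```python
import torch
from torch import nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)


def active_index(epoch: int, epochs_per_phase: int) -> int:
    """Index (0 or 1) of the module that is trained in this epoch."""
    return (epoch // epochs_per_phase) % 2


torch.manual_seed(0)
first = nn.Linear(4, 8)    # hypothetical first module
second = nn.Linear(8, 1)   # hypothetical second module, fed by the first
modules = [first, second]
# One optimizer per module, as in the setup described above.
optimizers = [torch.optim.SGD(m.parameters(), lr=0.1) for m in modules]

inputs = torch.randn(16, 4)
targets = torch.randn(16, 1)
epochs_per_phase = 2       # the "x" from the list above
history = []

for epoch in range(8):
    idx = active_index(epoch, epochs_per_phase)
    # Only the active module keeps requires_grad=True.
    for i, m in enumerate(modules):
        set_trainable(m, i == idx)

    optimizers[idx].zero_grad()
    loss = nn.functional.mse_loss(second(first(inputs)), targets)
    loss.backward()
    optimizers[idx].step()  # only the active module's optimizer steps
    history.append(idx)

print(history)
```

Here the loss is always computed through both modules, and only the optimizer of the active module takes a step.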

The problem is that with DDP, this gives me the following error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
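A common way to see which parameters such a message refers to is to run one forward/backward pass in a single process (no DDP) and list the parameters whose `.grad` is still `None`. Below is a minimal sketch, assuming a hypothetical `TwoPart` model whose second part is deliberately skipped in `forward`; the class and the helper `params_without_grad` are illustrative names, not part of PyTorch or Lightning.

```python
import torch
from torch import nn


def params_without_grad(model: nn.Module) -> list:
    """Names of parameters that received no gradient in the last backward pass."""
    return [name for name, p in model.named_parameters() if p.grad is None]


class TwoPart(nn.Module):
    """Hypothetical model: `second` is registered but never used in forward."""

    def __init__(self) -> None:
        super().__init__()
        self.first = nn.Linear(4, 8)
        self.second = nn.Linear(8, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.first(x)  # `second` does not contribute to the output


model = TwoPart()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

unused = params_without_grad(model)
print(unused)  # ['second.weight', 'second.bias']
```

Any name that shows up in this list is a parameter that did not take part in producing the loss for that iteration.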

One solution is of course to set “find_unused_parameters” to True, but this slows down training a lot.

I have tried to set “requires_grad=False” for all parameters of the second module, and also to set “”, but this does not seem to help.

Do you have any suggestions on the best way to proceed?
Can you explain in more detail what happens when I set “find_unused_parameters” to False/True?

Posting just to say that I’m currently facing the same problem, and tried the same things as OP, to no avail…


One solution is of course to set “find_unused_parameters” to True, but this slows down training a lot.

There is no other way. This is a fundamental limitation of DDP when wrapping a single model (here a LightningModule). DDP (PyTorch) is designed to be fast when every parameter of the model it wraps takes part in both the forward and the backward pass. In Lightning, there is always one model, the LightningModule, and only that one can be wrapped.

If you have two or more models that don’t share parameters and that you optimize independently, then you will have to set that flag to True. There is no other option.

In Lightning Fabric, it is a bit different. There you can use as many independent modules as you want and wrap each one of them if you want, like this:

from lightning import Fabric

fabric = Fabric()
# model1/model2 and optimizer1/optimizer2 are your own modules and their optimizers
model1, opt1 = fabric.setup(model1, optimizer1)
model2, opt2 = fabric.setup(model2, optimizer2)

and now each one is a DDP-wrapped model. You will be able to run this with find_unused_parameters=False (which is the default).