Correct usage of DDP and find_unused_parameters

Hi there,

Usually I use the DDP strategy with find_unused_parameters=False, because I'm sure that all of my model's parameters take part in computing the loss.

Now I'm training a neural network composed of multiple modules (take two as an example).
In this case, the second module takes part of its input from the output of the first module.

What I’m trying to do is:

  1. Train the first module for x epochs, while keeping the second module frozen.
  2. Then freeze the first module and train the second one for x epochs.
  3. Repeat from step 1.

Each module has its own optimizer/scheduler.
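
To make this concrete, here is a minimal sketch of what I mean (module names, layer sizes, and optimizers are just placeholders, and I assume the process group has already been initialized, e.g. via torchrun):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder two-module network: module2 consumes module1's output.
class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.module1 = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
        self.module2 = nn.Linear(16, 1)

    def forward(self, x):
        h = self.module1(x)
        return self.module2(h)

# Assumes dist.init_process_group(...) has already run (e.g. via torchrun).
local_rank = dist.get_rank() % torch.cuda.device_count()
net = TwoStageNet().to(local_rank)
model = DDP(net, device_ids=[local_rank], find_unused_parameters=False)

# One optimizer (and scheduler) per module; during phase 1 I step only
# opt1, during phase 2 only opt2.
opt1 = torch.optim.Adam(model.module.module1.parameters())
opt2 = torch.optim.Adam(model.module.module2.parameters())
```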

The problem is that with DDP this gives me the following error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.

One solution is of course to set find_unused_parameters=True, but this slows down training a lot.
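
For reference, this is how I pass the flag now (continuing from the sketch above):

```python
# Same wrapping as before, but with unused-parameter detection enabled.
model = DDP(
    net,
    device_ids=[local_rank],
    find_unused_parameters=True,  # training works, but is noticeably slower
)
```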

I have tried setting requires_grad=False for all parameters of the frozen module, and also setting “”, but this does not seem to help.
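
Concretely, the freezing/unfreezing I tried looks like this (sketch, continuing from the code above; the phases swap which module is trainable):

```python
def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze every parameter of the given module.
    for p in module.parameters():
        p.requires_grad = trainable

# Phase 1: train module1, keep module2 frozen.
set_trainable(model.module.module1, True)
set_trainable(model.module.module2, False)
# ... train for x epochs, stepping only opt1 ...

# Phase 2: freeze module1, train module2.
set_trainable(model.module.module1, False)
set_trainable(model.module.module2, True)
# ... train for x epochs, stepping only opt2 ...
```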

Do you have any suggestions on the best way to proceed?
Could you also explain in more detail what happens when I set find_unused_parameters to False/True?