Custom training - RuntimeError due to unused parameters

lifesthateasy · April 3, 2023, 9:27pm

I’m trying to use Donut, a transformer model with a PyTorch Lightning implementation, and I want to pre-train it on my desktop on a language it hasn’t been trained on yet. Unfortunately, the version of the stack pinned in the original repo doesn’t support my GPU, so I had to port the code from PyTorch Lightning 1.6 to 2.0. I’m following the upgrade guide, but I’m still running into issues.

On the first run, I got the following error:

RuntimeError: It looks like your LightningModule has parameters that were not used in 
producing the loss returned by training_step. If this is intentional, you must enable 
the detection of unused parameters in DDP, either by setting the string value 
`strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with 
`strategy=DDPStrategy(find_unused_parameters=True)`.

Since I haven’t really used Lightning before, I’m unsure what this means. I managed to get it to run by setting strategy='ddp_find_unused_parameters_true', but I don’t know whether I did something wrong while porting or whether this is by design.

I’ve checked the Lightning documentation, but it has very little information on this. Enabling this option comes with a performance cost, so I’d like to know whether I’m doing something wrong or whether it’s actually needed.
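For reference, here is a minimal sketch of how the two settings from the error message would be passed to the Trainer. The helper build_trainer_kwargs is hypothetical, and the accelerator and devices values are assumptions about a single-GPU desktop. It only builds the keyword arguments, so it runs without Lightning installed.

```python
def build_trainer_kwargs(find_unused: bool) -> dict:
    """Hypothetical helper: keyword arguments for a Lightning 2.0 Trainer."""
    # Both strings are registered strategy names in Lightning 2.0.
    # Plain "ddp" leaves unused-parameter detection off.
    strategy = "ddp_find_unused_parameters_true" if find_unused else "ddp"
    return {
        "accelerator": "gpu",  # assumption: one local GPU
        "devices": 1,          # assumption: single device
        "strategy": strategy,
    }


fast_kwargs = build_trainer_kwargs(find_unused=False)
safe_kwargs = build_trainer_kwargs(find_unused=True)

# Intended use (requires Lightning to be installed):
#   trainer = Trainer(**safe_kwargs)
# The object form from the error message is equivalent:
#   strategy=DDPStrategy(find_unused_parameters=True)
```

With find_unused=True, DDP does extra work on every step to look for parameters that received no gradient, which is where the performance cost comes from.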

The training step is defined as follows:

def training_step(self, batch, batch_idx):
    image_tensors, decoder_input_ids, decoder_labels = list(), list(), list()
    for batch_data in batch:
        image_tensors.append(batch_data[0])
        decoder_input_ids.append(batch_data[1][:, :-1])
        decoder_labels.append(batch_data[2][:, 1:])
    image_tensors = torch.cat(image_tensors)
    decoder_input_ids = torch.cat(decoder_input_ids)
    decoder_labels = torch.cat(decoder_labels)
    loss = self.model(image_tensors, decoder_input_ids, decoder_labels)[0]
    self.log_dict({"train_loss": loss}, sync_dist=True)
    return loss

Here, loss is calculated by self.model, which is an instance of DonutModel (line 369). Another odd thing is that the loss doesn’t seem to decrease during training, according to TensorBoard.

Most of this is new to me, so I’m not sure what’s wrong or where the parameters are checked for this error message. I’ll gladly share more code, and I’d appreciate any help.
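One way to see which parameters the error refers to is to run a single backward pass and list every trainable parameter whose gradient is still None. The sketch below does this in plain PyTorch on a toy module; Toy and find_unused_parameters are hypothetical names for illustration and are not part of Donut or Lightning.

```python
import torch
from torch import nn


def find_unused_parameters(model: nn.Module, loss: torch.Tensor) -> list[str]:
    """Return names of trainable parameters that got no gradient from loss."""
    model.zero_grad(set_to_none=True)  # clear stale gradients first
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


class Toy(nn.Module):
    """Stand-in model with one layer that never contributes to the loss."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 1)
        self.unused = nn.Linear(4, 1)  # never called in forward

    def forward(self, x):
        return self.used(x).mean()


toy = Toy()
unused_names = find_unused_parameters(toy, toy(torch.randn(2, 4)))
print(unused_names)
```

In a LightningModule, the same check could probably be placed in the on_after_backward hook while training on a single device, where DDP is not involved and the error is not raised. Any names it reports are either layers that the forward pass skips or candidates for freezing with requires_grad=False.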
