Multi-task model in version 2.0.9 with DDP error

Hello, I am currently working on a multi-task model with pytorch-lightning 2.0.9, where the whole model is written as a single LightningModule class (see code below).

I get the following error with strategy="ddp":

root INFO - RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with strategy=DDPStrategy(find_unused_parameters=True).
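For completeness, here is a minimal sketch of the two ways the error message suggests enabling unused-parameter detection (the accelerator/devices values are placeholders, not my exact settings):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Option 1: the string shortcut from the error message
trainer = Trainer(accelerator="gpu", devices=4,
                  strategy="ddp_find_unused_parameters_true")

# Option 2: configuring the strategy object directly
trainer = Trainer(accelerator="gpu", devices=4,
                  strategy=DDPStrategy(find_unused_parameters=True))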

Here are the details of how I construct my multi-task model:

I override training_step in the LightningModule and dispatch to one of two submodels, model_one and model_two (plain PyTorch nn.Module subclasses).

  • class model_one(torch.nn.Module)
  • class model_two(torch.nn.Module)

Each task is updated separately with its own data (and its own loss function and labels). The two tasks share some common layers (a shared BERT model) but have different heads.

During training, each batch contains data for only one of the two tasks, and only the corresponding loss is computed. The backward pass therefore updates either the head of model_one together with the shared BERT model, or the head of model_two together with the shared layers.
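To see this concretely, here is a rough sanity check I run outside the Trainer (bert_model and sample_batch are placeholders for my real objects): it lists the parameters that received no gradient after one backward pass on a task-0 batch, which is what DDP then flags as unused.

# Which parameters get no gradient for a batch that belongs to task 0?
# (bert_model and sample_batch are placeholders for my real objects.)
model = multitaskModel(model_one(bert_model), model_two(bert_model))
out = model.training_step(sample_batch, 0)
out['loss'].backward()
for name, p in model.named_parameters():
    if p.requires_grad and p.grad is None:
        print("no gradient:", name)   # expected: only model_two's head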

from pytorch_lightning import LightningModule


class multitaskModel(LightningModule):
    def __init__(self, model1: model_one, model2: model_two):
        super().__init__()
        # Assigning the submodels as attributes registers their parameters
        # with the LightningModule; the plain list below is only for indexing.
        self.model1 = model1
        self.model2 = model2
        self.tasks = [self.model1, self.model2]

    def training_step(self, batch, batch_id):
        task_id = batch[0][0]  # every sample in a batch belongs to the same task
        task_module = self.tasks[task_id]
        output = task_module.training_step(batch, batch_id)
        return output

The task modules are as follows (model_one and model_two are similar, just with a different number of classes).

import torch
from torch import nn
from torch.nn import BCEWithLogitsLoss


class model_one(torch.nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert = bert_model                             # shared BERT encoder
        self.num_class = 10
        self.classifier = nn.Linear(768, self.num_class)   # task-specific head
        self.loss_fun = BCEWithLogitsLoss(reduction='sum')

    def training_step(self, batch, batch_id):
        task_ids, instance_ids, attention_mask, labels = batch
        bert_emb = self.bert(instance_ids, attention_mask, output_hidden_states=True).last_hidden_state
        logits = self.classifier(bert_emb)
        # BCEWithLogitsLoss applies the sigmoid internally, so it takes the raw logits
        loss = self.loss_fun(logits, labels)
        return {'loss': loss}

My questions:

  • All the parameters in model 1 and model 2 are trainable, so why do I get the above error with the DDP strategy? Also, training with strategy='ddp_find_unused_parameters_true' becomes noticeably slower. (A possible workaround is sketched after the error output below.)

  • There seems to be a batch-size limitation when using pytorch-lightning 2.0.9. A single batch is only about 300 KB (the total training data is 5.7 GB in LMDB format) and my GPU is an Nvidia L4 with 24 GB of memory, yet I still get a GPU out-of-memory error with the DDP strategy:

Tried to allocate 84.35 GiB (GPU 5; 21.96 GiB total capacity; 12.16 GiB already allocated; 9.48 GiB free; 12.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
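As a quick sanity check on the numbers (this is just arithmetic on the error message, not a measurement), 84.35 GiB corresponds to roughly 2.3e10 float32 values, so the allocation has to come from an intermediate tensor built during the step rather than from the ~300 KB of raw batch data:

# How many float32 values fit in the 84.35 GiB the allocator asked for?
bytes_requested = 84.35 * 1024**3     # GiB -> bytes
n_float32 = bytes_requested / 4       # 4 bytes per float32
print(f"{n_float32:.2e}")             # ~2.26e+10 values, vastly more than one batch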

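On the first question, one workaround I have seen suggested (not something I have verified) is to keep every registered parameter in the autograd graph on each step by adding a zero-weighted term over the inactive task's parameters, so that plain strategy='ddp' can be used without the slower unused-parameter detection. A minimal sketch of training_step, assuming task_id is always 0 or 1:

def training_step(self, batch, batch_id):
    task_id = batch[0][0]
    task_module = self.tasks[task_id]
    output = task_module.training_step(batch, batch_id)

    # Zero-weighted sum over the inactive task's parameters: it changes neither
    # the loss value nor the meaningful gradients, but it pulls those parameters
    # into the graph so DDP no longer reports them as unused.
    inactive = self.tasks[1 - task_id]            # assumes task_id is 0 or 1
    dummy = sum(p.sum() for p in inactive.parameters())
    output['loss'] = output['loss'] + 0.0 * dummy
    return output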
Here is my nvidia-smi information and my PyTorch version:

Nvidia Driver Version: 535.104.05
NVCC : Cuda compilation tools, release 12.2, V12.2.140
torch 2.0.1+cu118
torchaudio 2.0.2+cu118
torchvision 0.15.2+cu118