Dealing with multiple datasets/dataloaders in Lightning

Federico_Ottomano · April 12, 2023, 12:26pm

I’m in the situation where I’m setting up a training for multiple datasets of different lengths → different number of batches in the DataLoaders. Since I want to use a loss function which combines losses across different datasets (e.g. weighted differently for each dataset) I was trying to keep dataloaders separately into dictionaries:

def train_dataloader(self): #returns a dict of dataloaders
    train_loaders = {}
    for key, value in self.train_dict.items():
        train_loaders[key] = DataLoader(value,
                                        batch_size = self.batch_size,
                                        collate_fn = collate)
    return train_loaders

Then in training_step() I’m doing the following (I’m working on a contrastive learning project):

def training_step(self, batch, batch_idx):
  
total_batch_loss = 0
for key, value in batch.items():
    anc, pos, neg  = value
    emb_anc = F.normalize(self.forward(anc.x,
                                       anc.edge_index,
                                       anc.weights,
                                       anc.batch,
                                       training=True
                                       ), 2, dim=1)
  
    emb_pos = F.normalize(self.forward(pos.x,
                                       pos.edge_index,
                                       pos.weights,
                                       pos.batch,
                                       training=True
                                       ), 2, dim=1)
  
    emb_neg = F.normalize(self.forward(neg.x,
                                       neg.edge_index,
                                       neg.weights,
                                       neg.batch,
                                       training=True
                                       ), 2, dim=1)
                            
    loss_dataset = LossFunc(emb_anc, emb_pos, emb_neg, anc.y, pos.y, neg.y)
    total_batch_loss += loss_dataset
    
self.log("Loss", total_batch_loss, prog_bar=True, on_epoch=True)        
return total_batch_loss

The problem is that I have a different number of batches per dataset and then Lightning will throw a StopIteration when the smallest dataset is exhausted. I was considering simply concatenating everything into a single DataLoader, but that way I don’t think I will be able to compute a loss that gets weighted as I want according to each dataset.

awaelchli · April 18, 2023, 1:45pm

When you create the dataloader, you can wrap it into a CombinedLoader and then specify the mode.
Arbitrary iterable support — PyTorch Lightning 2.0.1.post0 documentation.

Would mode="max_size" make more sense for you? This would conclude the epoch only at the longest dataset, and cycle the other datasets that are shorter.

(this answer applies for Lightning 2.0).

nicolaschahine96 · December 4, 2023, 2:47pm

This is perfect, thank you.

Topic		Replies	Views
Combining loss for multiple dataloaders LightningModule	9	4570	September 12, 2020
Multiple dataloaders in training_step() and use them separately implementation help	0	352	September 13, 2023
How to use multiple train dataloaders with different lengths LightningModule	1	8262	September 27, 2020
How to use two train_dataloaders iterate over each epoch?	1	658	April 11, 2023
How to switch dataloaders between epochs in Lightning LightningModule	4	10120	August 27, 2022

Dealing with multiple datasets/dataloaders in Lightning

Related topics