How to change the way dataloader handles data?

yczhangnaxin · July 24, 2023, 1:15pm

I have the following code:

class TrainDataset(Dataset):
                         ......
    def __getitem__(self, index):
                         ......
            out = {
                'source_ids': src_ids,
                'source_mask': src_mask,
                'target_ids': target_ids,
                'label': label
            }
            out_list.append(out)

        return out_list

class DataModule(pl.LightningDataModule):
    def prepare_data(self):
        self.train = TrainDataset(args)
    def train_dataloader(self):
        train_loader = DataLoader(self.train,
                                  batch_size=self.batch_size,
                                  shuffle=True,
                                  pin_memory=True,
                                  num_workers=4)
        return train_loader

Each item returned by TrainDataset is a list of dicts, like [{a1},{a2}...{an}]. When I set batch_size to 2, the DataLoader groups my data position by position, like this: [[{a1},{b1}]...[{an},{bn}]]. What I expect instead is that each item stays together, like this: [[{a1},{a2}...{an}],[{b1},{b2}...{bn}]].
I hope I made my question clear.

awaelchli · July 30, 2023, 8:36pm

Hi

I think a custom collate_fn would be useful here. It lets you define how the individual samples are assembled into a batch:

def collate_fn(samples):
    # samples is the list of samples returned from your
    # dataset, to be assembled into a batch
    # [[{a1},{a2}...{an}],[{b1},{b2}...{bn}]]
    return samples
    

dataloader = DataLoader(..., collate_fn=collate_fn)
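To make the difference concrete, here is a minimal runnable sketch. ToyDataset, its field values, and the sizes are hypothetical stand-ins for the TrainDataset in the question, not the original code. It compares the default collation with the pass-through collate_fn from above:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for TrainDataset: each item is a list of dicts."""

    def __init__(self, num_items=4, dicts_per_item=3):
        self.num_items = num_items
        self.dicts_per_item = dicts_per_item

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        # Placeholder tensors; the real fields would come from your own preprocessing.
        return [
            {"source_ids": torch.full((5,), index), "label": torch.tensor(j)}
            for j in range(self.dicts_per_item)
        ]


def collate_fn(samples):
    # samples is [item_a, item_b, ...], one list of dicts per dataset item.
    # Returning it unchanged keeps each item's dicts grouped together.
    return samples


# Default collation: the lists are combined position by position,
# so the batch is a list of n dicts whose tensors have a leading batch dimension.
default_batch = next(iter(DataLoader(ToyDataset(), batch_size=2)))

# Custom collation: the batch is a list of batch_size items, each a list of n dicts.
custom_batch = next(iter(DataLoader(ToyDataset(), batch_size=2, collate_fn=collate_fn)))

print(len(default_batch), default_batch[0]["source_ids"].shape)
print(len(custom_batch), len(custom_batch[0]), custom_batch[0][0]["source_ids"].shape)
```

With the pass-through version, the tensors are not stacked, so the training step has to iterate over the nested lists itself.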

Here are the PyTorch docs for this.
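If you would rather get tensors than nested lists, one possible variation is to stack the dicts of each item inside the collate function. This is only a sketch: stack_per_item is a hypothetical name, and it assumes every tensor under a given key has the same shape (for example, already padded to a fixed length).

```python
import torch


def stack_per_item(samples):
    # samples: list of dataset items, each a list of dicts of equally shaped tensors.
    # Returns one dict per item; each value has shape (n, ...) where n = dicts per item.
    return [
        {key: torch.stack([d[key] for d in item]) for key in item[0]}
        for item in samples
    ]


# Two fake items with three dicts each, standing in for real dataset output.
samples = [
    [
        {"source_ids": torch.zeros(5, dtype=torch.long), "label": torch.tensor(j)}
        for j in range(3)
    ]
    for _ in range(2)
]

batch = stack_per_item(samples)
print(len(batch), batch[0]["source_ids"].shape, batch[0]["label"].shape)
```

You would pass it the same way, as DataLoader(..., collate_fn=stack_per_item).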

Hope this helps
