I am lost on custom batch size definition

simpsus · May 14, 2023, 4:54am

Hello,

I have a problem understanding how to define a Dataset/DataLoader combination that processes batches of a custom size. I have a tabular dataset with a categorical variable that defines which batch each row belongs to. I define the dataset like this:

class MyDataset:

    def __init__(self, df, features, target):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(self.df.category_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = self.df.query('category_var==@var')[self.features]
        y = self.df.query('category_var==@var')[self.target]
        return X,y

So each item in my dataset is a custom-sized batch of samples that I want to process.

When I define a train_loader and trainer like this:

early_stopping = EarlyStopping("val_loss")
train_loader = torch.utils.data.DataLoader(
    MyDataset(train, features, target),  batch_size=None, batch_sampler=None
)
val_loader = torch.utils.data.DataLoader(
    MyDataset(val, features, target),  batch_size=None, batch_sampler=None
)
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    callbacks=[early_stopping],
    max_epochs=2,
    #auto_scale_batch_size=True,
    deterministic=True,
    default_root_dir="temp/",
    logger=True,
)

My code tanks with:

TypeError: 'int' object is not callable

Which does not give me an angle to work with.

I guess I am misunderstanding the concept of custom datasets and dataloaders, maybe even the definition of a batch.
Is my dataset meant to return a batch or a sample?
If a sample: how can my dataset define a custom batch size if it only returns one sample? Is that not the job of the dataset? If not, whose job is it? The DataLoader's? The collate_fn's?

Thank you very much

awaelchli · May 17, 2023, 1:57am

TypeError: 'int' object is not callable

The stack trace would be useful to know where the error occurred. Please look at the full error or share it here.

Is my dataset meant to return a batch or a sample?

A sample (normally). The DataLoader takes care of batching the samples together. If your samples aren't tensors or simple data structures like dict/list, then the DataLoader won't know how to form a batch from your samples. In this case you can provide a collate_fn (collation function) and implement your own way of collating samples into a batch.

Read about it here: torch.utils.data — PyTorch 2.1 documentation
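
To illustrate the collate_fn idea described above, here is a minimal sketch. It assumes samples that are tensors with a varying number of rows, which the default collation cannot stack. The names VariableLengthDataset and concat_collate are hypothetical and only serve the example; concatenating along the row dimension is one possible collation strategy, not the only one.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset: each sample is an (n, 3) tensor with a varying n."""

    def __init__(self):
        self.samples = [torch.ones(n, 3) * n for n in (2, 4, 1, 3)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def concat_collate(samples):
    # The DataLoader passes a list of samples (here: batch_size of them).
    # Default collation would call torch.stack, which fails on unequal shapes,
    # so the samples are concatenated along the row dimension instead.
    return torch.cat(samples, dim=0)


loader = DataLoader(VariableLengthDataset(), batch_size=2, collate_fn=concat_collate)
batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)  # [(6, 3), (4, 3)]
```

With shuffling left off, the first batch joins the samples with 2 and 4 rows, and the second joins the samples with 1 and 3 rows.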

simpsus · May 17, 2023, 11:04am

Here is a minimal example (vanilla PyTorch, no Lightning). I cannot understand why it is not working, and I find it hard to believe that my use case is so special.


import torch
import numpy as np
import pandas as pd
import random

data = np.random.rand(100, 10)
df = pd.DataFrame(data, columns=[f'{i}' for i in range(10)])
df["cat_var"] = [random.choice([f"batch_{i+1}" for i in range(5)]) for j in range(100)]
device = "cpu"

class MyDataset:
    def __init__(self, df, features, target, cat_var):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(df.cat_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        print(idx)
        var = self.category_var[idx]
        X = torch.tensor(self.df.query("cat_var==@var")[self.features].values)
        y = torch.tensor(self.df.query("cat_var==@var")[self.target])
        return X, y


dataset = MyDataset(df, [f'{i}' for i in range(9)], '9', "cat_var")
train_loader = torch.utils.data.DataLoader(dataset)


class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.layer = torch.nn.Linear(in_features=9, out_features=1)

    def forward(self, x):
        return self.layer(x)


model = Model().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = torch.nn.MSELoss()
epoch_losses = []
for epoch in range(5):
    epoch_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        l = loss(pred, y)
        epoch_loss += l.item()
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
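
For comparison, here is a hedged sketch of one way the example above could be arranged so that each dataset item is treated as a ready-made batch. It relies on three assumptions that are not confirmed anywhere in this thread: that passing batch_size=None (which disables automatic batching in torch.utils.data.DataLoader) is what is wanted, that the tensors should be float32 to match the default dtype of torch.nn.Linear, and that y should be a column so its shape matches the model output. The class name GroupedDataset and the use of groupby are illustrative choices.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 10)), columns=[str(i) for i in range(10)])
df["cat_var"] = rng.choice([f"batch_{i + 1}" for i in range(5)], size=100)


class GroupedDataset(Dataset):
    """Hypothetical dataset: each item is one whole group of rows (a batch)."""

    def __init__(self, df, features, target, cat_var):
        self.features = features
        self.target = target
        # Split once up front instead of querying the frame on every access.
        self.groups = [group for _, group in df.groupby(cat_var)]

    def __len__(self):
        return len(self.groups)

    def __getitem__(self, idx):
        group = self.groups[idx]
        # Assumption: float32 to match the default dtype of torch.nn.Linear.
        X = torch.tensor(group[self.features].to_numpy(), dtype=torch.float32)
        # Select the target as a one-column frame so y has shape (n, 1).
        y = torch.tensor(group[[self.target]].to_numpy(), dtype=torch.float32)
        return X, y


dataset = GroupedDataset(df, [str(i) for i in range(9)], "9", "cat_var")
# batch_size=None disables automatic batching: items are passed through as-is,
# without the extra leading dimension that the default batch_size=1 would add.
loader = DataLoader(dataset, batch_size=None, shuffle=True)

model = torch.nn.Linear(in_features=9, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

epoch_losses = []
for epoch in range(5):
    epoch_loss = 0.0
    for X, y in loader:
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    epoch_losses.append(epoch_loss)
```

Each iteration of the inner loop then sees one category's rows as X with shape (n, 9) and y with shape (n, 1), where n differs per category.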
