I am lost on custom batch size definition


I have a problem understanding how I can define a Dataset/DataLoader combination that processes batches of a custom size. I have a tabular dataset with a categorical variable defining the batch. I define the dataset like

class MyDataset:

    def __init__(self, df, features, target):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(self.df.category_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = self.df.query('category_var==@var')[self.features]
        y = self.df.query('category_var==@var')[self.target]
        return X,y

So each item in my dataset is a custom sized batch of samples I want to process.

When I defined a train_loader and trainer like

early_stopping = EarlyStopping("val_loss")
train_loader = torch.utils.data.DataLoader(
    MyDataset(train, features, target), batch_size=None, batch_sampler=None
)
val_loader = torch.utils.data.DataLoader(
    MyDataset(val, features, target), batch_size=None, batch_sampler=None
)
trainer = pl.Trainer(

My code tanks with:

TypeError: 'int' object is not callable

Which does not give me an angle to work with.

I guess I am misunderstanding the concept of custom datasets and dataloaders, maybe even the definition of a batch.
Is my dataset meant to return a batch or a sample?
If a sample: how can my dataset define a custom batch size if it only returns one sample? Is that not the job of the dataset? If not, whose is it? The DataLoader's? The collate_fn's?

Thank you very much

TypeError: 'int' object is not callable

The stack trace would be useful to know where the error occurred. Please look at the full error or share it here.

Is my dataset meant to return a batch or a sample?

A sample (normally). The DataLoader takes care of batching the samples together. If your samples aren't tensors or simple data structures like dict/list, the DataLoader won't know how to form a batch from them. In this case you can provide a collate_fn (collation function) and implement your own way of collating samples into a batch.
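
To make the collate_fn idea concrete, here is a minimal sketch. The dataset, its sizes, and the names VariableLengthDataset and concat_collate are made up for illustration. Concatenating along the row dimension is just one possible way to collate, assuming each sample is a (rows, targets) pair with a varying number of rows.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset: sample i is an (X, y) pair with i + 1 rows."""

    def __init__(self, n_samples=4, n_features=3):
        self.n_samples = n_samples
        self.n_features = n_features

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        n_rows = idx + 1
        X = torch.full((n_rows, self.n_features), float(idx))
        y = torch.full((n_rows,), float(idx))
        return X, y


def concat_collate(samples):
    """Join variable-length samples by concatenating along the row dimension."""
    xs, ys = zip(*samples)
    return torch.cat(xs, dim=0), torch.cat(ys, dim=0)


# batch_size=2 groups two samples per batch; concat_collate merges their rows.
loader = DataLoader(VariableLengthDataset(), batch_size=2, collate_fn=concat_collate)
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in loader]
```

Without the custom collate_fn, the default collation would try to stack the samples and fail, because their row counts differ.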

Read about it here: torch.utils.data — PyTorch 2.0 documentation
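
Another option described on that documentation page is the batch_sampler argument: the dataset returns single samples, and a batch sampler yields the list of indices that make up each batch. A rough sketch, assuming one batch per category value; RowDataset, CategoryBatchSampler and the toy data are hypothetical names, not part of PyTorch.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RowDataset(Dataset):
    """Hypothetical per-row dataset: one item is one (features, target) sample."""

    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


class CategoryBatchSampler:
    """Yields one list of row indices per category value."""

    def __init__(self, categories):
        self.index_groups = {}
        for row_idx, cat in enumerate(categories):
            self.index_groups.setdefault(cat, []).append(row_idx)

    def __iter__(self):
        return iter(self.index_groups.values())

    def __len__(self):
        return len(self.index_groups)


# Toy data (assumption): six rows spread over three categories.
categories = ["a", "b", "a", "c", "b", "a"]
X = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.arange(6, dtype=torch.float32)

loader = DataLoader(RowDataset(X, y), batch_sampler=CategoryBatchSampler(categories))
batches = [(xb.shape[0], yb.tolist()) for xb, yb in loader]
```

Here the batch size is decided by the sampler, not the dataset: every batch holds exactly the rows of one category, and the default collation stacks them into tensors.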

Here is a minimal example (vanilla PyTorch, no Lightning). I cannot understand why it is not working, and I cannot wrap my head around the idea that my use case is so special.

import torch
import numpy as np
import pandas as pd
import random

data = np.random.rand(100, 10)
df = pd.DataFrame(data, columns=[f'{i}' for i in range(10)])
df["cat_var"] = [random.choice([f"batch_{i+1}" for i in range(5)]) for j in range(100)]
device = "cpu"

class MyDataset:
    def __init__(self, df, features, target, cat_var):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(df.cat_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = torch.tensor(self.df.query("cat_var==@var")[self.features].values)
        y = torch.tensor(self.df.query("cat_var==@var")[self.target])
        return X, y

dataset = MyDataset(df, [f'{i}' for i in range(9)], '9', "cat_var")
train_loader = torch.utils.data.DataLoader(dataset)

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.layer = torch.nn.Linear(in_features=9, out_features=1)

    def forward(self, x):
        return self.layer(x)

model = Model().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = torch.nn.MSELoss()
epoch_losses = []
for epoch in range(5):
    epoch_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        l = loss(pred, y)
        epoch_loss += l.item()
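
For comparison, a sketch of one way to make an example like this run, under a few assumptions: batch_size=None is passed so the DataLoader disables automatic batching and hands each dataset item through as a ready-made batch, the tensors are cast to float32 to match the Linear layer's default dtype, and the target keeps a trailing dimension so it lines up with the (n, 1) prediction. The name GroupDataset and the fixed group sizes are illustrative, and this is not a diagnosis of the original traceback.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Toy frame (assumption): 100 rows, 10 numeric columns, 5 unequal groups.
df = pd.DataFrame(rng.random((100, 10)), columns=[str(i) for i in range(10)])
df["cat_var"] = np.repeat([f"batch_{i + 1}" for i in range(5)], [10, 15, 20, 25, 30])


class GroupDataset(Dataset):
    """Each item is one whole group of rows, i.e. one ready-made batch."""

    def __init__(self, df, features, target, cat_var):
        self.features = features
        self.target = target
        # Split once up front instead of querying the frame on every access.
        self.groups = [group for _, group in df.groupby(cat_var)]

    def __len__(self):
        return len(self.groups)

    def __getitem__(self, idx):
        group = self.groups[idx]
        X = torch.tensor(group[self.features].to_numpy(), dtype=torch.float32)
        # Double brackets keep a column dimension, so y has shape (n, 1).
        y = torch.tensor(group[[self.target]].to_numpy(), dtype=torch.float32)
        return X, y


dataset = GroupDataset(df, [str(i) for i in range(9)], "9", "cat_var")
# batch_size=None: no automatic batching, items are passed through unchanged.
loader = DataLoader(dataset, batch_size=None, shuffle=True)

model = torch.nn.Linear(in_features=9, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

epoch_losses = []
batch_sizes = []
for epoch in range(5):
    epoch_loss = 0.0
    for X, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        if epoch == 0:
            batch_sizes.append(X.shape[0])
    epoch_losses.append(epoch_loss)
```

With the default batch_size=1, the DataLoader would instead wrap every group in an extra leading dimension of size 1, which is usually not what a "one item is one batch" dataset wants.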