Hello,
I have trouble understanding how to define a Dataset/DataLoader combination that processes batches of a custom size. I have a tabular dataset with a categorical variable that defines the batches. I define the dataset like this:
class MyDataset:
    def __init__(self, df, features, target):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(self.df.category_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = self.df.query('category_var == @var')[self.features]
        y = self.df.query('category_var == @var')[self.target]
        return X, y
So each item in my dataset is one custom-sized batch of samples that I want to process together.
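For what it's worth, this is how I picture one item and how I assume it eventually needs to become tensors (the conversion and dtypes here are just my guess):

import torch

ds = MyDataset(train, features, target)
X0, y0 = ds[0]                                         # all rows belonging to the first category
X0_t = torch.tensor(X0.values, dtype=torch.float32)    # shape (rows_in_group, n_features)
y0_t = torch.tensor(y0.values, dtype=torch.float32)    # shape (rows_in_group,)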
The train and validation loaders and the trainer are defined like this:
early_stopping = EarlyStopping("val_loss")
train_loader = torch.utils.data.DataLoader(
    MyDataset(train, features, target), batch_size=None, batch_sampler=None
)
val_loader = torch.utils.data.DataLoader(
    MyDataset(val, features, target), batch_size=None, batch_sampler=None
)
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    callbacks=[early_stopping],
    max_epochs=2,
    # auto_scale_batch_size=True,
    deterministic=True,
    default_root_dir="temp/",
    logger=True,
)
When I run this, the code crashes with:
TypeError: 'int' object is not callable
That error message does not give me much to go on.
I suspect I am misunderstanding the concept of custom Datasets and DataLoaders, and maybe even what a batch is.
Is my dataset meant to return a batch or a sample?
If a sample: how can I get custom-sized batches if the dataset only returns one sample at a time? Is defining the batch not the dataset's job? If not, whose is it: the DataLoader's? the collate_fn's? A rough sketch of what I imagine follows below.
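To make the question concrete, here is my guess at the per-sample alternative: the dataset returns single rows and a custom batch_sampler groups row indices by the categorical variable (all class names below are made up for illustration):

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class RowDataset(Dataset):
    # returns one row at a time; batching is left entirely to the sampler
    def __init__(self, df, features, target):
        self.X = torch.tensor(df[features].values, dtype=torch.float32)
        self.y = torch.tensor(df[target].values, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

class GroupBatchSampler:
    # yields one list of row indices per category, so each batch is one full group
    def __init__(self, df):
        self.groups = [np.flatnonzero((df.category_var == c).values)
                       for c in df.category_var.unique()]

    def __iter__(self):
        for idx in self.groups:
            yield idx.tolist()

    def __len__(self):
        return len(self.groups)

train_loader = DataLoader(
    RowDataset(train, features, target), batch_sampler=GroupBatchSampler(train)
)

Is that the intended pattern, or is returning whole pre-built batches from __getitem__ with batch_size=None equally valid?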
Thank you very much