I have a PyTorch Geometric test DataLoader that yields three batches of 64 graphs:
DataBatch(x=[1585, 5], edge_index=[2, 3042], y=[64], batch=[1585], ptr=[65])
DataBatch(x=[1311, 5], edge_index=[2, 2494], y=[64], batch=[1311], ptr=[65])
DataBatch(x=[1963, 5], edge_index=[2, 3798], y=[64], batch=[1963], ptr=[65])
There are actually 200 samples in this dataset, but I set drop_last=True
to avoid issues with incomplete batches (which I have been told can be error-prone).
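For context, my understanding of drop_last is that it discards the trailing partial batch, so with 200 samples and a batch size of 64 I expect 3 full batches and 8 discarded samples (a plain-Python sketch of that arithmetic, not the actual loader):

```python
num_samples = 200   # total graphs in the test set
batch_size = 64

# Batches kept when drop_last=True, and the size of the discarded remainder
full_batches = num_samples // batch_size
dropped = num_samples % batch_size

print(full_batches, dropped)  # 3 8
```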
But when I predict on these batches like this:
model.eval()
trainer = pl.Trainer(accelerator='gpu', devices=-1)
predictions = trainer.predict(model, graph_test_loader)  # graph_test_loader is the 3-batch loader above
The output is:
(tensor(0.6912), tensor(0.5312), tensor(0.6939), tensor(0.5312), tensor(1.), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=torch.int32))
(tensor(0.7148), tensor(0.3594), tensor(0.5287), tensor(0.3594), tensor(1.), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=torch.int32))
(tensor(0.6912), tensor(0.5312), tensor(0.6939), tensor(0.5312), tensor(1.), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=torch.int32))
(tensor(0.7127), tensor(0.3750), tensor(0.5455), tensor(0.3750), tensor(1.), tensor([1, 1, 1, 1, 1, 1, 1, 1], dtype=torch.int32))
So it has predicted on all 200 samples: the 3 × 64 batches above plus the final batch of 8 that I expected drop_last=True to discard. But when I iterate over the loader:
for each_data_list in graph_test_loader:
    print(each_data_list)
it only prints the three batches I have described above.
How can I also see the DataBatch for the incomplete batch, since trainer.predict clearly ran on it?
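Is rebuilding the loader with drop_last=False the only way? My mental model of the batching is the following plain-Python sketch (indices stand in for graphs; this is not the PyG loader itself):

```python
def batch_indices(n, batch_size, drop_last):
    """Mimic DataLoader batching over sample indices 0..n-1."""
    out = [list(range(i, min(i + batch_size, n)))
           for i in range(0, n, batch_size)]
    if drop_last and out and len(out[-1]) < batch_size:
        out.pop()  # discard the trailing partial batch
    return out

print([len(b) for b in batch_indices(200, 64, drop_last=True)])   # [64, 64, 64]
print([len(b) for b in batch_indices(200, 64, drop_last=False)])  # [64, 64, 64, 8]
```

If that model is right, iterating a loader built with drop_last=False should show the fourth DataBatch of 8, but I don't understand why trainer.predict already sees it when my loader was built with drop_last=True.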