Len warning given during training

I get the len warning when using Deeplake with Lightning.
Previously, this warning was discussed here: IterableDataset with wrong length causes validation loop to be skipped. · Issue #10290 · Lightning-AI/pytorch-lightning · GitHub

Since I depend on Deeplake, it is not possible for me to address this warning without changing Deeplake's source code. Thus, I raised the issue on the Deeplake forum. I attach their response as an image.

So do you think I should suppress Lightning’s warning, or could I still face problems?
@awaelchli

@oguz-hanoglu Lightning checks whether the user has __len__ defined on the iterable dataset because it is often a source of error: the user defines it naively with a single process in mind, and it then silently misbehaves when using multiple workers.

Note that Lightning is just warning the user the same way PyTorch does, only with different text. I’ll drop an example here that will hopefully show you what might go wrong:

import torch
from torch.utils.data import DataLoader, IterableDataset


class MyDataset(IterableDataset):
    def __iter__(self):
        # Naive implementation: every worker process gets its own replica
        # of the dataset, and each replica iterates the full range.
        for i in range(8):
            yield torch.tensor(i)

    def __len__(self):
        # Length defined with a single process in mind.
        return 8


def main():
    loader = DataLoader(MyDataset(), batch_size=1, num_workers=2)
    print(len(loader))  # reports 8
    for data in loader:
        print(data)  # but 16 samples arrive, 8 from each of the 2 workers


if __name__ == "__main__":
    main()

Note the warning emitted by PyTorch:

UserWarning: Length of IterableDataset <__main__.MyDataset object at 0x1315dafd0> was reported to be 8 (when accessing len(dataloader)), but 9 samples have been fetched. For multiprocessing data-loading, this could be caused by not properly configuring the IterableDataset replica at each worker. Please see torch.utils.data — PyTorch 2.1 documentation for examples.

The warning fires as soon as the number of fetched samples exceeds the reported length: each of the two workers iterates the full range, so 16 samples are produced in total, and the ninth already contradicts the reported length of 8.

At the end of the day, this is a warning, not an error. It is not forbidden to define a length, but it is not required either, and ultimately it is up to the user to implement the IterableDataset sampling correctly with multiprocessing in mind. If Deeplake does this, then the warning (whether from PyTorch or from Lightning) can be ignored.
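For reference, here is a minimal sketch of a worker-aware version of the dataset above (the class name is just for illustration, not Deeplake's actual code): each worker uses torch.utils.data.get_worker_info() to yield a disjoint shard, so the number of fetched samples matches __len__ and no warning is emitted.

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class MyShardedDataset(IterableDataset):
    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield everything.
            start, step = 0, 1
        else:
            # Multi-process loading: each worker yields a disjoint slice.
            start, step = info.id, info.num_workers
        for i in range(start, 8, step):
            yield torch.tensor(i)

    def __len__(self):
        # Total across all workers, now consistent with what is fetched.
        return 8


def main():
    loader = DataLoader(MyShardedDataset(), batch_size=1, num_workers=2)
    print(len(loader))  # 8
    for data in loader:
        print(data)  # exactly 8 samples, no warning


if __name__ == "__main__":
    main()

If Deeplake shards its iterable dataset in a way equivalent to this, the reported length stays accurate under multiprocessing and suppressing the warning carries no risk.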
