Sharding dataset initialization across ranks at init time

Hi all, I have a large dataset (>100TB) and I’m training a model across 2 nodes of 8x GPUs with DDP on Slurm. My dataset needs to build an index over a large number of files at the start of training, and this step takes a long time. Frustratingly, every rank builds the same full index even though during training each rank only ever sees its own split of the data, so most of that work is wasted. My question is: is there a way to set up the dataset so each rank loads only a rank-specific subset, and tell fabric.setup_dataloaders not to split the dataloader across ranks on top of that?
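To make that concrete, here is a minimal sketch of the kind of dataset I have in mind, where the file list is round-robin split by rank before the expensive index build. The file paths and the trivial `(path, offset)` index entries are just placeholders for my real indexing step:

```python
from torch.utils.data import Dataset


class RankShardedDataset(Dataset):
    """Builds the expensive file index only for this rank's shard."""

    def __init__(self, file_paths, rank, world_size):
        # Round-robin split: each rank keeps every world_size-th file,
        # so the slow index build only touches ~1/world_size of the data.
        self.files = file_paths[rank::world_size]
        # Stand-in for the real (slow) per-file indexing step.
        self.index = [(path, 0) for path in self.files]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        path, offset = self.index[i]
        # Stand-in for actually loading/decoding the sample at this offset.
        return path, offset
```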

Here’s my current hypothesis: setup_dataloaders has a use_distributed_sampler argument. If I set it to False and pass in the per-rank sharded dataset, that should do what I want, right? Are there any undesired side effects to doing it this way?
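In other words, something like the following (continuing the sketch above; the Fabric arguments, the file list, and the DataLoader settings are placeholders for my actual Slurm setup):

```python
import lightning as L
from torch.utils.data import DataLoader

fabric = L.Fabric(accelerator="gpu", devices=8, num_nodes=2, strategy="ddp")
fabric.launch()

# Placeholder file list; in reality this comes from the >100TB store.
all_files = [f"/data/shard_{i}.bin" for i in range(10_000)]

# Each rank only indexes its own slice of the files.
dataset = RankShardedDataset(all_files, fabric.global_rank, fabric.world_size)
loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)

# With use_distributed_sampler=False, Fabric should leave the sampler alone,
# so each rank just iterates over (and shuffles) its own shard.
loader = fabric.setup_dataloaders(loader, use_distributed_sampler=False)
```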