What is it exactly that Lightning/Fabric DataLoaders do?

Hello lightning forums :] New to the forums, been using Fabric for a month or two now. Just wanna say that, overall, you guys have done a pretty great job making GPU parallelism easily accessible. I seriously cannot thank the devs here enough for the work they’ve put into this - these kinds of projects make advanced tasks like multi-GPU nn training so much more accessible for people who are not programmers by trade. +1 for removal of barriers to entry, so thanks again, guys. Anyway, on to the problem I’m having…

I have recently moved from training this model I’m working on in a single GPU setting to a multi GPU setting. To date I have been using a custom class for loading data during nn training - I’ll refer to this class as DM, and to PyTorch’s/Lightning’s/Fabric’s standard DataLoaders as DL. I do this because I see about a 450-550% speedup in all data loading/formatting ops when using DM as opposed to DL ( this is the case with stock-standard PyTorch too - it’s not a Lightning/Fabric problem ), which obviously radically reduces my training time. In the single-GPU setting, this has 0 impact on the performance of the trained model or any loss statistics etc - in other words, it does nothing but cut down my training time by a factor of about 3.5-4.5x at no cost to model performance.

But since switching to a distributed setting, this is no longer the case. I have modified my DM class to ( perhaps naively ) handle the distributed case, with each process started by Fabric ( using DDP strategy ) having its own DMₙ on GPU rank n. DMₙ will yield batches of data as tensors on device torch.device(f"cuda:{n}") for the forward pass executed by the DDP process with rank n. Very simple…or so I thought.
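
For reference, that pattern boils down to something like this ( a stripped-down pure-Python sketch with illustrative names - the real DM is written in Cython and does a lot more ):

```python
import torch

class SimpleDM:
    """Simplified sketch of the per-rank loading pattern described above."""

    def __init__(self, x, y, batch_size, rank, world_size):
        # Naive sharding: rank n takes every world_size-th sample, starting at n.
        self.x = x[rank::world_size]
        self.y = y[rank::world_size]
        self.batch_size = batch_size
        self.device = torch.device(f"cuda:{rank}")

    def __iter__(self):
        for i in range(0, len(self.x), self.batch_size):
            # Batches are yielded already sitting on this rank's GPU.
            yield (self.x[i:i + self.batch_size].to(self.device),
                   self.y[i:i + self.batch_size].to(self.device))
```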

After training in the distributed setting like this for a while, I tested the model’s performance and it’s not just failing to learn - it’s actually getting worse the more it’s trained.

But the point of this post isn’t a deep diagnosis of the problem…it’s too involved. All I am asking is this: My assumption is that the problem in overall model performance arises from the fact that the only part of my distributed setting which differs from the stock-standard Fabric DDP pipeline is the change from DL to DM. So is there anyone out there that can give me some idea of what else Fabric’s DataLoaders do in the DDP context? Is there something else going on here other than partitioning the dataset into n subsets for n GPUs and sending the data to the right GPU at the right time?

I also note some distinct, regular differences in the GPU utilization patterns shown in nvitop when I use Fabric’s DL vs my own DM. Using DM, nvitop looks more or less like this for the entire training process:


However, when I switch to DL, nvitop’s output alternates regularly: it looks like the above for a while, then ranks 1-7 show utilization at max, then they all drop to roughly 0 while rank 0’s utilization jumps to max, and then all ranks return to looking like the above. I’m guessing this is some kind of sync that’s happening between ranks. This sync ( if that is indeed what it is ) must somehow be facilitated by the presence of Fabric’s DLs, as this specific resource utilization pattern is totally absent when I replace DL with DM ( NB I make absolutely no other changes when I swap between DL and DM, so the problem must be isolated to that change ).

Feel free to ask for additional details - I’ll post whatever code etc that I can to make this easier to diagnose. But mostly, I’m just trying to understand what Fabric’s DLs are doing during training that my DM is not, as the only thing DM does is simply send batches to the right GPU when requested. Are Fabric’s DLs doing something more than this?

Hi @Clion

Adrian from the Lightning team here.
Thanks for the kind words! I am closely monitoring questions and feedback around Fabric, so thanks for posting here about it.

Wow, awesome work. This sounds like a great opportunity for a blog post btw!

Your approach of separating the data into n shards, one per rank, sounds good. It is important to make sure that there is no overlapping data, or that the data is independently sampled. For example, if you are using random sampling for the data, you would typically need to set a different seed/random state per rank so that the sampling is not identical across ranks. A simple trick in Fabric would be to do:

fabric.seed_everything(seed + fabric.global_rank)

If you aren’t doing random shuffling/sampling, then this isn’t a concern.
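
A minimal sketch of how that looks in a Fabric script (the Fabric arguments here are just an example):

```python
import lightning as L

fabric = L.Fabric(accelerator="cuda", devices=8, strategy="ddp")
fabric.launch()

# Offset the base seed by the global rank so that each process draws
# its own, non-identical random samples.
base_seed = 42
fabric.seed_everything(base_seed + fabric.global_rank)
```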

So is there anyone out there that can give me some idea of what else Fabric’s DataLoaders do in the DDP context?

That’s me! What Fabric does to a dataloader is pretty simple:

  1. Attaches a DistributedSampler to the dataloader if you are using DDP (Code)
  2. Wraps the dataloader so that calling next() on the iterator automatically moves the data to the device (Code)

That’s all. These two features only apply if you call fabric.setup_dataloaders(). So in your case, if you just use your special dataloader then you can be sure that Fabric doesn’t do anything to your loading logic.
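
In other words, what fabric.setup_dataloaders() gives you is roughly equivalent to doing this yourself (a simplified sketch, not the actual implementation; dataset and a launched fabric are assumed to exist):

```python
from torch.utils.data import DataLoader, DistributedSampler

# 1) Shard the data: each rank only ever sees its own 1/world_size slice.
sampler = DistributedSampler(
    dataset,
    num_replicas=fabric.world_size,
    rank=fabric.global_rank,
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# 2) Move each batch to the local device as it comes off the iterator.
for batch in loader:
    batch = fabric.to_device(batch)
    ...
```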

Maybe one thing to consider: does your DM vs DL comparison use multi-processing (num_workers > 0), or was this disabled?

Hey @awaelchli, thanks for the quick reply :] As for the DM vs DL performance comparison, my assumption is just that the DATAMACHINE class I made 1) is specialized specifically for my needs, and hence lacks the overhead that comes with all the flexibility DataLoaders need as a generic solution that has to work for every case people use them for, and 2) is written in Cython instead of pure Python, so it’s basically executing all its code in C without any of Python’s dynamic dispatch etc - it has nearly 0 Python interpreter overhead. I’ve actually been doing some experimentation translating PyTorch into Cython ( CyTorch, eheheheh ) to see what kind of speedups I can get, with some very positive results…but that’s sort of beside the point.

Regarding multi-processing, DM has no such capability, and I’ve never really messed with it when using DataLoaders either. I actually only recently switched from TensorFlow to PyTorch in the middle of what is already a very complex project, one that’s also required me to switch from Python to Cython to achieve the speeds I need, so there’s lots of stuff I’ve had to add to the “learn how this works later” list along the way lol…and leveraging DL’s multiprocessing abilities is one of them.

But to finally get to the point…thank you so much for the clarification. Now, regarding the DistributedSampler - does this do anything but handle seeding etc to make sure sampling happens properly? The difference in GPU utilization behaviour when using DLs vs DMs is very interesting to me. I can get some screenshots of how the resource utilization looks when I run the code using DLs instead. The pattern - big alternating spikes of maxed-out activity on ranks 1-7, then those seven dropping to zero utilization while rank 0 briefly spikes to max, then all ranks resuming more-or-less even activity before spiking in exactly the same way again - is screaming to me that something else is going on here that I’m not aware of.

Are you on the Lightning Discord? Would that be a better place to carry on this discussion or do you prefer to keep it here? Either’s fine by me.

Thanks again :]

An update, from a test I’m currently running using DL instead of DM. This is the GPU utilization pattern I was referring to. Utilization starts as in the image in my original post, then does this ( first the top, then the bottom ):

…before returning to the pattern displayed in the original post for a few seconds, after which this cycle repeats. No such cycle is seen when using DM, whose utilization pattern resembles the image in the original post for the entire training period. Wonder what could be causing this…

DistributedSampler - does this do anything but handle seeding etc to make sure sampling happens properly?

Yes, it has its own seed:
Docs

Code
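
For reference, a minimal sketch of the usual DistributedSampler pattern (dataset, world_size, rank and num_epochs are placeholders): each rank gets a disjoint slice of the indices, and set_epoch() has to be called so the shuffle order changes every epoch.

```python
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(
    dataset,
    num_replicas=world_size,
    rank=rank,
    shuffle=True,
    seed=42,  # the sampler's own seed, shared by all ranks
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    # Without this call, every epoch reuses the same shuffled order.
    sampler.set_epoch(epoch)
    for batch in loader:
        ...
```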

When the GPU utilization drops, it could be that the GPU is simply waiting for the CPU to deliver the data. Normally, one would overlap the loading of data with running the model on the GPU so that the data is always ready for the next iteration. The torch DataLoader does prefetching and, if enabled, multiprocessing to take advantage of multiple CPU cores. In your screenshot, though, the CPU utilization is low, so I don’t think the CPU is the bottleneck.
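
If you want to try overlapping data loading with compute on the DL side, the relevant knobs look something like this (values are illustrative):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # placeholder dataset
    batch_size=32,
    num_workers=4,            # load batches in separate worker processes
    prefetch_factor=2,        # each worker keeps 2 batches ready ahead of time
    pin_memory=True,          # page-locked host memory for faster copies to GPU
    persistent_workers=True,  # keep workers alive between epochs
)
```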

But could you try disabling the move of your data to the GPU in your dataloading code (leave it on CPU) and letting Fabric do it?
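
Since DM isn’t a torch DataLoader, one way to do that (just a sketch - dm, model and training_step stand in for your own objects) would be:

```python
for batch in dm:                       # DM now yields plain CPU tensors
    batch = fabric.to_device(batch)    # Fabric moves them to this rank's GPU
    loss = training_step(model, batch)
```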

Are you on the Lightning Discord?
Yep, my name is just Adrian there.