Hello lightning forums :] New to the forums, been using Fabric for a month or two now. Just wanna say that, overall, you guys have done a pretty great job making GPU parallelism easily accessible. I seriously cannot thank the devs here enough for the work they’ve put into this - these kinds of projects make advanced tasks like multi-GPU nn training so much more accessible for people who are not programmers by trade. +1 for removal of barriers to entry, so thanks again, guys. Anyway, on to the problem I’m having…
I have recently moved from training this model I’m working on in a single-GPU setting to a multi-GPU setting. To date I have been using a custom class for loading data during nn training - I’ll refer to this class as DM, and to PyTorch’s/Lightning’s/Fabric’s standard DataLoaders as DL. I do this because I see roughly a 450-550% speedup in all data loading/formatting ops when using DM instead of DL ( this is the case with stock-standard PyTorch too - it’s not a Lightning/Fabric problem ), which radically reduces my training time. In the single-GPU setting this has zero impact on the performance of the trained model, loss statistics, etc. - in other words, it does nothing but cut my training time by a factor of about 3.5-4.5x at no cost to model performance.
But since switching to a distributed setting, this is no longer the case. I have modified my DM class to ( perhaps naively ) handle the distributed case: each process started by Fabric ( using the DDP strategy ) gets its own DMₙ on GPU rank n, and DMₙ yields batches of data as tensors on torch.device(f"cuda:{n}") for the forward pass executed by the DDP process with rank n. Very simple…or so I thought.
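To make the comparison concrete, here’s a stripped-down sketch of the shape of DM in the distributed case ( illustrative only - the names are made up and the real class does a lot more loading/formatting work that I’ve elided ):

```python
import torch

class DM:
    """Stripped-down sketch of my per-rank data manager ( illustrative, not the real class )."""

    def __init__(self, data: torch.Tensor, targets: torch.Tensor, batch_size: int, rank: int):
        # Each DDP process constructs its own DM with its rank n and gets
        # batches back as tensors already resident on cuda:n.
        self.device = torch.device(f"cuda:{rank}")
        self.data = data.to(self.device)
        self.targets = targets.to(self.device)
        self.batch_size = batch_size

    def __iter__(self):
        # Walk the data in fixed-size slices; the key point is only that every
        # batch comes out on this rank's GPU, ready for the forward pass.
        for start in range(0, len(self.data), self.batch_size):
            yield (self.data[start:start + self.batch_size],
                   self.targets[start:start + self.batch_size])
```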
After training in the distributed setting like this for a while, I tested the model’s performance, and it’s not just failing to learn - it’s actively getting worse the longer it trains.
But the point of this post isn’t a deep diagnosis of the problem…that’s too involved. All I’m asking is this: my assumption is that the drop in model performance comes from the only part of my distributed setup that differs from the stock-standard Fabric DDP pipeline, namely the change from DL to DM. So can anyone give me some idea of what else Fabric’s DataLoaders do in the DDP context? Is there something going on beyond partitioning the dataset into n subsets for n GPUs and sending the data to the right GPU at the right time?
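For reference, my mental model of the stock-standard pipeline I’m comparing against is roughly this ( simplified - the model, dataset and hyperparameters here are placeholders, not my actual ones ):

```python
import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset

# Simplified stand-in for the stock Fabric DDP pipeline ( placeholder model/dataset ).
fabric = L.Fabric(accelerator="cuda", devices=8, strategy="ddp")
fabric.launch()

dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

model, optimizer = fabric.setup(model, optimizer)
# My understanding is that this call is where Fabric injects a DistributedSampler
# ( the per-rank partitioning ) and takes over moving each batch to the right device.
dataloader = fabric.setup_dataloaders(dataloader)

for batch, target in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch), target)
    fabric.backward(loss)
    optimizer.step()
```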
I also notice some distinct, regular differences in the GPU utilization patterns nvitop shows when I use Fabric’s DL vs my own DM. Using DM, nvitop looks more or less like this for the entire training process:
However, when I switch to DL, nvitop’s output alternates regularly between looking like the above and a second pattern: ranks 1-7 sit at MAX utilization, then they all drop to roughly 0 while rank 0’s utilization jumps to MAX, and then every rank goes back to looking like the above. I’m guessing this is some kind of sync happening between ranks. This sync ( if that is indeed what it is ) must somehow be facilitated by the presence of Fabric’s DLs, because the pattern is totally absent when I replace DL with DM ( NB I make absolutely no other changes when I swap between DL and DM, so the problem must be isolated to that change ).
Feel free to ask for additional details - I’ll post whatever code etc. I can to make this easier to diagnose. But mostly I’m just trying to understand what Fabric’s DLs are doing during training that my DM is not, since the only thing DM does is send batches to the right GPU when requested. Are Fabric’s DLs doing something more than that?