Behaviour of dropout in a multi-GPU setting

Would setting a different seed for each process in a multi-GPU training paradigm lead to unstable training because of the dropout layers? My thinking is that each dropout layer would operate independently, and when we all-reduce gradients coming from layers with different masks, that wouldn’t be the behavior we want.
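
Roughly the setup I have in mind (just a minimal sketch, assuming DDP on CPU with the gloo backend; the per-rank torch.manual_seed(rank) call is the part I’m asking about):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # different seed per process -> different dropout mask per process
    torch.manual_seed(rank)

    # DDP broadcasts rank 0's weights at construction, so the weights still
    # match across ranks even though the seeds differ
    model = DDP(nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5)))

    model(torch.ones(1, 2)).sum().backward()  # DDP averages the grads here

    print(f"rank {rank} grad after all-reduce:\n{model.module[0].weight.grad}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)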

@vanshils I don’t see any immediate negative implications here. You mention all-reducing some masks; is that a step in your optimization? Note that in DDP, only the gradients get all-reduced, nothing else. So unless you make some all-reduce calls yourself, you don’t need to worry about that.

Hi,
I was talking about all-reducing the gradients. During dropout’s backward pass, the upstream gradients would be multiplied by different masks (since each dropout operates independently), and then we would all-reduce them. How is that conceptually equivalent to dropout in a single-GPU, mini-batch, no-data-parallelism setting? Wouldn’t this lead to unstable training when a network with dropout is scaled to multiple GPUs?
If possible, could you please elaborate on your thinking?

@vanshils Here I’m simulating what happens when gradients are computed for a model with dropout across two processes:

import torch
import torch.nn as nn


# simulated "process 0": its own seed, hence its own dropout mask
torch.manual_seed(0)
model0 = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))
input0 = torch.ones(1, 2)
model0(input0).sum().backward()

print("model 0 grad:")
print(model0[0].weight.grad)

# simulated "process 1": a different seed, hence a different dropout mask
torch.manual_seed(1)
model1 = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))

# ensure both models have the same initial weights
with torch.no_grad():
    model1[0].weight.copy_(model0[0].weight)

input1 = torch.ones(1, 2)
model1(input1).sum().backward()
print("model 1 grad:")
print(model1[0].weight.grad)

# simulate all-reduce: average the gradients across the two "processes"
reduced_grad = (model0[0].weight.grad + model1[0].weight.grad) / 2
print("reduced_grad:")
print(reduced_grad)

The first model gets contributions from both outputs, while the second model (different seed for the dropout) gets contributions from only one of the outputs (the other was suppressed by the dropout layer). What you are referring to as “unstable training” is perhaps the case where the gradient is zero for some parameters, for example when you set torch.manual_seed(2) in the above example.
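
If it helps, here is a small scan over seeds for the second simulated “process” (the reduced_grad_for_seeds helper is just something I put together for illustration; which particular seed produces an all-zero row depends on the RNG):

import torch
import torch.nn as nn


def reduced_grad_for_seeds(seed0, seed1):
    """Simulate a 2-process all-reduce: same weights, dropout masks drawn
    from two different seeds, gradients averaged."""
    grads, base_weight = [], None
    for seed in (seed0, seed1):
        torch.manual_seed(seed)
        model = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))
        if base_weight is None:
            base_weight = model[0].weight.detach().clone()
        else:
            # ensure both "processes" start from the same weights
            with torch.no_grad():
                model[0].weight.copy_(base_weight)
        model(torch.ones(1, 2)).sum().backward()
        grads.append(model[0].weight.grad)
    return (grads[0] + grads[1]) / 2


for seed1 in range(5):
    g = reduced_grad_for_seeds(0, seed1)
    zero_rows = (g.abs().sum(dim=1) == 0).sum().item()
    print(f"seed1={seed1}: rows of the averaged grad that are zero: {zero_rows}")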

So the question is more like: “Can training become unstable when some parameters receive a 0 gradient”? That’s a more general question, right?

Now take a look at this example, where we use a single model but a batch size of 2. The same thing can happen!!!

import torch
import torch.nn as nn


torch.manual_seed(0)
model0 = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))
# batch of 2: dropout draws an independent mask for each sample in the batch
input0 = torch.ones(2, 2)
model0(input0).sum().backward()

print("model 0 grad:")
print(model0[0].weight.grad)
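
As a quick follow-up check (continuing the snippet above), you can count how many weight rows ended up with a zero gradient, mirroring the two-process case:

# continuing from the snippet above: a row of weight.grad is zero only when
# dropout suppressed that output unit for every sample in the batch
g = model0[0].weight.grad
print("rows with zero gradient:", (g.abs().sum(dim=1) == 0).sum().item())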

Now you can ask the same question:
“Does training on a single GPU with dropout and batch size > 1 lead to unstable training”?

I think that, in general, optimization can become unstable for many reasons. Here, a gradient of 0 means we have no information that could lead to an update of that parameter, so we leave it unchanged (according to SGD). So I don’t necessarily see how that alone could lead to instability.
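
Just to make the SGD point concrete, here is a tiny toy example (not taken from any real training setup):

import torch

# plain SGD: a zero gradient in one row leaves that row of the parameter
# untouched for this step
w = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

w.grad = torch.tensor([[0.0, 0.0], [1.0, 1.0]])  # pretend row 0 was masked out
opt.step()
print(w)  # row 0 unchanged; row 1 moved by -lr * grad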

I encourage you to try the above code snippets and inspect the outputs!

Thanks a lot for such a detailed response. I am looking into the code and thinking it through.
Very grateful for this reply.