Behaviour of dropout in a multi-GPU setting

Would setting a different seed for each process in a multi-GPU training paradigm lead to unstable training because of the dropout layers? My thinking is that each dropout layer would operate independently, and when we all-reduce gradients coming from layers with different masks, that wouldn’t be the behavior we want.
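
Roughly the setup I have in mind (just a minimal sketch, assuming DDP on CPU with the gloo backend; the per-rank torch.manual_seed(rank) call is the part I’m asking about):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # different seed per process -> different dropout mask per process
    torch.manual_seed(rank)

    # DDP broadcasts rank 0's weights at construction, so the weights still
    # match across ranks even though the seeds differ
    model = DDP(nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5)))

    model(torch.ones(1, 2)).sum().backward()  # DDP averages the grads here

    print(f"rank {rank} grad after all-reduce:\n{model.module[0].weight.grad}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)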

@vanshils I don’t see any immediate negative implications here. You mention all-reducing some masks; is that a step in your optimization? Note that in DDP, only the gradients get all-reduced, nothing else. So unless you make some all-reduce calls yourself, you don’t need to worry about that.

Hi,
I was talking about all-reducing the gradients. During dropout’s backward pass, the upstream gradients would be multiplied by different masks (since each dropout operates independently), and then we would all-reduce them. How is that conceptually equivalent to dropout in a single-GPU, mini-batch, no-data-parallelism setting? Wouldn’t this lead to unstable training when a network with dropout is scaled to multiple GPUs?
If possible, could you please elaborate on your thinking?

@vanshils Here I’m simulating what happens when gradients are computed for a model with dropout across two processes:

import torch
import torch.nn as nn


# simulated "process 0": its own seed, hence its own dropout mask
torch.manual_seed(0)
model0 = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))
input0 = torch.ones(1, 2)
model0(input0).sum().backward()

print("model 0 grad:")
print(model0[0].weight.grad)

# simulated "process 1": a different seed, hence a different dropout mask
torch.manual_seed(1)
model1 = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))

# ensure both models have the same initial weights
with torch.no_grad():
    model1[0].weight.copy_(model0[0].weight)

input1 = torch.ones(1, 2)
model1(input1).sum().backward()
print("model 1 grad:")
print(model1[0].weight.grad)

# simulate all-reduce: average the gradients across the two "processes"
reduced_grad = (model0[0].weight.grad + model1[0].weight.grad) / 2
print("reduced_grad:")
print(reduced_grad)

The first model gets contributions from both outputs, while the second model (different seed for the dropout) gets contributions from only one of the outputs (the other was suppressed by the dropout layer). What you are referring to as “unstable training” is perhaps the case where the gradient is zero for some parameters, for example when you set torch.manual_seed(2) in the above example.
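
If it helps, here is a small scan over seeds for the second simulated “process” (the reduced_grad_for_seeds helper is just something I put together for illustration; which particular seed produces an all-zero row depends on the RNG):

import torch
import torch.nn as nn


def reduced_grad_for_seeds(seed0, seed1):
    """Simulate a 2-process all-reduce: same weights, dropout masks drawn
    from two different seeds, gradients averaged."""
    grads, base_weight = [], None
    for seed in (seed0, seed1):
        torch.manual_seed(seed)
        model = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))
        if base_weight is None:
            base_weight = model[0].weight.detach().clone()
        else:
            # ensure both "processes" start from the same weights
            with torch.no_grad():
                model[0].weight.copy_(base_weight)
        model(torch.ones(1, 2)).sum().backward()
        grads.append(model[0].weight.grad)
    return (grads[0] + grads[1]) / 2


for seed1 in range(5):
    g = reduced_grad_for_seeds(0, seed1)
    zero_rows = (g.abs().sum(dim=1) == 0).sum().item()
    print(f"seed1={seed1}: rows of the averaged grad that are zero: {zero_rows}")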

So the question is more like: “Can training become unstable when some parameters receive a 0 gradient”? That’s a more general question, right?

Now take a look at this example, where we use a single model but a batch size of 2. The same thing can happen!!!

import torch
import torch.nn as nn


torch.manual_seed(0)
model0 = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Dropout(p=0.5))
# batch of 2: dropout draws an independent mask for each sample in the batch
input0 = torch.ones(2, 2)
model0(input0).sum().backward()

print("model 0 grad:")
print(model0[0].weight.grad)
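
As a quick follow-up check (continuing the snippet above), you can count how many weight rows ended up with a zero gradient, mirroring the two-process case:

# continuing from the snippet above: a row of weight.grad is zero only when
# dropout suppressed that output unit for every sample in the batch
g = model0[0].weight.grad
print("rows with zero gradient:", (g.abs().sum(dim=1) == 0).sum().item())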

Now you can ask the same question:
“Does training on a single GPU with dropout and batch size > 1 lead to unstable training”?

I think that, in general, optimization can become unstable for many reasons. Here, a gradient of 0 means we have no information that could lead to an update of that parameter, so we leave it unchanged (according to SGD). So I don’t necessarily see how that alone could lead to instability.
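
Just to make the SGD point concrete, here is a tiny toy example (not taken from any real training setup):

import torch

# plain SGD: a zero gradient in one row leaves that row of the parameter
# untouched for this step
w = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

w.grad = torch.tensor([[0.0, 0.0], [1.0, 1.0]])  # pretend row 0 was masked out
opt.step()
print(w)  # row 0 unchanged; row 1 moved by -lr * grad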

I encourage you to try the above code snippets and inspect the outputs!

Thanks a lot for such a detailed response. I am looking into the code and thinking it through.
Very grateful for this reply.