Line 122 in `pytorch_lightning.strategies.fsdp`:

```python
self.num_nodes = 1
```
`num_nodes` is always 1 in `FSDPStrategy`. As a result, the distributed sampler gets the wrong parameters:

```python
def distributed_sampler_kwargs(self) -> Dict:
    return dict(num_replicas=(self.num_nodes * self.num_processes), rank=self.global_rank)
```
For example, if I start a task with 8 GPUs across 2 nodes (4 GPUs each), the distributed sampler will get `num_replicas = 1 * 4 = 4` instead of 8, while `global_rank` can be 4, 5, 6, or 7, i.e. greater than `num_replicas - 1`.
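As a quick illustration of why this breaks, here is a minimal sketch (with hypothetical values matching the scenario above) of what happens when PyTorch's `DistributedSampler` receives a `rank` greater than `num_replicas - 1`; recent PyTorch versions reject this outright:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(16))

# With num_nodes stuck at 1 and 4 processes per node, Lightning would pass
# num_replicas=4 even though ranks 4-7 exist on the second node.
try:
    sampler = DistributedSampler(dataset, num_replicas=4, rank=5)
except ValueError as err:
    print(err)  # e.g. "Invalid rank 5, rank should be in the interval [0, 3]"
```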
Is this a bug in Lightning? If not, is there a way to address the problem?
Is this still an issue after this fix?
## What does this PR do?
Ports the changes from #17160 to 2.0.x.
Oh, thanks. It seems my PL is not up to date. By the way, if I want to use FSDP in a way similar to DDPSharded, what can I do?
By DDPSharded you probably mean sharding the optimizer state and gradients (also known as ZeRO-2). For this, you can set:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import FSDPStrategy
from torch.distributed.fsdp import ShardingStrategy

trainer = Trainer(..., strategy=FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP))
```
See the options for `ShardingStrategy` in the PyTorch FSDP docs.
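For reference, a short runnable sketch of the main `ShardingStrategy` members from `torch.distributed.fsdp`; the ZeRO correspondences in the comments are my summary, not the official docs wording:

```python
from torch.distributed.fsdp import ShardingStrategy

# Rough ZeRO correspondence (my summary):
#   FULL_SHARD    -> shard params + grads + optimizer state (ZeRO-3, FSDP's default)
#   SHARD_GRAD_OP -> shard grads + optimizer state only (ZeRO-2, closest to DDPSharded)
#   NO_SHARD      -> no sharding; behaves like plain DDP
for strategy in (ShardingStrategy.FULL_SHARD, ShardingStrategy.SHARD_GRAD_OP, ShardingStrategy.NO_SHARD):
    print(strategy.name)
```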
We are currently working on a full FSDP guide where I will also mention the different sharding strategies and when to use them for maximizing throughput and memory efficiency.
Thanks for your great work! I will give it a try.