FSDPStrategy num_nodes is always 1

Line 122 in pytorch_lightning.strategies.fsdp:

        self.num_nodes = 1

num_nodes is always 1 in FSDPStrategy. As a result, the distributed sampler gets the wrong parameters:

    def distributed_sampler_kwargs(self) -> Dict:
        return dict(num_replicas=(self.num_nodes * self.num_processes), rank=self.global_rank)

For example, if I start a job with 8 GPUs across 2 nodes (4 GPUs each), the distributed sampler gets num_replicas=4, but global_rank can be 4, 5, 6, or 7, so global_rank ends up greater than or equal to num_replicas.
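The mismatch can be sketched with plain arithmetic (the node and GPU counts below are the hypothetical 2-node, 4-GPU-per-node setup from the example, not values read from Lightning):

```python
# Hypothetical setup: 2 nodes with 4 GPUs each (assumed for illustration).
nodes = 2
gpus_per_node = 4
world_size = nodes * gpus_per_node  # 8 processes in total

# FSDPStrategy hardcodes num_nodes = 1, so distributed_sampler_kwargs yields:
num_nodes = 1  # the reported bug: the real node count is 2
num_replicas = num_nodes * gpus_per_node  # 4, but should be 8

# Global ranks still run over the full world size:
global_ranks = list(range(world_size))  # 0..7

# Ranks 4-7 fall outside the valid interval [0, num_replicas - 1]
# expected by torch.utils.data.DistributedSampler:
invalid_ranks = [r for r in global_ranks if r >= num_replicas]
print(invalid_ranks)  # [4, 5, 6, 7]
```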

Is this a bug in lightning? If it is not, is there a way to address the problem?

Is this still an issue after this fix?

Oh, thanks. It seems my PyTorch Lightning is not up to date. By the way, if I want to use FSDP in a way similar to DDPSharded, what can I do?

By DDPSharded you probably mean sharding the optimizer state and gradients (also known as ZeRO-2). For this, you can set

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import FSDPStrategy
    from torch.distributed.fsdp import ShardingStrategy

    trainer = Trainer(..., strategy=FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP))

See the options for ShardingStrategy in the PyTorch FSDP docs.

We are currently working on a full FSDP guide that will also cover the different sharding strategies and when to use them to maximize throughput and memory efficiency.

Thanks for your great work! I will give it a try. :hugs: