Line 122 in `pytorch_lightning.strategies.fsdp`:

```python
self.num_nodes = 1
```
`num_nodes` is always 1 in `FSDPStrategy`. As a result, the distributed sampler gets the wrong parameters:

```python
def distributed_sampler_kwargs(self) -> Dict:
    return dict(num_replicas=(self.num_nodes * self.num_processes), rank=self.global_rank)
```
For example, if I start a task with 8 GPUs across 2 nodes (4 GPUs each), the distributed sampler will get `num_replicas = 1 * 4 = 4` instead of 8, while `global_rank` can be 4, 5, 6, or 7, i.e. greater than `num_replicas - 1`.
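As a quick illustration of why this breaks, here is a minimal sketch (with hypothetical values matching the scenario above) of what happens when PyTorch's `DistributedSampler` receives a `rank` greater than `num_replicas - 1`; recent PyTorch versions reject this outright:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(16))

# With num_nodes stuck at 1 and 4 processes per node, Lightning would pass
# num_replicas=4 even though ranks 4-7 exist on the second node.
try:
    sampler = DistributedSampler(dataset, num_replicas=4, rank=5)
except ValueError as err:
    print(err)  # e.g. "Invalid rank 5, rank should be in the interval [0, 3]"
```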
Is this a bug in Lightning? If not, is there a way to address the problem?
Is this still an issue after this fix?
## What does this PR do?
Ports the changes from #17160 to 2.0.x.
Oh, thanks. It seems my PL is not up to date. By the way, if I want to use FSDP in a way similar to DDPSharded, what can I do?
By DDPSharded you probably mean sharding the optimizer state and gradients (also known as ZeRO-2). For this, you can set:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import FSDPStrategy
from torch.distributed.fsdp import ShardingStrategy

trainer = Trainer(..., strategy=FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP))
```
See the options for `ShardingStrategy` in the PyTorch FSDP docs.
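For reference, a short runnable sketch of the main `ShardingStrategy` members from `torch.distributed.fsdp`; the ZeRO correspondences in the comments are my summary, not the official docs wording:

```python
from torch.distributed.fsdp import ShardingStrategy

# Rough ZeRO correspondence (my summary):
#   FULL_SHARD    -> shard params + grads + optimizer state (ZeRO-3, FSDP's default)
#   SHARD_GRAD_OP -> shard grads + optimizer state only (ZeRO-2, closest to DDPSharded)
#   NO_SHARD      -> no sharding; behaves like plain DDP
for strategy in (ShardingStrategy.FULL_SHARD, ShardingStrategy.SHARD_GRAD_OP, ShardingStrategy.NO_SHARD):
    print(strategy.name)
```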
We are currently working on a full FSDP guide where I will also mention the different sharding strategies and when to use them for maximizing throughput and memory efficiency.
Thanks for your great work! I will give it a try.