For example, if I start a job with 8 GPUs across 2 nodes (4 GPUs each), the DistributedSampler gets num_replicas=4, but global_rank can be 4, 5, 6, or 7, so global_rank ends up outside the valid range [0, num_replicas).
Is this a bug in Lightning? If not, is there a way to work around it?
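For reference, here is a minimal sketch of how I understand the sampler is supposed to be configured in this setup. It assumes torch.distributed is already initialized; the point is that num_replicas should be the global world size (8 here), not the per-node device count (4), so every global rank stays in range:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_dataloader(dataset):
    # 2 nodes * 4 GPUs = 8 processes in total
    world_size = dist.get_world_size()
    # Global rank across both nodes: 0..7
    global_rank = dist.get_rank()
    # This invariant breaks if num_replicas were set to 4 (devices per node)
    assert global_rank < world_size

    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,  # global world size, not devices per node
        rank=global_rank,
        shuffle=True,
    )
    return DataLoader(dataset, batch_size=32, sampler=sampler)
```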
We are currently working on a full FSDP guide, where I will also cover the different sharding strategies and when to use them to maximize throughput and memory efficiency.