Hello,
I am trying to run a CNN model on my MacBook, which has an Apple M1 chip. From what I know, PyTorch Lightning already supports the Apple M1 for multi-GPU training, but I am unable to find a detailed tutorial on how to use it, so I tried the following based on the documentation I could find.
- I create the trainer with the "mps" accelerator and devices=1. From the documents I read, my understanding is that I should use devices=1 and Lightning will use the multiple GPU cores automatically.
trainer = pl.Trainer(
    accelerator='mps',
    devices=1,
    strategy="ddp",
    callbacks=[checkpoint_callback, lr_monitor],
    logger=tb_logger,
)
- I created a class that inherits from LightningModule.
class Moco_v2(LightningModule):
In this class, I call the following two functions to get the rank of the current worker and the total number of workers:
torch.distributed.get_rank()
torch.distributed.get_world_size()
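For context, here is a minimal sketch of where these calls sit (the real __init__ takes more arguments; this is stripped down):

import torch.distributed
from pytorch_lightning import LightningModule

class Moco_v2(LightningModule):
    def __init__(self):
        super().__init__()
        # Both calls fail here, because at this point no default
        # process group has been initialized yet:
        current_rank = torch.distributed.get_rank()
        total_workers = torch.distributed.get_world_size()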
But I got the following error:
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
So I need to call init_process_group somewhere. I guess I could just call it in the __init__ of this class, something like this:
torch.distributed.init_process_group("gloo", world_size=total_workers, rank=current_rank)
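From what I read, the default "env://" rendezvous also needs MASTER_ADDR and MASTER_PORT to be set, so presumably the full call would look roughly like this (just a sketch for a single machine; I am not sure it is correct):

import os
import torch.distributed

# Assumption: everything runs on one machine, so the rendezvous
# address can just point at localhost:
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

torch.distributed.init_process_group(
    "gloo",
    world_size=total_workers,  # total number of processes?
    rank=current_rank,         # index of this process, 0..world_size-1?
)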
But I am not sure what values should be passed to the parameters world_size and rank. For world_size, can I just use the total number of GPU cores on the M1? For rank, apparently I cannot call torch.distributed.get_rank() before the process group exists. Actually, this question is not specific to the Apple M1; I also want to know how to call init_process_group for accelerator='gpu'.
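While searching, I also noticed that a LightningModule seems to expose this information through the attached Trainer, e.g. self.global_rank and self.trainer.world_size. Would reading those be the right approach instead of calling torch.distributed directly? Roughly like this (untested sketch):

from pytorch_lightning import LightningModule

class Moco_v2(LightningModule):
    def setup(self, stage):
        # By the time setup() runs, the Trainer is attached, so these
        # should work without touching torch.distributed directly:
        current_rank = self.global_rank
        total_workers = self.trainer.world_size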
Is there a tutorial showing some examples of how to use DDP in a LightningModule on an Apple M1?
Thanks.