Multiple GPUs run the script twice

Hello,
I’m looking to implement DataParallel-style training with two GPUs, where each GPU processes a separate batch. When using pl.Trainer(strategy="ddp", accelerator='gpu' if torch.cuda.is_available() else 'cpu') in a two-GPU setup, my script runs twice, as two separate processes, instead of a single script using both GPUs with each handling a distinct batch. How can I address this issue? Thanks!

@yonatansverdlov276 This is how it is supposed to work. DDP means there will be processes/scripts running in parallel, one for each GPU. Each of these will process separate batches independently. Having a “single script using both GPUs” is not possible.
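To make that concrete, here is a minimal sketch (toy model, random data, assuming Lightning >= 2.0). Each of the two DDP processes prints its own global rank and sees different batch indices, because Lightning wraps the dataloader in a DistributedSampler:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        # Each DDP process runs this with its own global rank; with 2 GPUs
        # you will see prints from rank 0 and rank 1, on different batches.
        print(f"rank={self.global_rank} batch_idx={batch_idx}")
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    trainer = pl.Trainer(
        accelerator="gpu" if torch.cuda.is_available() else "cpu",
        devices=2 if torch.cuda.is_available() else 1,
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(ToyModel(), DataLoader(dataset, batch_size=16))
```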

Why impossible? torch.nn.DataParallel does it. Just compute two batches and average all the gradients into one.
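For reference, the single-process pattern I mean looks roughly like this in plain PyTorch (toy model and tensors): DataParallel replicates the model on both GPUs, splits each input batch across them inside one process, and reduces the gradients back onto GPU 0.

```python
import torch
import torch.nn as nn

# A toy model; DataParallel replicates it onto both GPUs, scatters each
# input batch across them, and gathers the outputs back onto GPU 0,
# all inside a single process.
model = nn.Linear(32, 2).cuda()
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 32).cuda()
y = torch.randint(0, 2, (64,)).cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.cross_entropy(dp_model(x), y)
loss.backward()   # gradients are reduced onto the replica on GPU 0
optimizer.step()
```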

Support for torch.nn.DataParallel was dropped from Lightning over a year ago. It is quite inefficient for realistic workloads, has lots of caveats, and PyTorch doesn’t recommend it.

See also:
Warnings here: DataParallel — PyTorch 2.2 documentation
Deprecation discussion here: [POLL][RFC] DataParallel Deprecation · Issue #65936 · pytorch/pytorch · GitHub

So assuming I have 2 GPUs, how can I use them to make my runs faster?


You can just set devices=2 in the Trainer:
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html
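Something like this (a minimal sketch; the CPU fallback mirrors your original snippet, assuming Lightning >= 2.0):

```python
import torch
import lightning.pytorch as pl

# One process per device is launched automatically under DDP;
# fall back to a single CPU process if no GPU is available.
trainer = pl.Trainer(
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices=2 if torch.cuda.is_available() else 1,
    strategy="ddp",
)
```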

But it will run the same script twice, no?


Well yes, that’s a basic requirement. If you want to run on 2 GPUs in parallel, then that means the code is executed “twice”, in parallel.

The problem I’m facing now is somewhat related to this topic. I run training with multiple GPUs using the ddp or deepspeed strategy. Once training is finished, I want to load the best model from the saved checkpoints with checkpoint_callback.best_model_path, but this gets executed multiple times and only one of the processes points to an existing checkpoint; the others end up with a FileNotFound exception, which causes the whole set of processes to be terminated, so I can’t run a test on the best model. How can I handle this situation? Maybe I should separate the test part into an independent program.

This shouldn’t be the case. Normally, all processes track which path is saved, so I’m surprised by your observation. Maybe you included a custom name in the checkpoint folder, like a timestamp or something else that differs across the processes?
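If you want to avoid relying on the path string from every process, one option is to let the Trainer resolve the best checkpoint itself (a sketch, assuming the default ModelCheckpoint callback is attached; model and test_loader stand in for your own objects):

```python
# After trainer.fit(...), "best" resolves to the checkpoint tracked by
# the ModelCheckpoint callback, consistently across the DDP processes.
trainer.test(model, dataloaders=test_loader, ckpt_path="best")

# If you only need to inspect the path, do it on rank 0 to avoid
# racing on the filesystem from every process.
if trainer.is_global_zero:
    print("best checkpoint:", trainer.checkpoint_callback.best_model_path)
```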

Maybe I should seperate the test part as an independent program.

Independent of the above problem, yes, this is what I would consider best practice. It’s also more convenient: if you only want to test, you don’t have to “skip” fitting, if you know what I mean.
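Roughly, such a standalone test script could look like this (the module, dataset helper, and checkpoint path are placeholders for your own code, assuming Lightning >= 2.0):

```python
import lightning.pytorch as pl
from torch.utils.data import DataLoader

from my_project import MyLightningModule, build_test_dataset  # hypothetical names

# Load the checkpoint produced by the training run (path is a placeholder).
model = MyLightningModule.load_from_checkpoint("checkpoints/best.ckpt")

# Testing on a single device avoids any duplication from DDP processes.
trainer = pl.Trainer(accelerator="auto", devices=1)
trainer.test(model, dataloaders=DataLoader(build_test_dataset(), batch_size=32))
```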

My guess is that those missing checkpoints have been deleted according to the top-k rule.
Anyway, I’m in the process of refactoring the code to separate the test part from the training part.
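If top-k pruning is indeed the cause, pinning the checkpoint behaviour explicitly should at least make the surviving files predictable (a sketch; the directory and monitored metric are placeholders, assuming Lightning >= 2.0):

```python
from lightning.pytorch.callbacks import ModelCheckpoint

# Keep the 3 best checkpoints by validation loss; anything beyond
# top-k gets deleted, which can leave previously reported paths stale.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",  # placeholder directory
    monitor="val_loss",      # placeholder metric name
    save_top_k=3,
    save_last=True,          # also keep last.ckpt as a stable fallback
)
```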