Multiple GPUs run the script twice

Hello,
I’m looking to implement DataParallel-style training with two GPUs, where each GPU processes a separate batch. When using pl.Trainer(strategy="ddp", accelerator='gpu' if torch.cuda.is_available() else 'cpu') in a two-GPU setup, my script runs twice, as two separate processes, instead of a single script using both GPUs with each handling a distinct batch. How can I address this issue? Thanks!

@yonatansverdlov276 This is how it is supposed to work. DDP means there will be processes/scripts running in parallel, one for each GPU. Each of these will process separate batches independently. Having a “single script using both GPUs” is not possible.
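To make that concrete, here is a minimal sketch (toy model, random data, assuming Lightning >= 2.0). Each of the two DDP processes prints its own global rank and sees different batch indices, because Lightning wraps the dataloader in a DistributedSampler:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        # Each DDP process runs this with its own global rank; with 2 GPUs
        # you will see prints from rank 0 and rank 1, on different batches.
        print(f"rank={self.global_rank} batch_idx={batch_idx}")
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    trainer = pl.Trainer(
        accelerator="gpu" if torch.cuda.is_available() else "cpu",
        devices=2 if torch.cuda.is_available() else 1,
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(ToyModel(), DataLoader(dataset, batch_size=16))
```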

Why impossible? torch.nn.DataParallel does it. Just compute two batches and average all the gradients into one.
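For reference, the single-process pattern I mean looks roughly like this in plain PyTorch (toy model and tensors): DataParallel replicates the model on both GPUs, splits each input batch across them inside one process, and reduces the gradients back onto GPU 0.

```python
import torch
import torch.nn as nn

# A toy model; DataParallel replicates it onto both GPUs, scatters each
# input batch across them, and gathers the outputs back onto GPU 0,
# all inside a single process.
model = nn.Linear(32, 2).cuda()
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 32).cuda()
y = torch.randint(0, 2, (64,)).cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.cross_entropy(dp_model(x), y)
loss.backward()   # gradients are reduced onto the replica on GPU 0
optimizer.step()
```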

Support for torch.nn.DataParallel was dropped from Lightning over a year ago. It is quite inefficient for realistic workloads, has lots of caveats, and PyTorch doesn’t recommend it.

See also:
Warnings here: DataParallel — PyTorch 2.2 documentation
Deprecation discussion here: [POLL][RFC] DataParallel Deprecation · Issue #65936 · pytorch/pytorch · GitHub

So assuming I have 2 GPUs, how can I use them to make my runs faster?


You can just set devices=2 in the Trainer:
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html
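Something like this (a minimal sketch; the CPU fallback mirrors your original snippet, assuming Lightning >= 2.0):

```python
import torch
import lightning.pytorch as pl

# One process per device is launched automatically under DDP;
# fall back to a single CPU process if no GPU is available.
trainer = pl.Trainer(
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices=2 if torch.cuda.is_available() else 1,
    strategy="ddp",
)
```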

But it will run the same script twice, no?


Well yes, that’s a basic requirement. If you want to run on 2 GPUs in parallel, then that means the code is executed “twice”, in parallel.

The problem I’m facing now is somewhat related to this topic. I run training with multiple GPUs using the ddp or deepspeed strategy. Once training is finished, I want to load the best model from the saved checkpoints with checkpoint_callback.best_model_path, but this gets executed multiple times and only one of the processes points to an existing checkpoint; the others end up with a FileNotFound exception, which causes the whole set of processes to be terminated, so I can’t run a test on the best model. How can I handle this situation? Maybe I should separate the test part into an independent program.

This shouldn’t be the case. Normally, all processes track which path is saved, so I’m surprised by your observation. Maybe you included a custom name in the checkpoint folder, like a timestamp or something else that differs across the processes?
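If you want to avoid relying on the path string from every process, one option is to let the Trainer resolve the best checkpoint itself (a sketch, assuming the default ModelCheckpoint callback is attached; model and test_loader stand in for your own objects):

```python
# After trainer.fit(...), "best" resolves to the checkpoint tracked by
# the ModelCheckpoint callback, consistently across the DDP processes.
trainer.test(model, dataloaders=test_loader, ckpt_path="best")

# If you only need to inspect the path, do it on rank 0 to avoid
# racing on the filesystem from every process.
if trainer.is_global_zero:
    print("best checkpoint:", trainer.checkpoint_callback.best_model_path)
```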

Maybe I should seperate the test part as an independent program.

Independent of the above problem, yes, this is what I would consider best practice. It’s also more convenient: if you only want to test, you don’t have to “skip” fitting, if you know what I mean.
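Roughly, such a standalone test script could look like this (the module, dataset helper, and checkpoint path are placeholders for your own code, assuming Lightning >= 2.0):

```python
import lightning.pytorch as pl
from torch.utils.data import DataLoader

from my_project import MyLightningModule, build_test_dataset  # hypothetical names

# Load the checkpoint produced by the training run (path is a placeholder).
model = MyLightningModule.load_from_checkpoint("checkpoints/best.ckpt")

# Testing on a single device avoids any duplication from DDP processes.
trainer = pl.Trainer(accelerator="auto", devices=1)
trainer.test(model, dataloaders=DataLoader(build_test_dataset(), batch_size=32))
```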

My guess is that those missing checkpoints have been deleted according to the top-k rule.
Anyway, I’m in the process of refactoring the code to separate the test part from the training part.
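If top-k pruning is indeed the cause, pinning the checkpoint behaviour explicitly should at least make the surviving files predictable (a sketch; the directory and monitored metric are placeholders, assuming Lightning >= 2.0):

```python
from lightning.pytorch.callbacks import ModelCheckpoint

# Keep the 3 best checkpoints by validation loss; anything beyond
# top-k gets deleted, which can leave previously reported paths stale.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",  # placeholder directory
    monitor="val_loss",      # placeholder metric name
    save_top_k=3,
    save_last=True,          # also keep last.ckpt as a stable fallback
)
```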