How to properly use multiple trainers with ddp in one script?

zhiqiangdon · September 10, 2023, 5:14am

Hi,

For multi-GPU training with DDP, trainer.fit() launches multiple processes, each of which runs the script from scratch. However, this causes deadlocks if I create multiple trainers and call their fit() sequentially. What’s the proper way to use multiple trainers with DDP in one script?

zhiqiangdon · September 10, 2023, 5:20am

I ask because our MultiModalPredictor (see "AutoGluon Multimodal - Quick Start" in the AutoGluon 0.8.2 documentation) uses Lightning as the backend. Behind each MultiModalPredictor API call, such as fit(), predict(), and predict_proba(), we create one Lightning trainer and call that trainer’s fit() or predict() API. Users generally make multiple MultiModalPredictor calls in one script, which leads to my question here. Thanks!
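To make the call pattern concrete, here is a minimal sketch of a predictor that builds one trainer per API call. It is not AutoGluon's actual implementation: `ToyPredictor`, `default_trainer_factory`, and the `trainer_factory` parameter are hypothetical names for illustration, and the Trainer arguments (two GPUs, `strategy="ddp"`, one epoch) are assumptions. Lightning is imported lazily so the pattern itself can be read and exercised without GPUs.

```python
from typing import Any, Callable


def default_trainer_factory() -> Any:
    """Build a fresh DDP trainer. Assumes Lightning 2.x and two visible GPUs."""
    import lightning.pytorch as pl  # lazy import; package layout is an assumption

    return pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)


class ToyPredictor:
    """Hypothetical stand-in for a predictor that creates one trainer per API call."""

    def __init__(
        self,
        model: Any,
        trainer_factory: Callable[[], Any] = default_trainer_factory,
    ) -> None:
        self.model = model
        self.trainer_factory = trainer_factory
        self.trainers_created = 0  # counts how many trainers this script has built

    def _new_trainer(self) -> Any:
        self.trainers_created += 1
        return self.trainer_factory()

    def fit(self, train_loader: Any) -> None:
        # First trainer: with strategy="ddp", this call is where the extra
        # processes are launched and the script is re-run from the top.
        self._new_trainer().fit(self.model, train_dataloaders=train_loader)

    def predict(self, loader: Any) -> Any:
        # Second, independent trainer created later in the same script.
        return self._new_trainer().predict(self.model, dataloaders=loader)
```

A user script that calls `fit()` and then `predict()` on such an object ends up with two separate DDP trainers in one script, which is the situation the question is about.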
