Hi,
For multi-GPU training with `ddp`, `trainer.fit()` launches multiple processes, each of which runs the script from scratch. However, this causes deadlocks if I create multiple trainers and call their `fit()` sequentially. What's the proper way to use multiple trainers with `ddp` in one script?
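To make the "runs the script from scratch" part concrete, here is a simplified, hypothetical model of script re-execution using only the standard library. It is not Lightning's actual launcher code. The names `build_rank_launches` and `already_launched` and the default port are my own illustrative assumptions. The environment variable names (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `LOCAL_RANK`) are ones commonly used in PyTorch distributed setups. The sketch only builds the per-rank commands and never spawns anything.

```python
import os
import sys
from typing import Dict, List, Tuple


def build_rank_launches(
    num_devices: int,
    argv: List[str],
    base_env: Dict[str, str],
    master_port: int = 29500,  # hypothetical default port
) -> List[Tuple[List[str], Dict[str, str]]]:
    """Return one (command, environment) pair per non-zero local rank.

    Rank 0 is the process the user started and keeps running the script.
    Every other rank is a new interpreter that re-runs the SAME script from
    the top, distinguished only by its environment variables. So any code
    before and after trainer.fit(), including later trainers, runs again in
    every child.
    """
    launches = []
    for local_rank in range(1, num_devices):
        env = dict(base_env)  # copy so the parent's environment is untouched
        env["MASTER_ADDR"] = "127.0.0.1"
        env["MASTER_PORT"] = str(master_port)
        env["WORLD_SIZE"] = str(num_devices)
        env["LOCAL_RANK"] = str(local_rank)
        command = [sys.executable] + list(argv)  # same script, same arguments
        launches.append((command, env))
    return launches


def already_launched(env: Dict[str, str]) -> bool:
    # In this model, a re-executed child sees LOCAL_RANK and must not
    # spawn another set of workers.
    return "LOCAL_RANK" in env


if __name__ == "__main__":
    # Print (not run) the commands a 4-device launch would use in this model.
    for cmd, env in build_rank_launches(4, sys.argv, dict(os.environ)):
        print(env["LOCAL_RANK"], cmd)
```

In this model, every extra trainer in the script is also reached by each child process. Unless all ranks take exactly the same path through the script, they can end up waiting on each other.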
I ask because our `MultiModalPredictor` (AutoGluon Multimodal - Quick Start - AutoGluon 0.8.2 documentation) uses Lightning as its backend. Behind each `MultiModalPredictor` API call, such as `fit()`, `predict()`, and `predict_proba()`, we create one Lightning trainer and call the trainer's `fit()` or `predict()` API. Users generally make multiple `MultiModalPredictor` calls in one script, which is what leads to my question. Thanks!