You cannot run trainer.test after trainer.fit (or multiple trainer.fit/trainer.test calls in general) in ddp mode; this only works with ddp_spawn. You need to either:
- remove the trainer.test call
- move the trainer.test call to a separate test script
- choose ddp_spawn (but it has its own limitations)
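For example, the separate-script approach could be sketched like this. The script names, checkpoint path, and `MyLightningModule` are hypothetical placeholders, and the exact Trainer arguments (`accelerator="ddp"` vs. `distributed_backend="ddp"` vs. `strategy="ddp"`) vary across Lightning versions, so check the docs for yours:

```python
# train.py -- run first with `python train.py`.
# Fits the model under ddp, then saves a checkpoint for the test script.
import pytorch_lightning as pl

from my_model import MyLightningModule  # hypothetical LightningModule

if __name__ == "__main__":
    model = MyLightningModule()
    # Argument names are version-dependent; this assumes the API of the linked docs.
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=10)
    trainer.fit(model)
    trainer.save_checkpoint("model.ckpt")
```

```python
# test.py -- run afterwards in a fresh process with `python test.py`,
# so trainer.test never shares a process with a ddp trainer.fit.
import pytorch_lightning as pl

from my_model import MyLightningModule  # hypothetical LightningModule

if __name__ == "__main__":
    model = MyLightningModule.load_from_checkpoint("model.ckpt")
    trainer = pl.Trainer(gpus=1)
    trainer.test(model)
```

Because each script is a fresh Python process, the ddp subprocesses launched during fit are fully torn down before testing begins.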
This is simply a limitation of multiprocessing and a tradeoff between ddp and ddp_spawn.
More information is in this section, towards the bottom:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel