You cannot run trainer.test after trainer.fit (or multiple trainer.fit/trainer.test calls in general) in ddp mode; this only works with ddp_spawn. You need to either:
- remove the trainer.test call
- move the trainer.test call to a separate test script
- choose ddp_spawn (but it has its own limitations)
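For example, the separate-script approach could be sketched like this. The script names, checkpoint path, and `MyLightningModule` are hypothetical placeholders, and the exact Trainer arguments (`accelerator="ddp"` vs. `distributed_backend="ddp"` vs. `strategy="ddp"`) vary across Lightning versions, so check the docs for yours:

```python
# train.py -- run first with `python train.py`.
# Fits the model under ddp, then saves a checkpoint for the test script.
import pytorch_lightning as pl

from my_model import MyLightningModule  # hypothetical LightningModule

if __name__ == "__main__":
    model = MyLightningModule()
    # Argument names are version-dependent; this assumes the API of the linked docs.
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=10)
    trainer.fit(model)
    trainer.save_checkpoint("model.ckpt")
```

```python
# test.py -- run afterwards in a fresh process with `python test.py`,
# so trainer.test never shares a process with a ddp trainer.fit.
import pytorch_lightning as pl

from my_model import MyLightningModule  # hypothetical LightningModule

if __name__ == "__main__":
    model = MyLightningModule.load_from_checkpoint("model.ckpt")
    trainer = pl.Trainer(gpus=1)
    trainer.test(model)
```

Because each script is a fresh Python process, the ddp subprocesses launched during fit are fully torn down before testing begins.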
This is simply a limitation of multiprocessing and a tradeoff between ddp and ddp_spawn.
More information is in this section, towards the bottom:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel