Hi,
I’m running a DDP training loop inside a remote function, using Pyro4 to communicate between the calling object and the remote function. I use ddp_spawn
as suggested. My expectation is to be able to send the training metrics back via the callback function. The Pyro4 code is stable and works with CPUs, with a single GPU, and when a dummy result is sent.
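Roughly, the setup looks like this (a minimal sketch, assuming a Pyro4 daemon is already serving the caller's callback object; TrainerService and metrics_ready are placeholder names, not my actual code, and MyLightningModule is sketched in the next snippet):

```python
import Pyro4
import pytorch_lightning as pl

@Pyro4.expose
class TrainerService:
    def train(self, callback_uri):
        # Proxy back to the calling object; metrics should flow through this
        callback = Pyro4.Proxy(callback_uri)
        trainer = pl.Trainer(
            accelerator="gpu",
            devices=2,
            strategy="ddp_spawn",  # as suggested for this setup
        )
        trainer.fit(MyLightningModule())
        # Sending a dummy result like this works reliably:
        callback.metrics_ready({"status": "done"})
```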
However, when I try to log trainer metrics with self.log or self.log_dict while training with multiple GPUs, the callback fails to trigger correctly. The code trains fine without the logging statements.
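For reference, the failing logging call is just the standard one inside training_step; here is a minimal stand-in for my module (the linear layer and loss are illustrative, not my real model):

```python
import torch
import pytorch_lightning as pl

class MyLightningModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()
        # With this line removed, training completes and the Pyro4
        # callback fires; with it, the callback fails on multiple GPUs.
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```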
Is there a workaround for this? I need to be able to send the logged metrics. I also tried using the wandb and tensorboard libraries directly (not the built-in Loggers), and that throws an error during pickling.
Any suggestions would help.
Thank you!