I’m using a Lightning training script as part of a hyperparameter optimisation loop. I first tried `strategy="ddp"`, but ran into DDP synchronisation errors:

    DDP expects same model across all ranks, ...
Since the training script is not callable from the command line, I switched to `strategy="ddp_spawn"`. The model trains, and I am able to save the model state via checkpoints. However, the `logged_metrics` property of the `Trainer` is empty after training, even though I can see values being logged while training runs. I assume this is because a spawned child process runs the training while the main process waits, so the metrics never make it back to the parent.
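To illustrate what I think is happening (this is a simplified stand-alone sketch, not Lightning code): a dict mutated inside a child process is only the child's copy, so the parent's copy stays empty. I use a `fork` context here so the sketch runs without pickling; `ddp_spawn` uses `spawn`, but the process isolation shown is the same.

```python
import multiprocessing as mp

def _fit(metrics):
    # Stands in for training inside the spawned worker:
    # this fills the *child's* copy of the dict only.
    metrics["train_loss"] = 0.123

def run_demo():
    ctx = mp.get_context("fork")
    metrics = {}  # parent-side dict, analogous to trainer.logged_metrics
    p = ctx.Process(target=_fit, args=(metrics,))
    p.start()
    p.join()
    # The parent's dict is untouched: the child worked on a copy.
    return metrics

if __name__ == "__main__":
    print(run_demo())  # -> {}
```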
Is it possible to obtain the logged metrics from the `Trainer` object after training is done? Or do I have to save them to a file (or as part of the checkpoint) from within the training process and then load them back once training finishes?
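For reference, the fallback I have in mind looks roughly like this (a simplified sketch without Lightning; the function names are hypothetical): the training process dumps its metrics to a JSON file at the end of fitting, and the main process reads them back afterwards.

```python
import json
import os
import tempfile

def save_metrics(metrics, path):
    # What rank 0 would do at the end of training
    # (e.g. from an on_fit_end hook in a callback).
    with open(path, "w") as f:
        json.dump(metrics, f)

def load_metrics(path):
    # Back in the main process, after trainer.fit() returns.
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "metrics.json")
    save_metrics({"val_loss": 0.456}, path)
    print(load_metrics(path))  # -> {'val_loss': 0.456}
```

It works, but it feels like a workaround, which is why I’m asking whether the metrics can be recovered from the `Trainer` directly.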