Hi,
I'm using a Lightning training script as part of a hyperparameter-optimisation script. I first tried `strategy="ddp"`, but hit DDP synchronisation errors:
`DDP expects same model across all ranks, ...`
Since the training script is not callable from the command line, I switched to `strategy="ddp_spawn"`. The model trains and I am able to save the model state via checkpoints, but the `logged_metrics` property of the Trainer is empty after training. I assume this is because a child process runs the training while the main process waits; I can see values being logged during training.
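For reference, this is roughly what my setup looks like (the model and datamodule names are placeholders for objects defined elsewhere in my script):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=2,
    strategy="ddp_spawn",  # "ddp" raises the synchronisation error described above
)
trainer.fit(model, datamodule=dm)  # model/dm are created by the optimisation script

print(trainer.logged_metrics)  # empty dict here in the main process
```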
Is it possible to obtain the logged metrics from the Trainer object after training is done? Or do I have to save them to a file (or as part of the checkpoint) and load them after `fit()` returns?
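If there is no built-in way, my fallback would be something like the sketch below: a callback that dumps the metrics to disk from the rank-zero spawned worker, which the main process reads back after `fit()` (the file name is just an example):

```python
import json

import pytorch_lightning as pl


class MetricsDump(pl.Callback):
    """Write the trainer's logged metrics to a JSON file at the end of fit."""

    def on_fit_end(self, trainer, pl_module):
        if trainer.is_global_zero:  # only the rank-0 worker writes the file
            metrics = {k: float(v) for k, v in trainer.logged_metrics.items()}
            with open("logged_metrics.json", "w") as f:
                json.dump(metrics, f)


# back in the main process, after trainer.fit() returns:
# with open("logged_metrics.json") as f:
#     metrics = json.load(f)
```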
Thank you