Logging metrics when training with "ddp_spawn"


I’m using a Lightning script as part of a hyperparameter optimisation script. I tried training with strategy="ddp", but I hit DDP synchronisation errors:

DDP expects same model across all ranks, ...

Since the training script is not callable from the command line, I use strategy="ddp_spawn" instead. The model trains and I am able to save the model state via checkpoints, but the logged_metrics property of the Trainer is empty after training. I assume this is because a child process runs the training while the main process waits. I can see values being logged during training.

Is it possible to obtain the logged metrics from the Trainer object after training finishes? Or do I have to save them to a file (or as part of the checkpoint) and load them back after fit()?
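For reference, the file-based workaround I had in mind would look roughly like this (a stdlib-only sketch; the function names and the `metrics.json` path are mine, and in Lightning the dump would be triggered from a Callback's on_fit_end, guarded to rank 0):

```python
import json
from pathlib import Path

def dump_metrics(metrics: dict, path: str = "metrics.json") -> None:
    # Called at the end of training inside the spawned process,
    # e.g. from a Callback's on_fit_end on rank 0.
    Path(path).write_text(json.dumps({k: float(v) for k, v in metrics.items()}))

def load_metrics(path: str = "metrics.json") -> dict:
    # Called in the main process after trainer.fit() has returned.
    return json.loads(Path(path).read_text())
```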

Thank you :slight_smile:

The logged_metrics property doesn’t get synced back to the main process; most of the Trainer state doesn’t. But callback_metrics does, and it includes all the metrics logged on rank 0. So you should be able to read `trainer.callback_metrics` in the main process after fit() returns.
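The behaviour can be pictured with a plain multiprocessing sketch (stdlib only, not actual Lightning code): the child process that "trains" sends back only a small metrics payload, while the rest of its state stays behind; this mirrors how callback_metrics reaches the main process but most of the Trainer state does not:

```python
import multiprocessing as mp

def train(conn) -> None:
    # Runs in the child process (like the spawned rank-0 worker):
    # it "trains", then ships only a small metrics payload back.
    trainer_state = {"optimizer": "...", "logged_metrics": {"val_loss": 0.42}}
    conn.send(trainer_state["logged_metrics"])  # only the metrics cross the boundary
    conn.close()

# the fork start method keeps this sketch import-safe without a __main__ guard
ctx = mp.get_context("fork")
parent_conn, child_conn = ctx.Pipe(duplex=False)
p = ctx.Process(target=train, args=(child_conn,))
p.start()
metrics = parent_conn.recv()  # what the main process sees, akin to callback_metrics
p.join()
print(metrics)  # {'val_loss': 0.42}
```

Everything else the child built (the optimizer, the full Trainer state) simply dies with the process, which is why logged_metrics looks empty from the outside.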
