Resume from checkpoint with elastic training

I don’t have any experience with TorchElastic, but you could pass your own ModelCheckpoint callback with a fixed filepath, so you always know where the checkpoint is saved and can point the restarted run at it.
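Something like the following might work as a starting point. It is an untested sketch, not a TorchElastic-specific recipe. It assumes a recent Lightning version, where `ModelCheckpoint` takes `dirpath` and `save_last` (older releases used the `filepath` argument instead) and `Trainer.fit` accepts `ckpt_path`. The `checkpoints` directory and the `find_resume_checkpoint` and `train` helpers are names made up for illustration, and the directory would need to be visible to every worker that may restart.

```python
import os
from typing import Optional

# Hypothetical fixed location; pick any directory all workers can reach.
CKPT_DIR = "checkpoints"


def find_resume_checkpoint(ckpt_dir: str = CKPT_DIR,
                           name: str = "last.ckpt") -> Optional[str]:
    """Return the known checkpoint path if the file exists, else None."""
    path = os.path.join(ckpt_dir, name)
    return path if os.path.isfile(path) else None


def train(model, ckpt_dir: str = CKPT_DIR) -> None:
    # Imported here so the helper above works without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    # save_last=True keeps a "last.ckpt" in dirpath, so the path is predictable.
    checkpoint_cb = ModelCheckpoint(dirpath=ckpt_dir, save_last=True)
    trainer = pl.Trainer(callbacks=[checkpoint_cb])

    # ckpt_path=None starts from scratch; a path resumes from that checkpoint.
    trainer.fit(model, ckpt_path=find_resume_checkpoint(ckpt_dir))
```

On the first launch there is no file yet, so `find_resume_checkpoint` returns `None` and training starts fresh; after a restart the same call returns the saved path and training resumes from it.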