Resume from checkpoint with elastic training

This doesn’t work for us. On the first run of training, there is no checkpoint to resume from. However, if training fails for some reason, torchelastic will transparently restart the training job, and at that point we want to pick up the latest checkpoint and resume from it.

We could hardcode this. One approach would be to agree on the checkpoint directory ahead of time. If it exists, we would look for the latest checkpoint inside that directory to resume from, likely based on the file naming or creation time. But this feels a bit clunky. I’m open to suggestions for how to write the training code so that it is as agnostic as possible to being restarted after a failure.
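For concreteness, here is a minimal sketch of that "hardcoded" approach. The checkpoint directory, file naming scheme, and state-dict layout below are just placeholders I'm assuming for illustration; the only real requirement is that the directory survives the restart (e.g. lives on shared storage):

```python
import glob
import os

import torch
import torch.nn as nn

# Hypothetical path; must be persistent/shared so it is still there after a restart.
CHECKPOINT_DIR = "/shared/checkpoints/my-job"


def find_latest_checkpoint(checkpoint_dir: str):
    """Return the newest checkpoint file in the directory, or None on a fresh start."""
    candidates = glob.glob(os.path.join(checkpoint_dir, "*.pt"))
    if not candidates:
        return None
    # Newest by modification time; sorting by name also works if names encode the step.
    return max(candidates, key=os.path.getmtime)


def save_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer, step: int) -> None:
    """Write model/optimizer/step state under a step-numbered filename."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        os.path.join(CHECKPOINT_DIR, f"step_{step:08d}.pt"),
    )


def load_or_init(model: nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Resume from the latest checkpoint if one exists; otherwise start at step 0."""
    ckpt_path = find_latest_checkpoint(CHECKPOINT_DIR)
    if ckpt_path is None:
        return 0  # first run, nothing to resume from
    # We were restarted (e.g. by torchelastic): load the latest saved state.
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```

The training loop would then call `load_or_init` unconditionally at startup and `save_checkpoint` periodically, so the code path is the same whether it's a fresh start or a restart. This is exactly the approach that feels clunky to me, mostly because of the implicit contract around the directory and naming scheme, hence the question.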