I have a run on a remote cluster whose model checkpoints are automatically uploaded to wandb as artifacts. I'd like to take advantage of this to resume runs locally: when ckpt_path="last" cannot find a checkpoint on my local machine, I want to download the latest checkpoint from wandb and reload the LightningModule from it.
I naively tried resuming the run with the trainer's ckpt_path set to "last", the WandbLogger's id set to the id of the run I'd like to resume, and resume set to "must". However, if the checkpoint directory does not exist locally, the run does resume on the wandb side, but no checkpoint is loaded and training simply starts again from scratch.
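As a stopgap for the silent restart, something like the following could be run before calling fit. It is only a minimal sketch: require_last_checkpoint is a hypothetical helper name of my own (not a Lightning API), and it assumes the checkpoints were written with ModelCheckpoint(save_last=True), so that a file named last.ckpt exists in the checkpoint directory.

```python
from pathlib import Path


def require_last_checkpoint(ckpt_dir):
    """Return the path to last.ckpt inside ckpt_dir, or raise.

    Hypothetical helper: it turns the "no checkpoint found, so start
    from scratch" case into a hard error.
    Assumption: ModelCheckpoint(save_last=True) wrote a last.ckpt file.
    """
    path = Path(ckpt_dir) / "last.ckpt"
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected a checkpoint at {path}, refusing to train from scratch."
        )
    return str(path)


# Possible usage (model and trainer are defined elsewhere):
#   trainer.fit(model, ckpt_path=require_last_checkpoint("checkpoints"))
```

Passing the explicit path instead of "last" means a missing checkpoint fails before any training step runs.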
It would have been nice if Lightning had recognized that I wanted to strictly enforce loading the LightningModule from a checkpoint, and had thrown an error rather than starting from scratch. Also, as a possible feature request if this isn't already supported: I would love for Lightning to use wandb's artifact downloader to fetch the checkpoint automatically when I pass the wandb run path as the ckpt_path argument, instead of only looking at the local file path. I'm guessing I would have to implement this myself, but if it already exists somewhere, I would love to know!
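For reference, here is roughly what I imagine implementing myself. This is a sketch under assumptions, not something Lightning provides: fetch_checkpoint and artifact_name are hypothetical names, the local file is assumed to be called last.ckpt, and I am assuming the WandbLogger uploaded the checkpoints as artifacts named model-<run_id> of type "model" with a "latest" alias. The artifact name should be checked in the wandb UI before relying on this.

```python
from pathlib import Path


def artifact_name(entity, project, run_id, alias="latest"):
    # Assumption: checkpoints were logged as "model-<run_id>" artifacts.
    return f"{entity}/{project}/model-{run_id}:{alias}"


def fetch_checkpoint(entity, project, run_id, local_dir="checkpoints"):
    """Return a local checkpoint path, downloading from wandb if needed."""
    local = Path(local_dir) / "last.ckpt"
    if local.is_file():
        return str(local)  # a local checkpoint exists, no download needed

    import wandb  # imported lazily so the local-only path works offline

    artifact = wandb.Api().artifact(
        artifact_name(entity, project, run_id), type="model"
    )
    download_dir = Path(artifact.download(root=str(Path(local_dir) / "wandb_artifact")))

    # Take whatever .ckpt file the artifact contains; fail loudly otherwise.
    candidates = sorted(download_dir.glob("*.ckpt"))
    if not candidates:
        raise FileNotFoundError(f"No .ckpt file found in {download_dir}")
    return str(candidates[0])


# Possible usage (entity, project and run id are placeholders):
#   ckpt = fetch_checkpoint("my-entity", "my-project", "abc123")
#   trainer.fit(model, ckpt_path=ckpt)
```

The local check comes first so that a run resumed on the cluster, where the checkpoint directory already exists, never touches the network.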