I am looking for a way to resume training with pl.Trainer where I only load the optimizer states, not the model weights. Both Trainer.fit(..., ckpt_path=...) and the checkpoint path passed to Trainer() require the full model state in order to resume. I am training with DeepSpeed stage 2, so the optimizer states are sharded and the saved checkpoint files are split accordingly.
I tried using the on_train_start() hook together with trainer.model.load_checkpoint to load only the optimizer states: I pointed it at a checkpoint directory containing just the optimizer-state files (after removing the files with model states), but it didn't work.
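For reference, this is roughly what my attempt looked like (the checkpoint path is illustrative, and the exact flag combination is my guess at how DeepSpeed's `load_checkpoint` arguments should be set for this use case):

```python
# Hypothetical sketch of the attempt: a LightningModule hook that asks the
# DeepSpeed engine to restore optimizer states only. With the DeepSpeed
# strategy, trainer.model is the DeepSpeedEngine, whose load_checkpoint
# accepts flags controlling which states are restored.
def on_train_start(self):
    self.trainer.model.load_checkpoint(
        "path/to/checkpoint_dir",      # dir with only the sharded optim files left
        load_optimizer_states=True,    # the part I actually want restored
        load_lr_scheduler_states=False,
        load_module_strict=False,      # tolerate the missing module states
    )
```

This fails for me because the sharded checkpoint directory no longer contains the model-state files that `load_checkpoint` apparently expects to find.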
What would be a way to achieve this with DeepSpeed sharded checkpoint files in Lightning?