I am looking for a way to resume training with pl.Trainer where I only load the optimizer states, not the model weights. Both Trainer.fit(..., ckpt_path=...) and the checkpoint path passed to Trainer() require the full model state in order to resume. I am training with DeepSpeed stage 2, so the optimizer states are sharded and the saved checkpoint files are split accordingly.
I tried using the on_train_start() hook together with trainer.model.load_checkpoint to load only the optimizer states: I pointed it at a checkpoint directory containing just the optimizer-state files (after removing the files with model states), but it didn't work.
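For reference, this is roughly what my attempt looked like (the checkpoint path is illustrative, and the exact flag combination is my guess at how DeepSpeed's `load_checkpoint` arguments should be set for this use case):

```python
# Hypothetical sketch of the attempt: a LightningModule hook that asks the
# DeepSpeed engine to restore optimizer states only. With the DeepSpeed
# strategy, trainer.model is the DeepSpeedEngine, whose load_checkpoint
# accepts flags controlling which states are restored.
def on_train_start(self):
    self.trainer.model.load_checkpoint(
        "path/to/checkpoint_dir",      # dir with only the sharded optim files left
        load_optimizer_states=True,    # the part I actually want restored
        load_lr_scheduler_states=False,
        load_module_strict=False,      # tolerate the missing module states
    )
```

This fails for me because the sharded checkpoint directory no longer contains the model-state files that `load_checkpoint` apparently expects to find.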
What would be a way to achieve this with DeepSpeed sharded checkpoint files in Lightning?