My code looks like this:
from pytorch_lightning.callbacks import ModelCheckpoint
model_ckp = ModelCheckpoint(
    save_top_k=1,
    monitor='loss',
    mode='min',
    save_on_train_epoch_end=False,
    dirpath=MODEL_DIR_PATH,
    filename=MODEL_SAVE_NAME
)
pl_trainer = pl.Trainer(
    accelerator='gpu',
    precision=16,
    devices=GPU_USAGE_NUM,
    max_epochs=total_epoch,
    num_sanity_val_steps=0,
    strategy="deepspeed_stage_2_offload",
    accumulate_grad_batches=1,
    callbacks=[
        model_ckp,
        early_stop
    ]
)
Then it saves the .ckpt as a directory. In that case, I cannot load from the checkpoint:
Pretrained_Albert_MLP_pl.load_from_checkpoint("/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt")
I do not know what I did wrong.
Thanks a lot.
DeepSpeed saves a directory because, in stage 2, the optimizer states are sharded across processes/nodes; in stage 3 the same happens to the parameters. To avoid memory peaks and OOMs when gathering the states, each process saves its own shard of the checkpoint. This means you end up with as many files as there are processes, collected in a directory that takes the place of the single checkpoint file.
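For illustration, the saved ".ckpt" path is really a directory. The exact file names below are illustrative and vary with the DeepSpeed version and the number of processes:

import os

# The ".ckpt" path from the question is a directory, not a single file.
ckpt_dir = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt"
print(os.path.isdir(ckpt_dir))  # True for a DeepSpeed sharded checkpoint

# Typical contents, with one optimizer-state shard per process, e.g.:
#   latest
#   zero_to_fp32.py
#   checkpoint/mp_rank_00_model_states.pt
#   checkpoint/zero_pp_rank_0_mp_rank_00_optim_states.pt
#   checkpoint/zero_pp_rank_1_mp_rank_00_optim_states.pt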
LightningModule.load_from_checkpoint does NOT support loading sharded checkpoints. You can convert the checkpoint directory to a regular single-file checkpoint using the DeepSpeed utility described here: Train 1 trillion+ parameter models — PyTorch Lightning 1.8.3.post1 documentation
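A minimal sketch of that conversion, using the utility Lightning ships and assuming the checkpoint path from the question (the output file name here is hypothetical):

from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Sharded checkpoint directory produced by the deepspeed strategy (from the question).
ckpt_dir = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt"
# Hypothetical path for the consolidated single-file checkpoint.
output_file = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1_fp32.ckpt"

# Gathers the sharded ZeRO states into a single fp32 checkpoint file.
convert_zero_checkpoint_to_fp32_state_dict(ckpt_dir, output_file)

# After conversion, the regular loading path works:
model = Pretrained_Albert_MLP_pl.load_from_checkpoint(output_file)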
Hi, I faced this issue recently, with a slightly different premise: I wanted to validate the data after training, so I did this:
ckpt_path = trainer.checkpoint_callback.best_model_path
trainer.validate(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
I would like to ask whether loading the model checkpoint into the Trainer also ensures that the DeepSpeed stage 2/3 parameters get updated?
What do you mean by “DeepSpeed stage 2/3 parameters”? The checkpoint contains the model weights and the optimizer states, and they get loaded once fit/validate/test starts, after process creation. That’s the basic expectation. Or do you mean some other special parameters? The DeepSpeed configuration itself needs to be managed by the user.
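For example (a minimal sketch, reusing pl_trainer, model, and datamodule from earlier in this thread, assumed to be in scope), you can resume training the same way: pass the sharded checkpoint directory as ckpt_path, and the deepspeed strategy restores the weights and optimizer states once the processes are up:

# Passing the sharded checkpoint directory as ckpt_path lets the deepspeed
# strategy restore model weights and optimizer states at the start of fit.
pl_trainer.fit(
    model,
    datamodule=datamodule,
    ckpt_path="/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt",
)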