Why does PyTorch Lightning save my checkpoint as a directory?

My code looks like this:

from pytorch_lightning.callbacks import ModelCheckpoint

model_ckp = ModelCheckpoint(
    save_top_k=1,
    monitor='loss',
    mode='min',
    save_on_train_epoch_end=False,
    dirpath=MODEL_DIR_PATH,
    filename=MODEL_SAVE_NAME,
)

pl_trainer = pl.Trainer(
    accelerator='gpu',
    precision=16,
    devices=GPU_USAGE_NUM,
    max_epochs=total_epoch,
    num_sanity_val_steps=0,
    strategy="deepspeed_stage_2_offload",
    accumulate_grad_batches=1,
    callbacks=[
        model_ckp,
        early_stop,
    ],
)

Then it saves the .ckpt as a directory. In that case, I cannot load from the checkpoint:

Pretrained_Albert_MLP_pl.load_from_checkpoint("/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt")

I do not know what I did wrong.

Thanks a lot. :rofl:

DeepSpeed saves a directory because, in stage 2, the optimizer states are sharded across processes/nodes; in stage 3, the same happens to the parameters. To avoid memory peaks and OOM when gathering these states, each process saves its own shard of the checkpoint. This means you end up with as many files as there are processes, collected in a directory at the checkpoint path.

LightningModule.load_from_checkpoint does NOT support loading sharded checkpoints. You can convert the checkpoint directory to a regular single-file checkpoint using the DeepSpeed utility described here: Train 1 trillion+ parameter models — PyTorch Lightning 1.8.3.post1 documentation
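For reference, here is a minimal sketch of that conversion using Lightning's helper around DeepSpeed's zero_to_fp32 logic. The checkpoint path is taken from the original post; the output file name is a hypothetical choice:

from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# The sharded DeepSpeed checkpoint: a ".ckpt" that is actually a directory (path from the post above).
ckpt_dir = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt"

# Hypothetical target path for the consolidated single-file checkpoint.
output_file = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1_fp32.ckpt"

# Gather all shards into one fp32 state dict and write it out as a regular checkpoint file.
convert_zero_checkpoint_to_fp32_state_dict(ckpt_dir, output_file)

# The converted file can then be loaded the usual way.
model = Pretrained_Albert_MLP_pl.load_from_checkpoint(output_file)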

Hi, I ran into this issue recently in a slightly different situation: I wanted to run validation after training, so I did this:

# Load the best checkpoint tracked by the ModelCheckpoint callback and validate with it.
ckpt_path = trainer.checkpoint_callback.best_model_path
trainer.validate(model=model, datamodule=datamodule, ckpt_path=ckpt_path)

I would like to ask whether loading the model checkpoint into the Trainer also ensures that the DeepSpeed stage 2/3 parameters get updated.

What do you mean by “DeepSpeed stage 2/3 parameters”? The checkpoint contains the model weights and optimizer states, and they get loaded once fit/validate/test starts, after process creation. That’s the basic expectation. Maybe you mean some other special parameters? The DeepSpeed configuration itself needs to be managed by the user.
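To illustrate that expectation, here is a minimal sketch, assuming the model and datamodule objects from earlier in the thread and the same DeepSpeed strategy used for training. When the sharded checkpoint directory is passed as ckpt_path, each rank restores its own shard of the weights and optimizer states before training resumes:

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator='gpu',
    devices=GPU_USAGE_NUM,
    strategy="deepspeed_stage_2_offload",  # must match the strategy used when saving
)

# Passing the sharded ".ckpt" directory restores weights and optimizer states per rank
# before fitting continues; no manual consolidation is needed when resuming this way.
trainer.fit(model, datamodule=datamodule, ckpt_path="/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt")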