Why does PyTorch Lightning save my checkpoint as a directory?

My code looks like this:

from pytorch_lightning.callbacks import ModelCheckpoint
model_ckp = ModelCheckpoint()  # checkpoint arguments omitted in the original post

pl_trainer = pl.Trainer(
    num_sanity_val_steps=0,
    strategy="deepspeed_stage_2_offload",
    accumulate_grad_batches=1,
    callbacks=[model_ckp],
)


Then it saves the .ckpt as a directory. In this case, I cannot load from the checkpoint.


I do not know where I went wrong.

Thanks a lot. :rofl:

DeepSpeed saves a directory because in stage 2, optimizer states are sharded across processes/nodes. The same happens to parameters in stage 3. In order to avoid memory peaks and OOM when gathering the states, each process saves a shard of the checkpoint. This means you end up with as many files as there are processes (in a directory with the checkpoint file).
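The shard-per-process layout described above can be illustrated with a toy sketch (this is NOT real DeepSpeed code; the file names and partitioning scheme are made up for illustration):

```python
import os
import tempfile

def save_sharded(optim_state, world_size, ckpt_dir):
    # Toy illustration: each rank persists only its own partition of the
    # optimizer state, so "last.ckpt" ends up being a directory containing
    # one shard file per process rather than a single file.
    os.makedirs(ckpt_dir, exist_ok=True)
    for rank in range(world_size):
        shard = optim_state[rank::world_size]  # this rank's partition
        path = os.path.join(ckpt_dir, f"optim_states_rank{rank}.txt")
        with open(path, "w") as f:
            f.write(",".join(map(str, shard)))

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "last.ckpt")  # note: a *directory*, not a file
    save_sharded(list(range(8)), world_size=4, ckpt_dir=ckpt)
    files = sorted(os.listdir(ckpt))

print(files)  # one shard file per rank
```

No single file holds the full state, which is why a plain `torch.load` of "last.ckpt" fails until the shards are merged.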

LightningModule.load_from_checkpoint does NOT support loading sharded checkpoints. You can convert the checkpoint directory to a regular file using the deepspeed utility here: Train 1 trillion+ parameter models — PyTorch Lightning 1.8.3.post1 documentation
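The conversion mentioned above can be sketched like this, assuming PyTorch Lightning with DeepSpeed support is installed (the utility `convert_zero_checkpoint_to_fp32_state_dict` is the one from the linked docs; the wrapper function and paths here are just for illustration):

```python
def merge_sharded_checkpoint(sharded_ckpt_dir: str, single_file: str) -> None:
    # Merge a DeepSpeed sharded checkpoint directory into one .ckpt file
    # that LightningModule.load_from_checkpoint can read.
    # Import inside the function so merely defining it does not require
    # deepspeed to be installed.
    from pytorch_lightning.utilities.deepspeed import (
        convert_zero_checkpoint_to_fp32_state_dict,
    )
    convert_zero_checkpoint_to_fp32_state_dict(sharded_ckpt_dir, single_file)

# usage (illustrative paths):
# merge_sharded_checkpoint("lightning_logs/version_0/checkpoints/last.ckpt",
#                          "merged.ckpt")
```

After merging, `MyModule.load_from_checkpoint("merged.ckpt")` works as it would for a non-sharded checkpoint.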

Hi, I faced this issue recently, in a slightly different situation. I wanted to validate the model after training, so I did this:

ckpt_path = trainer.checkpoint_callback.best_model_path
trainer.validate(model=model, datamodule=datamodule, ckpt_path=ckpt_path)

I would like to ask whether loading the model checkpoint into the Trainer also ensures that the DeepSpeed stage 2/3 parameters get updated.

What do you mean by “Deepspeed stage 2/3 parameters”? The checkpoint contains the model weights and optimizer states, and they get loaded once fit/validate/test starts after process creation. That’s the basic expectation. Maybe there are other special parameters that you mean? The configuration for deepspeed needs to be managed by the user.