Why does PyTorch Lightning save my checkpoint as a directory?

My code looks like this:

from pytorch_lightning.callbacks import ModelCheckpoint

model_ckp = ModelCheckpoint(
    save_top_k=1,
    monitor='loss',
    mode='min',
    save_on_train_epoch_end=False,
    dirpath=MODEL_DIR_PATH,
    filename=MODEL_SAVE_NAME,
)

pl_trainer = pl.Trainer(
    accelerator='gpu',
    precision=16,
    devices=GPU_USAGE_NUM,
    max_epochs=total_epoch,
    num_sanity_val_steps=0,
    strategy="deepspeed_stage_2_offload",
    accumulate_grad_batches=1,
    callbacks=[
        model_ckp,
        early_stop,
    ],
)

Then it saves the .ckpt as a directory. In that case, I cannot load from the checkpoint:

Pretrained_Albert_MLP_pl.load_from_checkpoint("/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt")

I do not know what I did wrong.

Thanks a lot. :rofl:

DeepSpeed saves a directory because, in stage 2, the optimizer states are sharded across processes/nodes; in stage 3, the same happens to the parameters. To avoid memory peaks and OOM when gathering these states, each process saves its own shard of the checkpoint. This means you end up with as many files as there are processes, collected in a directory at the checkpoint path.

LightningModule.load_from_checkpoint does NOT support loading sharded checkpoints. You can convert the checkpoint directory to a regular single-file checkpoint using the DeepSpeed utility described here: Train 1 trillion+ parameter models — PyTorch Lightning 1.8.3.post1 documentation
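For reference, here is a minimal sketch of that conversion using Lightning's helper around DeepSpeed's zero_to_fp32 logic. The checkpoint path is taken from the original post; the output file name is a hypothetical choice:

from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# The sharded DeepSpeed checkpoint: a ".ckpt" that is actually a directory (path from the post above).
ckpt_dir = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt"

# Hypothetical target path for the consolidated single-file checkpoint.
output_file = "/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1_fp32.ckpt"

# Gather all shards into one fp32 state dict and write it out as a regular checkpoint file.
convert_zero_checkpoint_to_fp32_state_dict(ckpt_dir, output_file)

# The converted file can then be loaded the usual way.
model = Pretrained_Albert_MLP_pl.load_from_checkpoint(output_file)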

Hi, I ran into this issue recently in a slightly different situation: I wanted to run validation after training, so I did this:

# Load the best checkpoint tracked by the ModelCheckpoint callback and validate with it.
ckpt_path = trainer.checkpoint_callback.best_model_path
trainer.validate(model=model, datamodule=datamodule, ckpt_path=ckpt_path)

I would like to ask whether loading the model checkpoint into the Trainer also ensures that the DeepSpeed stage 2/3 parameters get updated.

What do you mean by “DeepSpeed stage 2/3 parameters”? The checkpoint contains the model weights and optimizer states, and they get loaded once fit/validate/test starts, after process creation. That’s the basic expectation. Maybe you mean some other special parameters? The DeepSpeed configuration itself needs to be managed by the user.
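To illustrate that expectation, here is a minimal sketch, assuming the model and datamodule objects from earlier in the thread and the same DeepSpeed strategy used for training. When the sharded checkpoint directory is passed as ckpt_path, each rank restores its own shard of the weights and optimizer states before training resumes:

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator='gpu',
    devices=GPU_USAGE_NUM,
    strategy="deepspeed_stage_2_offload",  # must match the strategy used when saving
)

# Passing the sharded ".ckpt" directory restores weights and optimizer states per rank
# before fitting continues; no manual consolidation is needed when resuming this way.
trainer.fit(model, datamodule=datamodule, ckpt_path="/data/xc_data/green/nlp/model_save/Pretrained_AlbertMLP_Multicf_pl_20221127-v1.ckpt")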