Resume training / load module from DeepSpeed checkpoint

I have a DeepSpeed checkpoint that is split across machines: the shard from rank 0 is on one machine and the shard from rank 1 is on another. Both shards are stored in a folder named “best.ckpt”, but not on the same machine.

How can I configure ModelCheckpoint or DeepSpeedStrategy so that all checkpoint shards are saved on one machine, or how can I resume training from this DeepSpeed checkpoint by passing the folder path?

At the moment all of my progress is lost, as I haven’t been able to find a solution to this problem.

@santurini This is normal.
Lightning can’t transfer checkpoints between machines. I recommend pointing the checkpoint saving path to a shared filesystem (normally your home folder is shared between the machines on a cluster). That will solve the problem of not having all checkpoint shards together. If you can’t set up a shared filesystem between the machines, you can use rsync to gather them:

rsync -a path/to/checkpoints user@machine2:path/to/checkpoints 
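For the shared-filesystem option, here is a minimal sketch; the /shared/... path, the number of nodes, and the ZeRO stage are placeholders you would adapt to your own setup:

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.strategies import DeepSpeedStrategy

# Point dirpath at a location every node can see (e.g. an NFS-mounted home folder),
# so each rank writes its shard into the same directory.
checkpoint_cb = ModelCheckpoint(dirpath="/shared/experiments/run1/checkpoints", save_top_k=1)

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    num_nodes=2,
    strategy=DeepSpeedStrategy(stage=3),
    callbacks=[checkpoint_cb],
)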

Regarding loading: even if the checkpoint shards are split between machines, you should still be able to load them by passing the location to the trainer’s fit method:

trainer.fit(model, ckpt_path="path/to/checkpoint")

Note that for DeepSpeed, the checkpoint is a directory, so just pass the path to that directory (not to the individual files).
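Putting it together, a hedged sketch of resuming from the sharded directory; model and the paths below are placeholders for your own LightningModule and checkpoint location:

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    num_nodes=2,
    strategy=DeepSpeedStrategy(stage=3),
)

# model is your LightningModule instance, constructed elsewhere.
# "best.ckpt" here is the checkpoint *directory* written by DeepSpeed
# (it contains a "checkpoint" subfolder with the per-rank shards).
trainer.fit(model, ckpt_path="path/to/best.ckpt")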

I hope this answer steers you in the right direction.

That’s exactly what I did: I used rsync to simulate a shared filesystem and then passed the path to the checkpoint.

I would also suggest, if possible, modifying the DeepSpeed utils that collate the folder into a single checkpoint: as far as I’ve seen, they only allow saving a unified file, but it would also be nice to simply return the state dict for anyone who only wants to load it on the fly!

Thank you very much @awaelchli

Yes I agree. I think what you mean is this utility from DeepSpeed:
https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#deepspeed.utils.zero_to_fp32.convert_zero_checkpoint_to_fp32_state_dict

We could provide something similar for Lightning that is easy to use.

It is already available here: lightning.pytorch.utilities.deepspeed — PyTorch Lightning 2.0.1 documentation
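For reference, a small sketch of how that utility can be used today; the paths are placeholders, and note that the current function writes the consolidated checkpoint to disk rather than returning the state dict, which is exactly what the suggestion above would add:

import torch
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Collate the sharded DeepSpeed checkpoint directory into a single file.
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir="path/to/best.ckpt",
    output_file="path/to/consolidated.ckpt",
)

# To get the state dict you still have to load the file yourself;
# the Lightning weights should live under the "state_dict" key.
ckpt = torch.load("path/to/consolidated.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]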

I will open a pull request to add the return statement to the function, is that OK?


I don’t see anything immediately wrong with this, so yes, please give it a shot. Please open a feature request issue first with the motivation for this and your suggested change. You can reference this post there as well.

I opened the issue here: Returning unified state dict from DeepSpeed checkpoint folder · Issue #17341 · Lightning-AI/lightning · GitHub

Thank you @awaelchli


Hi @santurini
Can you tell me which DeepSpeed strategy you are using here? I am trying to understand how Lightning + DeepSpeed works when saving and loading models.
Thank you

If I am not mistaken, we cannot resume training with DeepSpeed after using this utility.

In fact, I am struggling to find a way to resume training from a deepspeed checkpoint.

I was using the DeepSpeedStrategy class with stage 3 and CPU offload on 2 GPUs located on different machines. The problem for me was that each machine had its own checkpoint shards, and in order to restore training I had to use rsync so that all the shards were present on both machines.
If you are in my situation, that is the only extra thing you have to do to restore training. On a single GPU, as stated in the docs, you just pass the path to the folder with the sharded checkpoint (with DeepSpeed, “last.ckpt” is actually a folder that contains a “checkpoint” folder) and Lightning will do the rest.
If instead you only want to load the weights, you should use the DeepSpeed utils in Lightning to convert the sharded checkpoint to an fp32 one (a single .ckpt file, in simple words).
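As a hedged illustration of that weights-only path, where MyLightningModule and all paths are placeholders:

from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Convert the sharded "last.ckpt" directory into a single fp32 checkpoint file.
convert_zero_checkpoint_to_fp32_state_dict("path/to/last.ckpt", "path/to/fp32.ckpt")

# The result is a regular single-file checkpoint, so the usual Lightning
# entry point should work when you only need the weights for inference.
model = MyLightningModule.load_from_checkpoint("path/to/fp32.ckpt")
model.eval()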

Hope I’ve helped you!

You are right: to restore directly from the trainer you have to pass the sharded checkpoint.
If you used the utility and deleted the folder, there is nothing you can do to revert back.

@Haris_Jabbar @santurini
There is also a DeepSpeedStrategy(load_full_weights=True/False) argument to load from a single consolidated checkpoint file. Maybe that’s what you were looking for?
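A quick sketch of that option; model and the paths are placeholders, and as far as I know this restores only the model weights, not the full training state:

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DeepSpeedStrategy(stage=3, load_full_weights=True),
)

# ckpt_path points at a single consolidated file (e.g. one produced by the
# conversion utility above), not at the sharded checkpoint directory.
trainer.fit(model, ckpt_path="path/to/consolidated.ckpt")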

Sorry @awaelchli, this is out of context, but for some days now I haven’t been able to access the forum via a browser (Safari) from Italy.
When I try to access it via GitHub, it starts reloading the page over and over.
Do you know why this may happen?

Hey @santurini
Yes, I saw this happen to me too recently and wasn’t sure if it was only me. What solved it for me was deleting the cookies for lightning.ai: first, log out of the account; then in Safari go to Settings → Privacy → Manage Website Data, search for lightning.ai, and click “Remove All”. Then log in again.

Hope this helps.


It worked, you legend! :purple_heart:
