Resume training / load module from DeepSpeed checkpoint

I have a DeepSpeed checkpoint that is split across machines: the shard from rank 0 is on one machine and the shard from rank 1 is on another. Both shards are stored in a folder named “best.ckpt”, but not on the same machine.

How can I configure ModelCheckpoint or DeepSpeedStrategy so that all checkpoint shards are saved on one machine, or how can I resume training from this DeepSpeed checkpoint by passing the folder path?

At the moment all of my progress is lost, as I haven’t been able to find a solution to this problem.

@santurini This is normal.
Lightning can’t transfer checkpoints between machines. I recommend pointing the checkpoint saving path to a shared filesystem (normally your home folder is shared between the machines on a cluster). That will solve the problem of not having all checkpoint shards together. If you can’t set up a shared filesystem between the machines, you can use rsync to gather them:

rsync -a path/to/checkpoints user@machine2:path/to/checkpoints 
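For the shared-filesystem option, here is a minimal sketch; the /shared/... path, the number of nodes, and the ZeRO stage are placeholders you would adapt to your own setup:

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.strategies import DeepSpeedStrategy

# Point dirpath at a location every node can see (e.g. an NFS-mounted home folder),
# so each rank writes its shard into the same directory.
checkpoint_cb = ModelCheckpoint(dirpath="/shared/experiments/run1/checkpoints", save_top_k=1)

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    num_nodes=2,
    strategy=DeepSpeedStrategy(stage=3),
    callbacks=[checkpoint_cb],
)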

Regarding loading: even if the checkpoint shards are split between machines, you should still be able to load them by passing the location to the trainer’s fit method:

trainer.fit(model, ckpt_path="path/to/checkpoint")

Note that for DeepSpeed, the checkpoint is a directory, so just pass the path to that directory (not to the individual files).
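Putting it together, a hedged sketch of resuming from the sharded directory; model and the paths below are placeholders for your own LightningModule and checkpoint location:

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    num_nodes=2,
    strategy=DeepSpeedStrategy(stage=3),
)

# model is your LightningModule instance, constructed elsewhere.
# "best.ckpt" here is the checkpoint *directory* written by DeepSpeed
# (it contains a "checkpoint" subfolder with the per-rank shards).
trainer.fit(model, ckpt_path="path/to/best.ckpt")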

I hope this answer steers you in the right direction.

That’s exactly what I did: I used rsync to simulate a shared filesystem and then passed the path to the checkpoint.

I would also suggest, if possible, modifying the DeepSpeed utils that collate the folder into a single checkpoint: as far as I’ve seen, they only allow saving a unified file, but it would also be nice to simply return the state dict for anyone who only wants to load it on the fly!

Thank you very much @awaelchli

Yes I agree. I think what you mean is this utility from DeepSpeed:
https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#deepspeed.utils.zero_to_fp32.convert_zero_checkpoint_to_fp32_state_dict

We could provide something similar for Lightning that is easy to use.

It is already available here: lightning.pytorch.utilities.deepspeed — PyTorch Lightning 2.0.1 documentation
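For reference, a small sketch of how that utility can be used today; the paths are placeholders, and note that the current function writes the consolidated checkpoint to disk rather than returning the state dict, which is exactly what the suggestion above would add:

import torch
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Collate the sharded DeepSpeed checkpoint directory into a single file.
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir="path/to/best.ckpt",
    output_file="path/to/consolidated.ckpt",
)

# To get the state dict you still have to load the file yourself;
# the Lightning weights should live under the "state_dict" key.
ckpt = torch.load("path/to/consolidated.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]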

I will open a pull request to add the return statement to the function, is that OK?


I don’t see anything immediately wrong with this, so yes, please give it a shot. Please open a feature request issue first with the motivation for this and your suggested change. You can reference this post there as well.

I opened the issue here: Returning unified state dict from DeepSpeed checkpoint folder · Issue #17341 · Lightning-AI/lightning · GitHub

Thank you @awaelchli


Hi @santurini
Can you tell me which DeepSpeed strategy you are using here? I am trying to understand how Lightning + DeepSpeed works when saving and loading models.
Thank you

If I am not mistaken, we cannot resume training with DeepSpeed after using this utility.

In fact, I am struggling to find a way to resume training from a deepspeed checkpoint.

I was using the DeepSpeedStrategy class with stage 3 and CPU offload on 2 GPUs located on different machines. The problem for me was that each machine had its own checkpoint shards, and in order to restore training I had to use rsync so that all the shards were present on both machines.
If you are in my situation, that is the only extra thing you have to do to restore training. On a single GPU, as stated in the docs, you just pass the path to the folder with the sharded checkpoint (with DeepSpeed, “last.ckpt” is actually a folder that contains a “checkpoint” folder) and Lightning will do the rest.
If instead you only want to load the weights, you should use the DeepSpeed utils in Lightning to convert the sharded checkpoint to an fp32 one (a single .ckpt file, in simple words).
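As a hedged illustration of that weights-only path, where MyLightningModule and all paths are placeholders:

from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Convert the sharded "last.ckpt" directory into a single fp32 checkpoint file.
convert_zero_checkpoint_to_fp32_state_dict("path/to/last.ckpt", "path/to/fp32.ckpt")

# The result is a regular single-file checkpoint, so the usual Lightning
# entry point should work when you only need the weights for inference.
model = MyLightningModule.load_from_checkpoint("path/to/fp32.ckpt")
model.eval()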

Hope I’ve helped you!

You are right: to restore directly from the trainer you have to pass the sharded checkpoint.
If you used the utility and deleted the folder, there is nothing you can do to revert back.

@Haris_Jabbar @santurini
There is also a DeepSpeedStrategy(load_full_weights=True/False) argument to load from a single consolidated checkpoint file. Maybe that’s what you were looking for?
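A quick sketch of that option; model and the paths are placeholders, and as far as I know this restores only the model weights, not the full training state:

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DeepSpeedStrategy(stage=3, load_full_weights=True),
)

# ckpt_path points at a single consolidated file (e.g. one produced by the
# conversion utility above), not at the sharded checkpoint directory.
trainer.fit(model, ckpt_path="path/to/consolidated.ckpt")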

Sorry @awaelchli, this is out of context, but for some days now I haven’t been able to access the forum via a browser (Safari) from Italy.
When I try to access it via GitHub, it starts reloading the page over and over.
Do you know why this may happen?

Hey @santurini
Yes, I saw this happen to me too recently and wasn’t sure if it was only me. What solved it for me was deleting the cookies for lightning.ai: first, log out of the account; then in Safari go to Settings → Privacy → Manage Website Data, search for lightning.ai, and click “Remove All”. Then log in again.

Hope this helps.


It worked, you legend! :purple_heart:
