How do I continue training with a DeepSpeed strategy on a different number of devices?

I have trained a model on 2 nodes x 2 GPUs with DeepSpeed and saved the checkpoint. Now we want to continue training on 4 nodes x 4 GPUs. What should I do?

I tried the approach from this link: Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported. · Issue #3810 · microsoft/DeepSpeed · GitHub
zero_to_fp32.py extracts consolidated fp32 weights from ZeRO stage 2 and stage 3 DeepSpeed checkpoints.
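
This is the step I have gotten to so far. A minimal sketch of what I am doing, assuming `CKPT_DIR` is the DeepSpeed checkpoint directory saved by the 2x2 run (the path is a placeholder from my setup):

```python
# ZeRO shards the optimizer state across the old world size, so the first
# step is to consolidate the sharded fp32 weights into a single state dict.
from deepspeed.utils.zero_to_fp32 import (
    convert_zero_checkpoint_to_fp32_state_dict,
    get_fp32_state_dict_from_zero_checkpoint,
)

CKPT_DIR = "path/to/epoch=9-step=1000.ckpt"  # placeholder checkpoint dir

# Option A: write a consolidated checkpoint file to disk
# (equivalent to running the zero_to_fp32.py script bundled with the checkpoint).
convert_zero_checkpoint_to_fp32_state_dict(CKPT_DIR, "consolidated.pt")

# Option B: build the consolidated fp32 state dict in memory instead.
state_dict = get_fp32_state_dict_from_zero_checkpoint(CKPT_DIR)
```

As I understand it, this recovers only the model weights; the ZeRO optimizer partitions themselves cannot be re-sharded for a new world size, which is exactly what the linked issue says.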

The checkpoint also stores the epoch and global step. What should I do next to restore those on the new setup? Thank you very much.
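
For context, here is a stripped-down sketch of what I am planning to try on the 4 nodes x 4 GPUs setup: start a fresh DeepSpeed run and initialize only the weights from the consolidated fp32 checkpoint. The module class, data, and paths are placeholders standing in for my real code:

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

    def train_dataloader(self):
        # dummy data so the sketch is self-contained
        ds = torch.utils.data.TensorDataset(
            torch.randn(64, 32), torch.randint(0, 2, (64,))
        )
        return torch.utils.data.DataLoader(ds, batch_size=8)


model = MyModule()
# Load only the consolidated fp32 weights; the optimizer state is discarded,
# since the ZeRO shards from world size 4 cannot be repartitioned to 16.
state_dict = torch.load("consolidated.pt", map_location="cpu")
model.load_state_dict(state_dict)

trainer = pl.Trainer(num_nodes=4, devices=4, strategy="deepspeed_stage_2")
trainer.fit(model)  # epoch and global_step restart at 0 here
```

With this, the epoch and global step from the old checkpoint are lost. Is there a supported way to restore them, or do I have to track them manually?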