Any example to launch multiple nodes distributed training with deepspeed strategy?

ssyang1999 · March 8, 2023, 12:34pm

Hi, everyone.
The current PL docs only show examples of launching training with the deepspeed strategy on a single node. Is there an example of launching multi-node deepspeed training?

awaelchli · March 9, 2023, 12:10am

Hi
Launching experiments on multiple nodes is no different with deepspeed than with any other strategy. You should first check what kind of cluster you are on (SLURM, self-managed, etc.) and then choose the appropriate guide from here.

If you don’t want to set up a cluster, you can also try running in the cloud. There is a multi-node example with deepspeed on this page as well: Lightning AI.
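
To make the "no different from another strategy" point concrete, here is a minimal sketch for a self-managed cluster. It assumes Lightning 2.x (`import lightning as L`), the built-in `deepspeed_stage_N` strategy names, and the `MASTER_ADDR` / `MASTER_PORT` / `WORLD_SIZE` / `NODE_RANK` environment variables that the general cluster guide describes. The helper names (`node_env`, `trainer_kwargs`, `fit`), the address and the port are made up for illustration, and the model and datamodule are whatever you already use on a single node.

```python
def node_env(node_rank, num_nodes, gpus_per_node, master_addr, master_port=29500):
    """Environment variables to set on one node before starting the script.

    Hypothetical helper: the variable names follow the general cluster guide,
    the address and port values are placeholders you must choose yourself.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    return {
        "MASTER_ADDR": master_addr,        # address of the node with rank 0
        "MASTER_PORT": str(master_port),   # a free port on the rank 0 node
        "WORLD_SIZE": str(num_nodes * gpus_per_node),  # total process count
        "NODE_RANK": str(node_rank),       # 0 .. num_nodes - 1
    }


def trainer_kwargs(num_nodes, gpus_per_node, stage=2):
    """Trainer arguments; only num_nodes changes compared to a single node."""
    if stage not in (1, 2, 3):
        raise ValueError("DeepSpeed ZeRO stage must be 1, 2 or 3")
    return {
        "accelerator": "gpu",
        "devices": gpus_per_node,          # GPUs per node, not the total
        "num_nodes": num_nodes,
        "strategy": f"deepspeed_stage_{stage}",
    }


def fit(model, datamodule, num_nodes, gpus_per_node):
    """Run the same script on every node; Lightning reads the env variables."""
    import lightning as L  # imported lazily so the helpers work without it

    trainer = L.Trainer(**trainer_kwargs(num_nodes, gpus_per_node))
    trainer.fit(model, datamodule=datamodule)
```

On each node you would export the values from `node_env(...)` for that node's rank and then start the identical training script that calls `fit(...)`. On SLURM the scheduler provides this information instead, so there you would normally only make sure the job's node and task counts match `num_nodes` and `devices`.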

richardsunvoyager · June 28, 2023, 2:55am

Hi Awaelchli. Does your company offer any training programs? The multi-node guide is not clear enough for me, as I lack that kind of engineering experience. We would like to train a large language model on our own cluster because the dataset is sensitive.

kuiwang · June 20, 2024, 9:17am

Hi, have you found a way to train on multiple nodes? I’m facing the same problem now. I hope you can help. Much appreciated.
