Hi, everyone.
The current PL docs only show examples of launching training with the DeepSpeed strategy on a single node. Is there an example of launching multi-node DeepSpeed training, please?
Hi
Launching experiments on multiple nodes isn’t different when using DeepSpeed vs. another strategy. You should first check what kind of cluster you are on (SLURM, self-managed, etc.) and then choose the appropriate guide from here.
If you don’t want to set up a cluster, you can also try running in the cloud. There is a multi-node example with DeepSpeed on the Lightning AI page as well.
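For the self-managed cluster case, here is a rough sketch of what the setup might look like. It assumes 2 nodes with 8 GPUs each, the `deepspeed_stage_2` strategy alias, and the `MASTER_ADDR` / `MASTER_PORT` / `NODE_RANK` / `WORLD_SIZE` environment variables that the general-purpose cluster guide describes. The helper names (`cluster_env`, `trainer_kwargs`, `train`) are made up for illustration and are not part of Lightning; please check the guide for your version before relying on this.

```python
import os


def cluster_env(master_addr, master_port, node_rank, num_nodes, devices_per_node):
    """Build the environment variables for one node of a self-managed cluster.

    Assumption: WORLD_SIZE is the total number of processes (one per GPU).
    """
    return {
        "MASTER_ADDR": master_addr,        # address of the node with NODE_RANK 0
        "MASTER_PORT": str(master_port),   # a free port on that node
        "NODE_RANK": str(node_rank),       # 0 .. num_nodes - 1, different on each node
        "WORLD_SIZE": str(num_nodes * devices_per_node),
    }


def trainer_kwargs(num_nodes, devices_per_node):
    """Trainer arguments; only num_nodes differs from a single-node run."""
    return {
        "accelerator": "gpu",
        "devices": devices_per_node,
        "num_nodes": num_nodes,
        "strategy": "deepspeed_stage_2",
    }


def train(model, datamodule, env, num_nodes, devices_per_node):
    """Run on every node with the same arguments except env["NODE_RANK"]."""
    # Imported here so the helpers above work without Lightning installed.
    import pytorch_lightning as pl

    # Set the variables before the Trainer is created.
    os.environ.update(env)
    trainer = pl.Trainer(**trainer_kwargs(num_nodes, devices_per_node))
    trainer.fit(model, datamodule=datamodule)
```

You would start the same script once per node, changing only the node rank (0 on the first machine, 1 on the second). In practice the variables are often exported in the shell before calling `python train.py` instead of being set inside the script. On SLURM these variables are normally not set by hand, since the scheduler provides the equivalent information.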
Hi Awaelchli. Does your company offer any training programs? The multi-node guide is not clear enough for me, as I lack that kind of engineering experience. We would like to train a large language model on our own cluster because the dataset is sensitive.
Hi, have you found a way to train on multiple nodes? I’m facing the same problem now. I hope you can help. Much appreciated.