The current PL docs only show examples of launching training with the DeepSpeed strategy on a single node. Is there an example of launching multi-node DeepSpeed training, please?
Launching experiments on multiple nodes is no different with DeepSpeed than with any other strategy. You should first check what kind of cluster you are on (SLURM, self-managed, etc.) and then choose the appropriate guide from here.
If you don’t want to set up a cluster, you can also try running in the cloud. There is a multi-node example with DeepSpeed on the Lightning AI page as well.
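To make the point about multi-node not being strategy-specific more concrete, here is a minimal sketch of what the Trainer setup might look like. It assumes Lightning 2.x (`import lightning.pytorch as pl`; older releases use `import pytorch_lightning as pl`), 2 nodes with 8 GPUs each, and DeepSpeed ZeRO stage 2. The helper names `trainer_kwargs`, `expected_world_size` and `build_trainer` are made up for illustration and are not part of Lightning.

```python
def trainer_kwargs(num_nodes: int, gpus_per_node: int, stage: int = 2) -> dict:
    """Build Trainer arguments for multi-node DeepSpeed (hypothetical helper)."""
    if stage not in (1, 2, 3):
        raise ValueError("DeepSpeed ZeRO stage must be 1, 2 or 3")
    return {
        "accelerator": "gpu",
        "devices": gpus_per_node,              # GPUs on EACH node, not the total
        "num_nodes": num_nodes,                # the only multi-node-specific argument
        "strategy": f"deepspeed_stage_{stage}",
    }


def expected_world_size(kwargs: dict) -> int:
    """Total number of processes the cluster launcher has to start."""
    return kwargs["devices"] * kwargs["num_nodes"]


def build_trainer(num_nodes: int = 2, gpus_per_node: int = 8):
    """Create the Trainer; Lightning is imported lazily so the helpers above
    can be checked on a machine without Lightning or GPUs."""
    import lightning.pytorch as pl  # assumption: Lightning 2.x is installed

    return pl.Trainer(**trainer_kwargs(num_nodes, gpus_per_node))


# Assumed example cluster: 2 nodes x 8 GPUs.
kwargs = trainer_kwargs(num_nodes=2, gpus_per_node=8)
print(kwargs["strategy"], expected_world_size(kwargs))
```

The script itself is the same on every node; what changes is how it gets launched. On SLURM, the `--nodes` and `--ntasks-per-node` values in the submission script should match `num_nodes` and `devices`. On a self-managed cluster, as far as I know, you run the same command on each node with `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK` and `WORLD_SIZE` set as described in the cluster guide for your setup.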
Hi Awaelchli. Does your company offer any training programs? The multi-node guide is not clear enough for me, as I lack that kind of engineering experience. We would like to train a large language model on our own cluster because the dataset is sensitive.