Training on unreliable mixed GPUs across the internet (Basic)¶

Collaborative Training tries to solve the need for top-tier multi-GPU servers by allowing you to train across unreliable machines, such as local machines or even preemptible cloud compute across the internet.

Under the hood, we use Hivemind which provides de-centralized training across the internet.

To use Collaborative Training, you need to first install Hivemind.

pip install hivemind

The HivemindStrategy accumulates gradients from all processes that are collaborating until they reach a target_batch_size. By default, we use the batch size of the first batch to determine what each local machine batch contributes towards the target_batch_size. Once the target_batch_size is reached, an optimizer step is made on all processes.

Warning

When using HivemindStrategy note that you cannot use gradient accumulation (accumulate_grad_batches). This is because Hivemind manages accumulation internally.

import pytorch_lightning as pl
from pytorch_lightning.strategies import HivemindStrategy

trainer = pl.Trainer(strategy=HivemindStrategy(target_batch_size=8192), accelerator="gpu", devices=1)

python train.py
# Other machines can connect running the same command:
# INITIAL_PEERS=... python train.py
# or passing the peers to the strategy:"
# HivemindStrategy(initial_peers=...)"

A helper message is printed once your training begins, which shows you how to start training on other machines using the same code.