class pytorch_lightning.strategies.HivemindStrategy(target_batch_size, run_id='lightning_run', batch_size=None, delay_state_averaging=False, delay_optimizer_step=None, delay_grad_averaging=False, offload_optimizer=None, reuse_grad_buffers=False, scheduler_fn=None, matchmaking_time=5.0, averaging_timeout=30.0, verbose=False, averager_opts=None, host_maddrs=None, initial_peers=None, **optimizer_kwargs)[source]

Bases: pytorch_lightning.strategies.strategy.Strategy

Enables training with the Hivemind library, which supports collaborative training across the internet on unreliable machines. For more information, refer to the docs.


HivemindStrategy is experimental and subject to change.
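A minimal usage sketch, assuming pytorch_lightning and hivemind are installed. The helper name make_trainer and the value 8192 are illustrative, not part of the library. The imports are deferred into the function so the snippet can be loaded without either package present.

```python
def make_trainer(target_batch_size: int = 8192, **strategy_kwargs):
    """Build a Trainer that uses HivemindStrategy (hypothetical helper)."""
    # Deferred imports: only needed when the trainer is actually built.
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    # target_batch_size is the only required argument; everything else
    # (run_id, initial_peers, ...) is forwarded unchanged.
    strategy = HivemindStrategy(target_batch_size=target_batch_size, **strategy_kwargs)

    # One device per process is assumed here; peers join as separate processes.
    return pl.Trainer(strategy=strategy, accelerator="auto", devices=1)
```

A second machine would typically call the same helper with initial_peers set to the addresses reported by the first process.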

  • target_batch_size (int) – When training, the batch size to accumulate to before running a step. The larger this batch size, the more work can be done asynchronously without communication.

  • run_id (str) – A unique identifier of this training run, used as a common prefix for all DHT keys. See

  • batch_size (Optional[int]) – The local batch size per process. If not provided, it is inferred lazily from the first batch of data passed in during training. Note that this should not change throughout training.

  • delay_state_averaging (bool) – If enabled, average parameters and extra tensors in a background thread; if set to False (the default in this signature), average parameters synchronously within the corresponding hivemind.Optimizer.step() call.

  • delay_optimizer_step (Optional[bool]) – Run the optimizer in the background and apply the results in a future .step() call. Requires offload_optimizer.

  • delay_grad_averaging (bool) – Average gradients in the background. Requires offload_optimizer and delay_optimizer_step.

  • offload_optimizer (Optional[bool]) – Offload the optimizer to host memory, saving GPU memory for parameters and gradients.

  • reuse_grad_buffers (bool) – Use the model’s gradient buffers (params.grad) for gradient accumulation, which is more memory efficient. Lightning will automatically disable zero_grad in the LightningModule.

  • scheduler_fn (Optional[Callable]) – callable(optimizer) -> PyTorch LRScheduler, or a pre-initialized PyTorch scheduler. When using offload_optimizer, delay_optimizer_step, or delay_state_averaging, scheduler_fn must be passed to the HivemindStrategy, because the optimizer is re-created and the scheduler needs to be re-created with it.

  • matchmaking_time (float) – When looking for a group, wait for peers to join for up to this many seconds. Increase if you see “averaged gradients with N peers” where N is below 0.9x the actual number of peers on >=25% of epochs. When training on a low-latency network, decreasing matchmaking_time allows training with smaller batch sizes.

  • averaging_timeout (float) – If an averaging step hangs for this long, it will be cancelled automatically. Increase averaging_timeout if you see “Proceeding with local gradients” at least 25% of the time. Do not set this timeout too high, as it may cause your optimizer to hang after some types of network errors.

  • verbose (bool) – Report internal Hivemind events such as accumulating gradients and running background tasks.

  • averager_opts (Optional[Dict]) – Additional keyword arguments forwarded to both GradientAverager and TrainingStateAverager.

  • host_maddrs (Optional[List]) – List of multi-addrs to create visible peers for other processes.

  • initial_peers (Union[str, List, None]) – If connecting to a running process, a list of initial peers needs to be passed in. This can also be set via the env variable INITIAL_PEERS.

  • **optimizer_kwargs (Any) – Additional keyword arguments passed to the hivemind.Optimizer class.
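The dependencies between the flags above can be sketched as a small pre-flight check. This is an illustration only, not Lightning's own validation: check_flags and local_batches_per_step are hypothetical helpers, the scheduler_fn rule is read literally from the parameter description, and the batch arithmetic assumes that every peer uses the same batch_size and that target_batch_size counts samples accumulated across all peers.

```python
import math
from typing import Callable, List, Optional


def check_flags(
    offload_optimizer: Optional[bool] = None,
    delay_optimizer_step: Optional[bool] = None,
    delay_grad_averaging: bool = False,
    delay_state_averaging: bool = False,
    scheduler_fn: Optional[Callable] = None,
) -> List[str]:
    """Return human-readable problems with a flag combination (hypothetical helper)."""
    problems = []
    # delay_optimizer_step requires offload_optimizer.
    if delay_optimizer_step and not offload_optimizer:
        problems.append("delay_optimizer_step requires offload_optimizer")
    # delay_grad_averaging requires both offload_optimizer and delay_optimizer_step.
    if delay_grad_averaging and not (offload_optimizer and delay_optimizer_step):
        problems.append(
            "delay_grad_averaging requires offload_optimizer and delay_optimizer_step"
        )
    # The optimizer is re-created under these flags, so a scheduler factory is needed.
    if (offload_optimizer or delay_optimizer_step or delay_state_averaging) and scheduler_fn is None:
        problems.append("scheduler_fn should be passed when the optimizer is re-created")
    return problems


def local_batches_per_step(target_batch_size: int, batch_size: int, num_peers: int) -> int:
    """Rough number of local batches each peer processes per optimizer step.

    Assumes equal batch sizes and equal throughput on all peers.
    """
    samples_per_round = batch_size * num_peers
    return math.ceil(target_batch_size / samples_per_round)


if __name__ == "__main__":
    for problem in check_flags(delay_grad_averaging=True, offload_optimizer=True):
        print("warning:", problem)
    # 8192 / (32 * 4) = 64 local batches per peer between optimizer steps.
    print(local_batches_per_step(8192, 32, 4))
```

Under these assumptions, adding peers or raising batch_size shortens the time between optimizer steps, while raising target_batch_size lengthens it and leaves more room for asynchronous work.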

all_gather(tensor, group=None, sync_grads=False)[source]

Perform an all_gather on all processes.

  • tensor (Tensor) – the tensor to all_gather

  • group (Optional[Any]) – the process group to gather results from

  • sync_grads (bool) – flag that allows users to synchronize gradients for all_gather op

Return type:

Tensor

barrier(*args, **kwargs)[source]

Synchronizes all processes, blocking each one until the whole group enters this function.


  • name – an optional name to pass into barrier.

Return type:

None

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type:

TypeVar(TBroadcast)


model_to_device()[source]

Moves the model to the correct device.

Return type:

None

on_train_batch_start(batch, batch_idx, dataloader_idx=0)[source]

Called in the training loop before anything happens for that batch.

Return type:

None

reduce(tensor, *args, **kwargs)[source]

Reduces the given tensor (e.g. across GPUs/processes).

  • tensor (Union[Any, Tensor]) – the tensor to sync and reduce

  • group – the process group to reduce

  • reduce_op – the reduction operation. Defaults to ‘mean’. Can also be the string ‘sum’ or a ReduceOp.

Return type:

Union[Any, Tensor]


setup(trainer)[source]

Sets up plugins for the trainer fit and creates optimizers.


  • trainer (Trainer) – the trainer instance

Return type:

None


teardown()[source]

This method is called to tear down the training process.

It is the right place to release memory and free other resources.

Return type:

None

property is_global_zero: bool

Whether the current process is the rank zero process not only on the local node, but for all nodes.

property root_device: torch.device

Returns the root device.