- class pytorch_lightning.strategies.HivemindStrategy(target_batch_size, run_id='lightning_run', batch_size=None, delay_state_averaging=False, delay_optimizer_step=None, delay_grad_averaging=False, offload_optimizer=None, reuse_grad_buffers=False, scheduler_fn=None, matchmaking_time=5.0, averaging_timeout=30.0, verbose=False, averager_opts=None, host_maddrs=None, initial_peers=None, **optimizer_kwargs)¶
Provides capabilities to train collaboratively across the internet with unreliable machines, using the Hivemind library. For more information, refer to the docs.
HivemindStrategy is experimental and subject to change.
batch_size (Optional[int]) – The local batch size per process. If not provided, it is inferred from the first batch of data passed in during training (lazily). Note that this should not change throughout training.
delay_state_averaging (bool) – If enabled (default), average parameters and extra tensors in a background thread; if set to False, average parameters synchronously within the corresponding optimizer step.
reuse_grad_buffers (bool) – Use the model’s gradient buffers (params.grad) for gradient accumulation, which is more memory efficient. Lightning will automatically disable zero_grad in the LightningModule.
scheduler_fn (Optional[Callable]) – callable(optimizer) -> PyTorch LRScheduler, or a pre-initialized PyTorch scheduler. When using offload_optimizer/delay_optimizer_step/delay_state_averaging, scheduler_fn is required to be passed to the HivemindStrategy, because the optimizer is re-created and the scheduler needs to be re-created as well; see the sketch after this parameter list.
matchmaking_time (float) – When looking for a group, wait for peers to join for up to this many seconds. Increase it if you see “averaged gradients with N peers” where N is below 0.9x on >=25% of epochs. When training over a low-latency network, decreasing matchmaking_time allows training with smaller batch sizes.
averaging_timeout (float) – If an averaging step hangs for this long, it will be cancelled automatically. Increase averaging_timeout if you see “Proceeding with local gradients” at least 25% of the time. Do not set this timeout too high, as it may cause your optimizer to hang after some types of network errors.
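A minimal sketch of wiring the strategy into a Trainer. The target_batch_size of 8192 and the ExponentialLR settings are illustrative assumptions, not recommendations:

```python
from functools import partial

import torch
import pytorch_lightning as pl
from pytorch_lightning.strategies import HivemindStrategy

# Accumulate gradients across all peers until roughly 8192 samples have
# been processed, then take a global optimizer step.
trainer = pl.Trainer(
    strategy=HivemindStrategy(target_batch_size=8192),
    accelerator="gpu",
    devices=1,
)

# When offload_optimizer/delay_optimizer_step/delay_state_averaging are
# enabled, the optimizer is re-created internally, so a scheduler_fn that
# can re-create the scheduler must be provided.
trainer = pl.Trainer(
    strategy=HivemindStrategy(
        target_batch_size=8192,
        offload_optimizer=True,
        scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=0.99),
    ),
    accelerator="gpu",
    devices=1,
)
```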
- all_gather(tensor, group=None, sync_grads=False)¶
Perform an all_gather on all processes.
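For illustration, collectives are usually reached through the LightningModule helper, which delegates to the active strategy. A hedged assumption here: since each Hivemind peer runs as an independent single-process job, this strategy’s all_gather is expected to return the local tensor unchanged.

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        metric = torch.tensor(0.0, device=self.device)  # placeholder value
        # self.all_gather delegates to the active strategy. With one
        # process per Hivemind peer, the result is expected to be the
        # local tensor itself.
        return self.all_gather(metric, sync_grads=False)
```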
- barrier(*args, **kwargs)¶
Synchronizes all processes, blocking them until the whole group enters this function.
- broadcast(obj, src=0)¶
Broadcasts an object to all processes.
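A hedged sketch of the object-broadcast call, assuming access to a built trainer (the value 1234 is made up). With one process per Hivemind peer, the object is expected to come back unchanged:

```python
# Rank 0's object is returned on every process.
seed = trainer.strategy.broadcast(1234, src=0)
```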
- on_train_batch_start(batch, batch_idx, dataloader_idx=0)¶
Called in the training loop before anything happens for that batch.
- Return type: None
- reduce(tensor, *args, **kwargs)¶
Reduces the given tensor (e.g. across GPUs/processes).
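A sketch of reducing a metric through the strategy, assuming access to a built trainer. Under the assumption that each Hivemind peer is a single local process, the tensor is expected to be returned unchanged:

```python
import torch

loss = torch.tensor(0.25)  # placeholder metric
# Reduce across processes; with a single local process this is
# expected to return `loss` unchanged.
mean_loss = trainer.strategy.reduce(loss)
```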
- setup(trainer)¶
Sets up plugins for the trainer fit and creates optimizers.
- teardown()¶
This method is called to tear down the training process.
It is the right place to release memory and free other resources.
- Return type: None
- property is_global_zero: bool¶
Whether the current process is the rank zero process not only on the local node, but for all nodes.
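Typical use is to guard side effects that should happen once per run, for example:

```python
# Runs only on the global rank zero process across all nodes.
if trainer.is_global_zero:
    print("saving artifacts from the global rank zero process")
```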