HivemindStrategy
- class pytorch_lightning.strategies.HivemindStrategy(target_batch_size, run_id='lightning_run', batch_size=None, delay_state_averaging=False, delay_optimizer_step=None, delay_grad_averaging=False, offload_optimizer=None, reuse_grad_buffers=False, scheduler_fn=None, matchmaking_time=5.0, averaging_timeout=30.0, verbose=False, averager_opts=None, host_maddrs=None, initial_peers=None, **optimizer_kwargs)
Bases: pytorch_lightning.strategies.strategy.Strategy
Provides capabilities to train using the Hivemind library, which supports collaborative training across the internet on unreliable machines. For more information, refer to the docs.
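As a rough illustration of how the strategy is typically wired up, the sketch below passes a HivemindStrategy to a Trainer. The helper names make_trainer and local_batches_per_global_step are hypothetical. The arithmetic assumes every peer uses the same batch_size and contributes at the same rate, which real swarms rarely do. Imports are deferred so the snippet can be loaded without pytorch_lightning and hivemind installed.

```python
import math


def local_batches_per_global_step(target_batch_size: int, batch_size: int, num_peers: int) -> int:
    """Estimate how many local batches each peer processes per global step.

    Assumption (not taken from the library): all peers use the same batch_size
    and progress at the same speed, so each round adds batch_size * num_peers
    samples toward target_batch_size.
    """
    samples_per_round = batch_size * num_peers
    return math.ceil(target_batch_size / samples_per_round)


def make_trainer(target_batch_size: int = 8192):
    """Build a Trainer that trains through HivemindStrategy."""
    # Deferred imports: these need pytorch_lightning with hivemind installed.
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    strategy = HivemindStrategy(target_batch_size=target_batch_size)
    return pl.Trainer(strategy=strategy, accelerator="auto", devices=1)


if __name__ == "__main__":
    # Four peers contributing 32 samples each per local batch.
    print(local_batches_per_global_step(8192, 32, 4))
```

Under these assumptions, a larger target_batch_size means more local batches between communication rounds, which matches the parameter description: more work is done asynchronously before peers need to synchronize.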
Warning
HivemindStrategy is experimental and subject to change.
- Parameters
  - target_batch_size (int) – When training, the batch size to accumulate to before running a step. The larger this batch size, the more work can be done asynchronously without communication.
  - run_id (str) – A unique identifier of this training run, used as a common prefix for all DHT keys. See https://learning-at-home.readthedocs.io/en/latest/user/dht.html for details.
  - batch_size (Optional[int]) – The local batch size per process. If not provided, it is inferred lazily from the first batch of data passed in during training. This value should not change throughout training.
  - delay_state_averaging (bool) – If True, average parameters and extra tensors in a background thread; if False (the default in this signature), average parameters synchronously within the corresponding hivemind.Optimizer.step() call.
  - delay_optimizer_step (Optional[bool]) – Run the optimizer in the background and apply the results in a future .step() call. Requires offload_optimizer.
  - delay_grad_averaging (bool) – Average gradients in the background. Requires offload_optimizer and delay_optimizer_step.
  - offload_optimizer (Optional[bool]) – Offload the optimizer to host memory, saving GPU memory for parameters and gradients.
  - reuse_grad_buffers (bool) – Use the model’s gradient buffers (params.grad) for gradient accumulation, which is more memory efficient. Lightning will automatically disable zero_grad in the LightningModule.
  - scheduler_fn (Optional[Callable]) – callable(optimizer) -> PyTorch LRScheduler, or a pre-initialized PyTorch scheduler. When using offload_optimizer, delay_optimizer_step, or delay_state_averaging, scheduler_fn must be passed to the HivemindStrategy, because the optimizer is re-created and the scheduler needs to be re-created as well.
  - matchmaking_time (float) – When looking for a group, wait for peers to join for up to this many seconds. Increase if you see “averaged gradients with N peers” where N is below 0.9x the actual number of peers on >=25% of epochs. On a low-latency network, decreasing matchmaking_time allows training with smaller batch sizes.
  - averaging_timeout (float) – If an averaging step hangs for this long, it will be cancelled automatically. Increase averaging_timeout if you see “Proceeding with local gradients” at least 25% of the time. Do not set this timeout too high, as it may cause your optimizer to hang after some types of network errors.
  - verbose (bool) – Report internal Hivemind events such as accumulating gradients and running background tasks.
  - averager_opts (Optional[Dict]) – Additional keyword arguments forwarded to both GradientAverager and TrainingStateAverager.
  - host_maddrs (Optional[List]) – List of multi-addrs to create visible peers for other processes. See https://learning-at-home.readthedocs.io/en/latest/user/dht.html#running-across-the-internet for details.
  - initial_peers (Union[str, List, None]) – If connecting to a running process, a list of initial peers needs to be passed in. This can also be set via the env variable INITIAL_PEERS.
  - **optimizer_kwargs – Keyword arguments passed to the hivemind.Optimizer class.
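Several of the parameters above depend on one another. The sketch below encodes those documented dependencies in a small checker and then builds a strategy with all background options enabled. The names check_hivemind_flags and make_strategy are hypothetical helpers, not part of Lightning or Hivemind. The checker follows the wording of the parameter descriptions literally, so the library's own validation may differ, and the gamma value is an arbitrary placeholder.

```python
from functools import partial
from typing import Callable, List, Optional


def check_hivemind_flags(
    offload_optimizer: bool = False,
    delay_optimizer_step: bool = False,
    delay_grad_averaging: bool = False,
    delay_state_averaging: bool = False,
    scheduler_fn: Optional[Callable] = None,
) -> List[str]:
    """Return the documented requirements that a flag combination violates."""
    problems = []
    if delay_optimizer_step and not offload_optimizer:
        problems.append("delay_optimizer_step requires offload_optimizer")
    if delay_grad_averaging and not (offload_optimizer and delay_optimizer_step):
        problems.append(
            "delay_grad_averaging requires offload_optimizer and delay_optimizer_step"
        )
    needs_scheduler_fn = offload_optimizer or delay_optimizer_step or delay_state_averaging
    if needs_scheduler_fn and scheduler_fn is None:
        problems.append(
            "scheduler_fn is required with offload_optimizer, "
            "delay_optimizer_step or delay_state_averaging"
        )
    return problems


def make_strategy(target_batch_size: int = 8192, gamma: float = 0.9999):
    """Build a HivemindStrategy with every background option switched on."""
    # Deferred imports: these need torch and pytorch_lightning with hivemind.
    import torch
    from pytorch_lightning.strategies import HivemindStrategy

    flags = dict(
        offload_optimizer=True,
        delay_optimizer_step=True,
        delay_grad_averaging=True,
        delay_state_averaging=True,
    )
    # callable(optimizer) -> scheduler, so the scheduler can be re-created
    # together with the optimizer.
    scheduler_fn = partial(torch.optim.lr_scheduler.ExponentialLR, gamma=gamma)

    problems = check_hivemind_flags(scheduler_fn=scheduler_fn, **flags)
    if problems:
        raise ValueError("; ".join(problems))
    return HivemindStrategy(
        target_batch_size=target_batch_size, scheduler_fn=scheduler_fn, **flags
    )
```

Passing a partial rather than a scheduler instance keeps scheduler_fn in the callable(optimizer) form, which fits the stated reason for the parameter: the scheduler has to be rebuilt once the optimizer is re-created.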
- barrier(*args, **kwargs)
Synchronizes all processes, blocking each one until the whole group enters this function.
- on_train_batch_start(batch, batch_idx, dataloader_idx=0)
Called in the training loop before anything happens for that batch.
- Return type: None
- teardown()
This method is called to tear down the training process.
It is the right place to release memory and free other resources.
- Return type: None
- property is_global_zero: bool
Whether the current process is the rank zero process not only on the local node, but for all nodes.
- Return type: bool
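A common use of a rank-zero flag like this is to make sure one-off side effects, such as printing a summary, happen on a single process. The sketch below shows that pattern against any object exposing is_global_zero. The log_once helper and the SimpleNamespace stand-ins are hypothetical and only mimic the property so the example runs without Lightning installed.

```python
from types import SimpleNamespace


def log_once(strategy, message: str) -> bool:
    """Print message only on the global rank-zero process.

    Returns True if the message was printed, False otherwise.
    """
    if strategy.is_global_zero:
        print(message)
        return True
    return False


# Stand-ins for a strategy object (assumption: only is_global_zero is read).
leader = SimpleNamespace(is_global_zero=True)
follower = SimpleNamespace(is_global_zero=False)

if __name__ == "__main__":
    log_once(leader, "training finished")    # printed
    log_once(follower, "training finished")  # skipped
```

In real training code the same check would be made on the strategy attached to the Trainer, in place of these stand-ins.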
- property root_device: torch.device
Returns the root device.
- Return type: torch.device