BaguaStrategy

class pytorch_lightning.strategies.BaguaStrategy(algorithm='gradient_allreduce', flatten=True, accelerator=None, parallel_devices=None, cluster_environment=None, checkpoint_io=None, precision_plugin=None, **bagua_kwargs)[source]

Bases: pytorch_lightning.strategies.ddp.DDPStrategy

Strategy for training using the Bagua library, with advanced distributed training algorithms and system optimizations.

This strategy requires the bagua package to be installed. See installation guide for more information.

The BaguaStrategy is only supported on GPU and on Linux systems.

Parameters:
  • algorithm (str) – Distributed algorithm used to do the actual communication and update. Built-in algorithms include “gradient_allreduce”, “bytegrad”, “decentralized”, “low_precision_decentralized”, “qadam” and “async”.

  • flatten (bool) – Whether to flatten the Bagua communication buckets. The flatten operation resets the data pointers of the bucket tensors so that they can use faster code paths.

  • bagua_kwargs (Union[Any, Dict[str, Any]]) – Additional keyword arguments that will be passed to initialize the Bagua algorithm. More details on keyword arguments accepted for each algorithm can be found in the documentation.
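A minimal sketch of selecting this strategy in a Trainer. Actually training requires the bagua package and a Linux GPU machine, so only construction is shown here, guarded so the snippet degrades gracefully when those prerequisites are missing; the algorithm choice ("bytegrad") is one of the built-ins listed above.

```python
# Hedged sketch: passing BaguaStrategy to a Trainer. Requires bagua and
# a Linux GPU machine to actually run; only construction is attempted.
configured = False
try:
    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import BaguaStrategy

    trainer = Trainer(
        accelerator="gpu",
        devices=2,
        # "bytegrad" compresses gradients before communication
        strategy=BaguaStrategy(algorithm="bytegrad"),
    )
    configured = True
except Exception:
    # pytorch_lightning not installed, bagua unavailable, or no GPU present
    pass
```

Algorithm-specific options (e.g. for "qadam" or "async") are forwarded through **bagua_kwargs at construction time.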

barrier(*args, **kwargs)[source]

Synchronizes all processes, blocking each process until the whole group enters this function.

Parameters:

name – an optional name to pass into barrier.

Return type:

None

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

Parameters:
  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type:

TypeVar(TBroadcast)
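To illustrate the semantics, here is a toy single-process sketch (plain Python, no bagua required): each list entry stands for the object held on that rank before the call, and afterwards every rank holds the src rank's object.

```python
# Toy sketch of broadcast semantics; objs_per_rank[i] is the object held
# on rank i before the call. After broadcast, all ranks hold the object
# that rank `src` started with.
def simulated_broadcast(objs_per_rank, src=0):
    return [objs_per_rank[src] for _ in objs_per_rank]

print(simulated_broadcast(["r0-state", "r1-state", "r2-state"]))
# → ['r0-state', 'r0-state', 'r0-state']
```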

reduce(tensor, group=None, reduce_op='mean')[source]

Reduces a tensor from several distributed processes to one aggregated tensor.

Parameters:
  • tensor (Tensor) – The tensor to sync and reduce.

  • group (Optional[Any]) – The process group to gather results from. Defaults to all processes (world).

  • reduce_op (Union[ReduceOp, str, None]) – The reduction operation. Can also be a string ‘sum’ or ReduceOp.

Return type:

Tensor

Returns:

The reduced value, except when the input was not a tensor, in which case the output is unchanged.
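The reduce_op semantics can be sketched without any distributed setup: the per-rank scalars below stand in for the tensor values held on each process, and the helper (a hypothetical name, not part of the API) mirrors what 'mean' and 'sum' aggregation produce.

```python
# Sketch of reduce_op semantics, using a plain Python list to stand in
# for one scalar per process (no bagua or torch required here).
def simulated_reduce(per_rank_values, reduce_op="mean"):
    total = sum(per_rank_values)
    if reduce_op == "mean":
        return total / len(per_rank_values)
    if reduce_op == "sum":
        return total
    raise ValueError(f"unsupported reduce_op: {reduce_op}")

# e.g. a loss of 1.0, 2.0, 3.0 on ranks 0, 1, 2:
print(simulated_reduce([1.0, 2.0, 3.0]))          # → 2.0
print(simulated_reduce([1.0, 2.0, 3.0], "sum"))   # → 6.0
```

The default 'mean' is what you usually want for averaging a metric or loss across processes.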

setup(trainer)[source]

Sets up the plugins for the trainer's fit and creates the optimizers.

Parameters:

trainer (Trainer) – the trainer instance

Return type:

None

teardown()[source]

This method is called to tear down the training process.

It is the right place to release memory and free other resources.

Return type:

None