FSDPStrategy

class lightning_fabric.strategies.FSDPStrategy(accelerator=None, parallel_devices=None, cluster_environment=None, checkpoint_io=None, precision=None, process_group_backend=None, timeout=datetime.timedelta(seconds=1800), cpu_offload=None, backward_prefetch=None, mixed_precision=None, activation_checkpointing=None, **kwargs)[source]

Bases: lightning_fabric.strategies.parallel.ParallelStrategy, lightning_fabric.strategies.strategy._Sharded

Strategy for Fully Sharded Data Parallel provided by torch.distributed.

Warning

FSDPStrategy is in BETA and subject to change. The interface may introduce breaking changes and new features with upcoming releases of PyTorch.

Fully Sharded Training shards the entire model across all available GPUs, allowing you to scale model size, whilst using efficient communication to reduce overhead. In practice, this means we can remain at parity with PyTorch DDP, whilst scaling our model sizes dramatically. The technique is similar to ZeRO-Stage 3.

For more information, see the PyTorch FSDP documentation.

Defaults have been set and options have been exposed, but they may require configuration depending on your memory/speed efficiency requirements. We suggest having a look at this tutorial for more information; a minimal usage sketch also follows the parameter list below.

Parameters:
  • cpu_offload (Union[bool, CPUOffload, None]) – Enable offloading parameters and gradients to CPU to save GPU memory at the cost of speed. You can also pass a config: cpu_offload=CPUOffload(offload_params=True). Note that this currently implicitly enables gradient offloading to CPU so that parameters and gradients are on the same device and can work with the optimizer. This API is subject to change. Default: no offloading

  • backward_prefetch (Optional[BackwardPrefetch]) – This is an experimental feature that is subject to change in the near future. It allows users to enable two different backward prefetching algorithms to help overlap backward communication and computation. The pros and cons of each algorithm are explained in the BackwardPrefetch class.

  • mixed_precision (Optional[MixedPrecision]) – Mixed Precision config. By default, Lightning will enable FP16 if precision=16 or BF16 if precision=bf16 unless a config is passed in. This is only available in PyTorch 1.12 and later.

  • activation_checkpointing (Union[Type[Module], List[Type[Module]], None]) – A single layer or a list of layer classes for which you want to enable activation checkpointing. This is typically your transformer block (including attention + feed-forward). Enabling this can free up a significant amount of memory at the cost of speed since activations in these layers need to be recomputed during backpropagation.

  • **kwargs (Any) – Optional keyword arguments passed to the FSDP context manager which will configure the FSDP class when wrapping modules.
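
A minimal sketch of constructing the strategy and handing it to Fabric. The device count and the TransformerBlock class used for activation checkpointing are illustrative assumptions, not part of this API:

    import torch.nn as nn

    from lightning_fabric import Fabric
    from lightning_fabric.strategies import FSDPStrategy


    class TransformerBlock(nn.Module):
        """Hypothetical layer class; stands in for your own transformer block."""

        def __init__(self) -> None:
            super().__init__()
            self.ff = nn.Linear(128, 128)

        def forward(self, x):
            return self.ff(x)


    # Shard the model across all processes, offload parameters/gradients to CPU,
    # and recompute activations of TransformerBlock layers in the backward pass.
    strategy = FSDPStrategy(
        cpu_offload=True,
        activation_checkpointing=TransformerBlock,
    )
    fabric = Fabric(accelerator="cuda", devices=4, strategy=strategy)
    fabric.launch()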

all_reduce(tensor, group=None, reduce_op='mean')[source]

Reduces the given tensor (e.g. across GPUs/processes).

Parameters:
  • tensor (Tensor) – the tensor to sync and reduce

  • group (Optional[Any]) – the process group to reduce

  • reduce_op (Union[ReduceOp, str, None]) – the reduction operation. Defaults to ‘mean’. Can also be the string ‘sum’ or a ReduceOp.

Return type:

Tensor
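
A minimal sketch, assuming strategy is an FSDPStrategy whose distributed environment has already been initialized (for example by Fabric during launch):

    import torch

    # Each rank holds its own local value; reduce it across the process group.
    local_loss = torch.tensor(0.25, device=strategy.root_device)
    mean_loss = strategy.all_reduce(local_loss)                  # reduce_op defaults to 'mean'
    sum_loss = strategy.all_reduce(local_loss, reduce_op="sum")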

barrier(*args, **kwargs)[source]

Synchronizes all processes, blocking each process until the whole group has entered this function.

Parameters:

name – an optional name to pass into barrier.

Return type:

None
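
A minimal sketch with the same initialized strategy as above: rank 0 writes a file, and no rank continues until every process has reached the barrier:

    import torch

    if strategy.is_global_zero:
        torch.save({"step": 0}, "checkpoint.pt")
    # Block here until all ranks (including rank 0) arrive.
    strategy.barrier()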

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

Parameters:
  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type:

TypeVar(TBroadcast)
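
A minimal sketch: rank 0 generates a value (here a hypothetical run name) and broadcasts it so that all ranks end up with the same object:

    import uuid

    run_name = f"run-{uuid.uuid4().hex[:8]}" if strategy.is_global_zero else None
    run_name = strategy.broadcast(run_name, src=0)  # every rank now holds rank 0's value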

module_sharded_context()[source]

A context manager that wraps the instantiation of a torch.nn.Module and handles the sharding of parameters on creation.

By sharding layers directly on instantiation, one can reduce peak memory usage and initialization time.

Return type:

Generator
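
A minimal sketch, assuming a hypothetical BigModel class: instantiating the model inside the context shards its layers as they are created instead of first materializing the full model on every rank:

    with strategy.module_sharded_context():
        model = BigModel()  # hypothetical large nn.Module; parameters are sharded on creation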

module_to_device(module)[source]

Moves the model to the correct device.

Return type:

None

setup_environment()[source]

Set up any processes or distributed connections.

This must be called by the framework at the beginning of every process, before any distributed communication takes place.

Return type:

None

setup_module(module)[source]

Wraps the model into a FullyShardedDataParallel module.

Return type:

FullyShardedDataParallel

setup_module_and_optimizers(module, optimizers)[source]

Set up a model and multiple optimizers together.

The returned objects are expected to be in the same order they were passed in. The default implementation will call setup_module() and setup_optimizer() on the inputs.

Return type:

Tuple[Module, List[Optimizer]]

setup_optimizer(optimizer)[source]

Set up an optimizer for a model wrapped with FSDP.

This setup method doesn’t modify or wrap the optimizer. The only thing it currently does is verify that the optimizer was created after the model was wrapped with setup_module() and that it references the flattened parameters (see the sketch below).

Return type:

Optimizer
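
A minimal sketch of the order these hooks expect with FSDP: wrap the module with setup_module() first, then build the optimizer from the wrapped module’s (flattened) parameters, then set it up. BigModel is a hypothetical module class:

    import torch

    model = strategy.setup_module(BigModel())                   # returns a FullyShardedDataParallel instance
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # references the flattened parameters
    optimizer = strategy.setup_optimizer(optimizer)             # verifies the optimizer/module pairing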

property distributed_sampler_kwargs: Dict[str, Any]

Arguments for the DistributedSampler.

If this method is not defined, or it returns None, then the DistributedSampler will not be used.

property root_device: torch.device

Returns the root device.

