DDPFullyShardedNativeStrategy
- class pytorch_lightning.strategies.DDPFullyShardedNativeStrategy(accelerator=None, parallel_devices=None, cluster_environment=None, checkpoint_io=None, precision_plugin=None, process_group_backend=None, cpu_offload=None, backward_prefetch=None, mixed_precision=None, **kwargs)
Bases: pytorch_lightning.strategies.parallel.ParallelStrategy
Strategy for Fully Sharded Data Parallel provided by torch.distributed.
Warning
DDPFullyShardedNativeStrategy is in BETA and subject to change. The interface can bring breaking changes and new features with the next release of PyTorch.
Fully Sharded Training shards the entire model across all available GPUs, allowing you to scale model size while using efficient communication to reduce overhead. In practice, this means we can remain at parity with PyTorch DDP while scaling our model sizes dramatically. The technique is similar to ZeRO Stage 3.
For more information check out.
Defaults have been set and options have been exposed, but they may require configuration depending on the trade-off between memory and speed efficiency that you need. We suggest having a look at this tutorial for more information.
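To make the scaling claim concrete, here is a rough back-of-the-envelope sketch. It assumes that model state (weights, gradients and optimizer state) is sharded evenly across GPUs, as in ZeRO Stage 3, and that each parameter costs 16 bytes (an assumption: fp32 weights, fp32 gradients and two fp32 Adam moments). The helper name `sharded_bytes_per_gpu` is hypothetical and not part of Lightning, and the estimate ignores activations, buffers and temporarily gathered parameters.

```python
def sharded_bytes_per_gpu(num_params: int, world_size: int, bytes_per_param: int = 16) -> float:
    """Estimate per-GPU bytes of model state when it is evenly sharded.

    bytes_per_param=16 is an illustrative assumption:
    4 (fp32 weights) + 4 (fp32 gradients) + 8 (two fp32 Adam moments).
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return num_params * bytes_per_param / world_size


# A 1-billion-parameter model: fully replicated vs. sharded over 8 GPUs.
replicated = sharded_bytes_per_gpu(1_000_000_000, world_size=1)
sharded = sharded_bytes_per_gpu(1_000_000_000, world_size=8)
print(f"{replicated / 1e9:.0f} GB per GPU replicated, {sharded / 1e9:.0f} GB per GPU sharded")
```

Under these assumptions the model state per GPU shrinks linearly with the number of GPUs, which is what lets model size grow well beyond what a single device can hold.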
- Parameters:
cpu_offload (Optional[CPUOffload]) – CPU offloading config. Currently, only parameter and gradient CPU offload is supported. It can be enabled by passing in cpu_offload=CPUOffload(offload_params=True). Note that this currently implicitly enables gradient offloading to CPU so that params and grads are on the same device to work with the optimizer. This API is subject to change. Default is None, in which case there will be no offloading.
backward_prefetch (Optional[BackwardPrefetch]) – This is an experimental feature that is subject to change in the near future. It allows users to enable two different backward_prefetch algorithms to help overlap backward communication and computation. The pros and cons of each algorithm are explained in the class BackwardPrefetch.
mixed_precision (Optional[MixedPrecision]) – Mixed Precision config. By default, Lightning will enable FP16 if precision=16 or BF16 if precision=bf16, unless a config is passed in. This is only available in PyTorch 1.12 and later.
**kwargs (Any) – Passed to the FSDP context manager, which will configure the FSDP class when wrapping modules.
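A minimal sketch of how these parameters might be passed, assuming a PyTorch Lightning release that ships this class, PyTorch 1.12 or later, and a machine with at least two GPUs. The function name `build_trainer`, the choice of `BackwardPrefetch.BACKWARD_PRE` and `devices=2` are illustrative assumptions, not recommended settings.

```python
def build_trainer():
    """Build a Trainer that uses native FSDP with parameter offloading.

    Imports are kept inside the function so that this file can be imported
    on machines without the libraries installed or without GPUs.
    """
    from torch.distributed.fsdp import BackwardPrefetch, CPUOffload
    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPFullyShardedNativeStrategy

    strategy = DDPFullyShardedNativeStrategy(
        # Offload parameters to CPU; per the docs above this also
        # implicitly offloads gradients.
        cpu_offload=CPUOffload(offload_params=True),
        # One of the two prefetch algorithms described in BackwardPrefetch.
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    )
    # precision=16 lets Lightning enable FP16 mixed precision by default,
    # because no explicit mixed_precision config is passed to the strategy.
    return Trainer(accelerator="gpu", devices=2, strategy=strategy, precision=16)
```

Calling `build_trainer()` only makes sense on a suitably equipped machine; the resulting Trainer is then used with `fit` as usual.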
- barrier(name=None)
Synchronizes all processes by blocking each one until the whole group has entered this function.
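As a single-machine analogy only, the blocking behaviour can be sketched with threads standing in for distributed processes and Python's `threading.Barrier` standing in for the distributed barrier. The function `run_with_barrier` is hypothetical and not part of Lightning.

```python
import threading


def run_with_barrier(num_workers: int = 4) -> list:
    """Run workers that record an event before and after a shared barrier."""
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    events = []

    def worker(rank: int) -> None:
        with lock:
            events.append(("before", rank))
        # Blocks until all num_workers threads have reached this call.
        barrier.wait()
        with lock:
            events.append(("after", rank))

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


phases = [phase for phase, _ in run_with_barrier(4)]
print(phases)
```

No worker records its "after" event until every worker has recorded "before", which mirrors how no process leaves the barrier until the whole group has entered it.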
- model_sharded_context()
Provide a hook to create modules in a distributed-aware context. This is useful when we'd like to shard the model as soon as it is created, which saves memory and initialization time for extremely large models.
Returns: Model parallel context.
- Return type:
- reduce(tensor, group=None, reduce_op='mean')
Reduces a tensor from several distributed processes to one aggregated tensor.
- Parameters:
tensor (Union[Tensor, Any]) – the tensor to sync and reduce
group (Optional[Any]) – the process group to gather results from. Defaults to all processes (world)
reduce_op (Union[ReduceOp, str, None]) – the reduction operation. Defaults to 'mean'/'avg'. Can also be the string 'sum' to calculate the sum during reduction.
- Return type:
- Returns:
The reduced value. If the input was not a tensor, the output is returned unchanged.
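The semantics described above can be modelled in a single process with a toy sketch. This is not Lightning's implementation: the hypothetical helper `reduce_across_ranks` takes one value per rank in a plain list, and Python numbers stand in for tensors.

```python
from typing import Any, List


def reduce_across_ranks(values: List[Any], reduce_op: str = "mean") -> Any:
    """Toy model of reduce: `values` holds the input from each rank.

    Numbers stand in for tensors; anything else is treated as a
    non-tensor input and returned unchanged.
    """
    local_value = values[0]
    if not isinstance(local_value, (int, float)):
        return local_value
    total = sum(values)
    if reduce_op == "sum":
        return total
    if reduce_op in ("mean", "avg"):
        return total / len(values)
    raise ValueError(f"unsupported reduce_op: {reduce_op!r}")


per_rank_loss = [1.0, 2.0, 3.0, 6.0]
print(reduce_across_ranks(per_rank_loss))         # default 'mean'
print(reduce_across_ranks(per_rank_loss, "sum"))  # sum over ranks
print(reduce_across_ranks(["not a tensor"] * 4))  # passed through unchanged
```

In a real run each process holds only its own value and the aggregation happens through collective communication over the chosen process group.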
- setup_environment()
Set up any processes or distributed connections.
This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.
- Return type:
- teardown()
This method is called to tear down the training process.
It is the right place to release memory and free other resources.
- Return type:
- training_step(*args, **kwargs)
The actual training step.
See training_step() for more details.
- validation_step(*args, **kwargs)
The actual validation step.
See validation_step() for more details.
- property root_device: torch.device
Return the root device.