ModelParallelStrategy

class lightning.fabric.strategies.ModelParallelStrategy(parallelize_fn, data_parallel_size='auto', tensor_parallel_size='auto', save_distributed_checkpoint=True, process_group_backend=None, timeout=datetime.timedelta(seconds=1800))[source]

Bases: ParallelStrategy

Enables user-defined parallelism applied to a model.

Warning

This is an experimental feature.

Currently supports up to 2D parallelism. Specifically, it supports the combination of Fully Sharded Data-Parallel 2 (FSDP2) with Tensor Parallelism (DTensor). These PyTorch APIs are themselves still experimental. Requires PyTorch 2.4 or newer.

Parameters:
  • parallelize_fn (Callable[[TypeVar(TModel, bound= Module), DeviceMesh], TypeVar(TModel, bound= Module)]) – A function that applies parallelisms to a module. The strategy will provide the model and device mesh as input, as shown in the sketch after this parameter list.

  • data_parallel_size (Union[Literal['auto'], int]) – The number of devices within a data-parallel group. Defaults to "auto", which sets this size to the number of nodes in the cluster.

  • tensor_parallel_size (Union[Literal['auto'], int]) – The number of devices within a tensor-parallel group. Defaults to "auto", which sets this size to the number of GPUs in a single node.

  • save_distributed_checkpoint (bool) – If True, each rank saves its shard of weights and optimizer states to a file. The checkpoint is a folder with as many files as the world size. If False, the full weights and optimizer states get assembled on rank 0 and saved to a single file.
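
A minimal usage sketch, assuming the device mesh passed to parallelize_fn exposes the dimension names "data_parallel" and "tensor_parallel" (matching the parameter names above); the FeedForward model and its submodule names w1 and w2 are made up for illustration:

import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

from lightning.fabric import Fabric
from lightning.fabric.strategies import ModelParallelStrategy


class FeedForward(nn.Module):
    # Toy two-layer model, used only for illustration
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.w2(self.w1(x).relu())


def parallelize(model, device_mesh):
    # Tensor parallelism: shard w1 column-wise and w2 row-wise across the
    # tensor-parallel mesh dimension
    tp_plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
    parallelize_module(model, device_mesh["tensor_parallel"], tp_plan)
    # FSDP2: shard the resulting parameters across the data-parallel dimension
    fully_shard(model, mesh=device_mesh["data_parallel"])
    return model


# 2 x 2 mesh over 4 GPUs: 2 data-parallel groups of 2 tensor-parallel devices
strategy = ModelParallelStrategy(parallelize_fn=parallelize, data_parallel_size=2, tensor_parallel_size=2)
fabric = Fabric(accelerator="cuda", devices=4, strategy=strategy)
fabric.launch()

with fabric.init_module(empty_init=True):
    model = FeedForward(dim=1024, hidden_dim=4096)

# fabric.setup() invokes parallelize() through setup_module() below
model = fabric.setup(model)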

_configure_launcher()[source]

Attaches the launcher for this strategy.

Return type:

None

all_reduce(tensor, group=None, reduce_op='mean')[source]

Reduces the given tensor (e.g. across GPUs/processes).

Parameters:
  • tensor (Tensor) – the tensor to sync and reduce

  • group (Optional[Any]) – the process group to reduce across

  • reduce_op (Union[ReduceOp, str, None]) – the reduction operation. Defaults to ‘mean’. Can also be the string ‘sum’ or a ReduceOp.

Return type:

Tensor
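
For example, a minimal sketch assuming strategy is an already set-up ModelParallelStrategy instance:

import torch

# Average a per-rank scalar metric across all processes (names are assumptions)
local_loss = torch.tensor(0.42, device=strategy.root_device)
global_loss = strategy.all_reduce(local_loss, reduce_op="mean")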

barrier(*args, **kwargs)[source]

Synchronizes all processes by blocking each process until the whole group has entered this function.

Parameters:

name – an optional name to pass into barrier.

Return type:

None

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

Parameters:
  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type:

TypeVar(TBroadcast)
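
For example, a sketch that shares a value chosen on rank 0 with every process (strategy is assumed to be already set up):

import random

# Pick a seed on rank 0 and broadcast it so all ranks agree on the same value
seed = random.randint(0, 2**31 - 1)
seed = strategy.broadcast(seed, src=0)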

load_checkpoint(path, state=None, strict=True)[source]

Load the contents from a checkpoint and restore the state of the given objects.

Return type:

Dict[str, Any]
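
A sketch of restoring state previously written by save_checkpoint(); the object names and the path are assumptions:

# Objects passed in `state` are restored in place; any remaining items found
# in the checkpoint may be returned in the resulting dictionary.
state = {"model": model, "optimizer": optimizer, "step": 0}
extra = strategy.load_checkpoint(path="checkpoints/step-1000", state=state)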

module_init_context(empty_init=None)[source]

A context manager wrapping the model instantiation.

Here, the strategy can control how the parameters of the model get created (device, dtype) and/or apply other patches to the model.

Parameters:

empty_init (Optional[bool]) – Whether to initialize the model with empty weights (uninitialized memory). If None, the strategy will decide. Some strategies may not support all options.

Return type:

ContextManager
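
For example, a sketch of delayed weight initialization; in end-user code this context is typically entered through Fabric.init_module():

# Create the model with uninitialized (empty) weights so large models don't
# need to be fully materialized on a single device first.
# `FeedForward` is the toy model from the class-level example above.
with strategy.module_init_context(empty_init=True):
    model = FeedForward(dim=1024, hidden_dim=4096)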

module_to_device(module)[source]

Moves the model to the correct device.

Return type:

None

save_checkpoint(path, state, storage_options=None, filter=None)[source]

Save model, optimizer, and other state to a checkpoint on disk.

If distributed checkpointing is enabled (default), the checkpoint gets saved as a directory containing one file per process, with the model and optimizer shards stored in those files. Additionally, it creates a metadata file meta.pt with the rest of the user’s state (only saved from rank 0). If distributed checkpointing is disabled (save_distributed_checkpoint=False), the checkpoint will be written to a single file containing the weights, optimizer state, and other metadata.

Return type:

None
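
A sketch, assuming model, optimizer, and step exist in the training script:

# With the default save_distributed_checkpoint=True, `path` becomes a directory
# with one checkpoint file per rank plus a meta.pt for the remaining state.
state = {"model": model, "optimizer": optimizer, "step": step}
strategy.save_checkpoint(path="checkpoints/step-1000", state=state)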

setup_environment()[source]

Set up any processes or distributed connections.

This must be called by the framework at the beginning of every process, before any distributed communication takes place.

Return type:

None

setup_module(module)[source]

Performs setup for the model, e.g., by wrapping it in another class.

Return type:

Module

property distributed_sampler_kwargs: Dict[str, Any]

Arguments for the DistributedSampler.

If this method is not defined, or it returns None, then the DistributedSampler will not be used.

property root_device: device

Returns the root device.