ModelParallelStrategy

class lightning.pytorch.strategies.ModelParallelStrategy(data_parallel_size='auto', tensor_parallel_size='auto', save_distributed_checkpoint=True, process_group_backend=None, timeout=datetime.timedelta(seconds=1800))[source]

Bases: ParallelStrategy

Enables user-defined parallelism applied to a model.

Warning

This is an experimental feature.

Currently supports up to 2D parallelism, specifically the combination of Fully Sharded Data-Parallel 2 (FSDP2) with Tensor Parallelism (DTensor). These PyTorch APIs are themselves still experimental (see https://pytorch.org/docs/stable/distributed.tensor.parallel.html). Requires PyTorch 2.4 or newer. A usage sketch follows the parameter list below.

Parameters:
  • data_parallel_size (Union[Literal['auto'], int]) – The number of devices within a data-parallel group. Defaults to "auto", which sets this size to the number of nodes in the cluster.

  • tensor_parallel_size (Union[Literal['auto'], int]) – The number of devices within a tensor-parallel group. Defaults to "auto", which sets this size to the number of GPUs in a single node.

  • save_distributed_checkpoint (bool) – If True, each rank saves its shard of weights and optimizer states to a file. The checkpoint is a folder with as many files as the world size. If False, the full weights and optimizer states get assembled on rank 0 and saved to a single file.
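Example

A minimal usage sketch. The ToyModel, the 2-node / 4-GPU layout, and the training details below are illustrative assumptions; in real use the model's submodules would additionally be parallelized by the user (e.g. with the PyTorch DTensor APIs linked above), which this reference does not cover.

    import torch
    import lightning.pytorch as pl
    from lightning.pytorch.strategies import ModelParallelStrategy

    class ToyModel(pl.LightningModule):
        # Illustrative module; a real model would parallelize its submodules
        # (e.g. via the DTensor tensor-parallel APIs referenced above).
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    # With 2 nodes of 4 GPUs each this forms a 2 x 4 device mesh:
    # 2 data-parallel (FSDP2) groups, each 4-way tensor parallel.
    strategy = ModelParallelStrategy(
        data_parallel_size=2,
        tensor_parallel_size=4,
        save_distributed_checkpoint=True,  # each rank writes its own shard
    )

    trainer = pl.Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy=strategy)
    # trainer.fit(ToyModel(), train_dataloaders=...)  # dataloader omitted

Passing "auto" for both sizes reproduces the defaults described above: one data-parallel group per node, and tensor parallelism across the GPUs within each node.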

barrier(name=None)[source]

Synchronizes all processes, blocking them until the whole group enters this function.

Parameters:

name (Optional[str]) – an optional name to pass into barrier.

Return type:

None
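A sketch of reaching the barrier from user code through the trainer's strategy attribute; the hook body and the prepare_dataset helper are hypothetical.

    # Inside a LightningModule or Callback hook (illustrative):
    def on_fit_start(self):
        if self.trainer.is_global_zero:
            prepare_dataset("/tmp/data")  # hypothetical helper, run only on rank 0
        # block every rank here until the whole group (including rank 0) arrives
        self.trainer.strategy.barrier("wait_for_data")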

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

Parameters:
  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type:

TypeVar(TBroadcast)
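A sketch of sharing a rank-0 value with every other rank; the generate_run_id helper is hypothetical.

    # Inside a LightningModule hook (illustrative):
    def on_train_start(self):
        run_id = generate_run_id() if self.trainer.is_global_zero else None  # hypothetical helper
        # after the broadcast, every rank holds the object created on rank 0
        self.run_id = self.trainer.strategy.broadcast(run_id, src=0)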

lightning_module_state_dict()[source]

Collects the state dict of the model.

Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.

Return type:

Dict[str, Any]

model_to_device()[source]

Moves the model to the correct device.

Return type:

None

optimizer_state(optimizer)[source]

Collects the state of the given optimizer.

Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.

Return type:

Dict[str, Any]

reduce(tensor, group=None, reduce_op='mean')[source]

Reduces the given tensor (e.g. across GPUs/processes).

Parameters:
  • tensor (Union[Tensor, Any]) – the tensor to sync and reduce

  • group (Optional[Any]) – the process group to reduce

  • reduce_op (Union[ReduceOp, str, None]) – the reduction operation. Defaults to ‘mean’. Can also be the string ‘sum’ or a ReduceOp instance.

Return type:

Tensor
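A sketch of reducing a locally computed tensor across processes; the metric value is a placeholder.

    import torch

    # Inside a LightningModule hook (illustrative):
    def on_validation_epoch_end(self):
        local_metric = torch.tensor(0.5, device=self.device)  # placeholder value
        mean_metric = self.trainer.strategy.reduce(local_metric, reduce_op="mean")
        sum_metric = self.trainer.strategy.reduce(local_metric, reduce_op="sum")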

save_checkpoint(checkpoint, filepath, storage_options=None)[source]

Save model/training states as a checkpoint file by dumping the state and writing it to the target path.

Parameters:
  • checkpoint (Dict[str, Any]) – dict containing model and trainer state

  • filepath (Union[str, Path]) – write-target file’s path

  • storage_options (Optional[Any]) – parameter for how to save to storage, passed to CheckpointIO plugin

Return type:

None
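In normal use this is invoked indirectly through the trainer; a sketch continuing the configuration example above:

    # Illustrative: Trainer.save_checkpoint delegates to the strategy. With
    # save_distributed_checkpoint=True the target path becomes a directory
    # containing one shard file per rank; with False, rank 0 writes a single
    # consolidated file.
    trainer.save_checkpoint("checkpoints/step_1000.ckpt")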

setup(trainer)[source]

Sets up the accelerator and plugins, and initializes the optimizers (if needed).

Parameters:

trainer (Trainer) – the trainer instance

Return type:

None

setup_environment()[source]

Set up any processes or distributed connections.

This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.

Return type:

None

setup_optimizers(trainer)[source]

Creates optimizers and schedulers.

Parameters:

trainer (Trainer) – the Trainer to which these optimizers should be connected

Return type:

None

teardown()[source]

This method is called to tear down the training process.

It is the right place to release memory and free other resources.

Return type:

None

tensor_init_context(empty_init=None)[source]

Controls how tensors get created (device, dtype).

Parameters:

empty_init (Optional[bool]) – Whether to initialize the model with empty weights (uninitialized memory). If None, the strategy will decide. Some strategies may not support all options.

Return type:

Generator[None, None, None]
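During training the Trainer enters this context itself around model instantiation; the direct call below is only a sketch to illustrate the empty_init parameter, reusing the strategy instance from the configuration example above.

    import torch

    # Illustrative: create parameters without initializing their memory,
    # deferring materialization until the strategy shards/moves them.
    with strategy.tensor_init_context(empty_init=True):
        big_layer = torch.nn.Linear(8192, 8192)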

property lightning_restore_optimizer: bool

Override to disable Lightning restoring optimizers/schedulers.

This is useful for strategies which manage restoring optimizers/schedulers.

property restore_checkpoint_after_setup: bool

Override to delay restoring from checkpoint until after the setup phase has completed. This is useful when the strategy requires all the setup hooks to run before loading a checkpoint.

Returns:

If True, restore checkpoint after strategy setup.

property root_device: device

Return the root device.