ModelParallelStrategy

class lightning.pytorch.strategies.ModelParallelStrategy(data_parallel_size='auto', tensor_parallel_size='auto', save_distributed_checkpoint=True, process_group_backend=None, timeout=datetime.timedelta(seconds=1800))[source]

Bases: ParallelStrategy

Enables user-defined parallelism applied to a model.

Warning

This is an experimental feature.

Currently supports up to 2D parallelism, specifically the combination of Fully Sharded Data-Parallel 2 (FSDP2) with Tensor Parallelism (DTensor). These APIs are currently still experimental in PyTorch (see https://pytorch.org/docs/stable/distributed.tensor.parallel.html). Requires PyTorch 2.4 or newer.

Parameters:
  • data_parallel_size (Union[Literal['auto'], int]) – The number of devices within a data-parallel group. Defaults to "auto", which sets this size to the number of nodes in the cluster.

  • tensor_parallel_size (Union[Literal['auto'], int]) – The number of devices within a tensor-parallel group. Defaults to "auto", which sets this size to the number of GPUs in a single node.

  • save_distributed_checkpoint (bool) – If True, each rank saves its shard of weights and optimizer states to a file. The checkpoint is a folder with as many files as the world size. If False, the full weights and optimizer states get assembled on rank 0 and saved to a single file.
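
The sketch below shows one way to use this strategy with a Trainer. The model is parallelized in the LightningModule's configure_model hook; accessing the device mesh via self.device_mesh and the mesh dimension name "tensor_parallel" follow Lightning's tensor-parallel examples and should be treated as assumptions here, not part of this class's signature.

    import torch.nn as nn
    import lightning as L
    from lightning.pytorch.strategies import ModelParallelStrategy
    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

    class LitModel(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

        def configure_model(self):
            # Shard the two linear layers across the tensor-parallel group.
            # The mesh dimension name is assumed from Lightning's examples.
            tp_mesh = self.device_mesh["tensor_parallel"]
            plan = {"0": ColwiseParallel(), "2": RowwiseParallel()}
            parallelize_module(self.net, tp_mesh, plan)

    # On 2 nodes with 4 GPUs each, the "auto" defaults give
    # data_parallel_size=2 and tensor_parallel_size=4.
    trainer = L.Trainer(
        accelerator="cuda",
        devices=4,
        num_nodes=2,
        strategy=ModelParallelStrategy(),
    )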

barrier(name=None)[source]

Synchronizes all processes, blocking each process until the whole group has entered this function.

Parameters:

name (Optional[str]) – an optional name to pass into barrier.

Return type:

None
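
A minimal sketch of calling barrier() from a LightningModule hook so that all ranks wait until rank 0 has created a shared directory; the directory name is only an illustration.

    import os
    import lightning as L

    class MyModule(L.LightningModule):
        def on_fit_start(self):
            # Rank 0 prepares a shared directory; every other rank waits at the
            # barrier until it exists.
            if self.trainer.global_rank == 0:
                os.makedirs("shared_cache", exist_ok=True)
            self.trainer.strategy.barrier(name="shared_cache_ready")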

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

Parameters:
  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type:

TypeVar(TBroadcast)
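
A sketch of broadcasting a Python object chosen on rank 0 to all other ranks; the run identifier is a made-up value.

    import lightning as L

    class MyModule(L.LightningModule):
        def on_fit_start(self):
            # Only rank 0 knows the value; after the broadcast, every rank holds it.
            run_id = "run-001" if self.trainer.global_rank == 0 else None
            self.run_id = self.trainer.strategy.broadcast(run_id, src=0)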

lightning_module_state_dict()[source]

Collects the state dict of the model.

Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.

Return type:

dict[str, Any]
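
A sketch illustrating the note above, assuming trainer was configured with save_distributed_checkpoint=False.

    # Only rank 0 receives the assembled state dict; other ranks get an empty dict.
    state = trainer.strategy.lightning_module_state_dict()
    if trainer.is_global_zero:
        print(f"gathered {len(state)} entries on rank 0")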

model_to_device()[source]

Moves the model to the correct device.

Return type:

None

optimizer_state(optimizer)[source]

Collects the state of the given optimizer.

Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.

Return type:

dict[str, Any]

reduce(tensor, group=None, reduce_op='mean')[source]

Reduces the given tensor (e.g. across GPUs/processes).

Parameters:
  • tensor (Union[Tensor, Any]) – the tensor to sync and reduce

  • group (Optional[Any]) – the process group to reduce

  • reduce_op (Union[ReduceOp, str, None]) – the reduction operation. Defaults to 'mean'. Can also be the string 'sum' or a ReduceOp instance.

Return type:

Tensor
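
A sketch that averages a per-rank scalar across all processes; the tracked attribute self.samples_per_sec is hypothetical.

    import torch
    import lightning as L

    class MyModule(L.LightningModule):
        def on_train_epoch_end(self):
            # `self.samples_per_sec` is a hypothetical per-rank measurement.
            local = torch.tensor(self.samples_per_sec, device=self.device)
            avg = self.trainer.strategy.reduce(local, reduce_op="mean")
            if self.trainer.is_global_zero:
                print(f"mean samples/sec across ranks: {avg.item():.1f}")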

save_checkpoint(checkpoint, filepath, storage_options=None)[source]

Saves model/training states as a checkpoint file by dumping the state and writing it to disk.

Parameters:
  • checkpoint (dict[str, Any]) – dict containing model and trainer state

  • filepath (Union[str, Path]) – the path to write the checkpoint to

  • storage_options (Optional[Any]) – parameter for how to save to storage, passed to CheckpointIO plugin

Return type:

None
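
Checkpoints are normally written through the Trainer rather than by calling this method directly; the path below is illustrative.

    from lightning.pytorch import Trainer
    from lightning.pytorch.strategies import ModelParallelStrategy

    trainer = Trainer(strategy=ModelParallelStrategy(), accelerator="cuda", devices=4)
    # ... trainer.fit(model) ...
    trainer.save_checkpoint("checkpoints/step_1000.ckpt")
    # With save_distributed_checkpoint=True (the default), the path becomes a
    # folder containing one shard per rank; with False, rank 0 writes a single file.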

setup(trainer)[source]

Sets up the accelerator and plugins, and initializes the optimizers (if needed).

Parameters:

trainer (Trainer) – the trainer instance

Return type:

None

setup_environment()[source]

Sets up any processes or distributed connections.

This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.

Return type:

None

setup_optimizers(trainer)[source]

Creates optimizers and schedulers.

Parameters:

trainer (Trainer) – the Trainer instance the optimizers should be connected to

Return type:

None

teardown()[source]

This method is called to tear down the training process.

It is the right place to release memory and free other resources.

Return type:

None

tensor_init_context(empty_init=None)[source]

Controls how tensors get created (device, dtype).

Parameters:

empty_init (Optional[bool]) – Whether to initialize the model with empty weights (uninitialized memory). If None, the strategy will decide. Some strategies may not support all options.

Return type:

Generator[None, None, None]
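
A sketch of creating a module with uninitialized (empty) weights, assuming the strategy is accessed through an already-configured trainer.

    import torch.nn as nn

    # Skip weight initialization; useful when the values will be overwritten by
    # a checkpoint load right afterwards.
    with trainer.strategy.tensor_init_context(empty_init=True):
        layer = nn.Linear(8192, 8192)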

property lightning_restore_optimizer: bool

Override to disable Lightning from restoring optimizers/schedulers.

This is useful for strategies which manage restoring optimizers/schedulers.

property restore_checkpoint_after_setup: bool

Override to delay restoring from checkpoint until after the setup phase has completed. This is useful when the strategy requires all the setup hooks to run before loading the checkpoint.

Returns:

If True, restore checkpoint after strategy setup.

property root_device: device

Return the root device.