ModelParallelStrategy¶
- class lightning.pytorch.strategies.ModelParallelStrategy(data_parallel_size='auto', tensor_parallel_size='auto', save_distributed_checkpoint=True, process_group_backend=None, timeout=datetime.timedelta(seconds=1800))[source]¶
Bases: ParallelStrategy
Enables user-defined parallelism applied to a model.
Warning
This is an experimental feature.
Currently supports up to 2D parallelism, specifically the combination of Fully Sharded Data Parallel 2 (FSDP2) with Tensor Parallelism (DTensor). These PyTorch APIs are still experimental (see https://pytorch.org/docs/stable/distributed.tensor.parallel.html). Requires PyTorch 2.4 or newer.
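The sketch below shows one way a LightningModule might define its parallelism inside configure_model(), applying a tensor-parallel plan with the device mesh that the strategy creates. The mesh dimension name "tensor_parallel", the FeedForward layer names, and all sizes are illustrative assumptions, not part of this API reference.

```python
import torch
import torch.nn as nn
import lightning.pytorch as pl
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FeedForward(nn.Module):
    """Toy two-layer MLP used only for illustration."""

    def __init__(self, dim: int, hidden: int) -> None:
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))


class LitModel(pl.LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.model = FeedForward(256, 1024)

    def configure_model(self) -> None:
        # The strategy creates a device mesh for the module; the
        # "tensor_parallel" submesh name used here is an assumption based on
        # the strategy's data-parallel/tensor-parallel layout.
        tp_mesh = self.device_mesh["tensor_parallel"]
        plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
        parallelize_module(self.model, tp_mesh, plan)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)
```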
- Parameters:
  - data_parallel_size¶ (Union[Literal['auto'], int]) – The number of devices within a data-parallel group. Defaults to "auto", which sets this size to the number of nodes in the cluster.
  - tensor_parallel_size¶ (Union[Literal['auto'], int]) – The number of devices within a tensor-parallel group. Defaults to "auto", which sets this size to the number of GPUs in a single node.
  - save_distributed_checkpoint¶ (bool) – If True, each rank saves its shard of weights and optimizer states to a file. The checkpoint is a folder with as many files as the world size. If False, the full weights and optimizer states get assembled on rank 0 and saved to a single file.
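A minimal configuration sketch tying these parameters together, assuming a single node with 4 GPUs (the product of the two sizes must equal the total number of devices):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import ModelParallelStrategy

# Hypothetical layout: 2 data-parallel (FSDP2) groups x 2 tensor-parallel devices = 4 GPUs.
strategy = ModelParallelStrategy(
    data_parallel_size=2,
    tensor_parallel_size=2,
    save_distributed_checkpoint=True,  # write one checkpoint shard per rank
)
trainer = Trainer(accelerator="gpu", devices=4, strategy=strategy)
```

With the defaults ("auto"), the data-parallel size falls back to the number of nodes and the tensor-parallel size to the number of GPUs in a single node.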
- barrier(name=None)[source]¶
Synchronizes all processes, blocking them until the whole group enters this function.
- lightning_module_state_dict()[source]¶
Collects the state dict of the model.
Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.
- optimizer_state(optimizer)[source]¶
Collects the state of the given optimizer.
Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.
- reduce(tensor, group=None, reduce_op='mean')[source]¶
Reduces the given tensor (e.g. across GPUs/processes).
- save_checkpoint(checkpoint, filepath, storage_options=None)[source]¶
Save model/training states as a checkpoint file through state-dump and file-write.
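In practice, save_checkpoint() is invoked by the Trainer rather than called directly. A sketch of the usual flow, reusing the hypothetical LitModel and strategy from the sketches above plus a toy random dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data matching the FeedForward(256, 1024) sketch above (illustration only).
dataset = TensorDataset(torch.randn(64, 256), torch.randn(64, 256))
loader = DataLoader(dataset, batch_size=8)

trainer.fit(LitModel(), loader)

# With save_distributed_checkpoint=True the given path becomes a directory
# containing one shard file per rank; with False, the full state is assembled
# on rank 0 and written as a single file.
trainer.save_checkpoint("checkpoints/model.ckpt")
```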
- setup(trainer)[source]¶
Sets up the accelerator, plugins and initializes the optimizers (if needed).
- setup_environment()[source]¶
Sets up any processes or distributed connections.
This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.
- Return type: None
- teardown()[source]¶
This method is called to tear down the training process.
It is the right place to release memory and free other resources.
- Return type: None
- property lightning_restore_optimizer: bool¶
Override to disable Lightning from restoring optimizers/schedulers.
This is useful for strategies that manage restoring optimizers/schedulers themselves.