ModelParallelStrategy
- class lightning.pytorch.strategies.ModelParallelStrategy(data_parallel_size='auto', tensor_parallel_size='auto', save_distributed_checkpoint=True, process_group_backend=None, timeout=datetime.timedelta(seconds=1800))[source]
Bases: ParallelStrategy
Enables user-defined parallelism applied to a model.
Warning
This is an experimental feature.
Currently supports up to 2D parallelism: specifically, the combination of Fully Sharded Data Parallelism (FSDP2) with Tensor Parallelism (DTensor). These PyTorch APIs are themselves still experimental (see https://pytorch.org/docs/stable/distributed.tensor.parallel.html). Requires PyTorch 2.4 or newer.
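Example: a minimal sketch of applying a tensor-parallel plan from LightningModule.configure_model, which is where this strategy expects user-defined parallelism to be set up. The mesh dimension name "tensor_parallel", the model shape, and the layer names in the plan are assumptions for illustration; adapt them to your model.

    import lightning as L
    import torch.nn as nn
    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module


    class FeedForward(nn.Module):
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden_dim, bias=False)
            self.w2 = nn.Linear(hidden_dim, dim, bias=False)

        def forward(self, x):
            return self.w2(self.w1(x).relu())


    class LitModel(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = FeedForward(8192, 8192)

        def configure_model(self):
            # The strategy builds a device mesh before this hook runs; the
            # "tensor_parallel" dimension name is assumed here for illustration.
            tp_mesh = self.device_mesh["tensor_parallel"]
            # Shard w1 column-wise and w2 row-wise using DTensor tensor parallelism.
            plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
            parallelize_module(self.model, tp_mesh, plan)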
- Parameters:
  - data_parallel_size (Union[Literal['auto'], int]) – The number of devices within a data-parallel group. Defaults to "auto", which sets this size to the number of nodes in the cluster.
  - tensor_parallel_size (Union[Literal['auto'], int]) – The number of devices within a tensor-parallel group. Defaults to "auto", which sets this size to the number of GPUs in a single node.
  - save_distributed_checkpoint (bool) – If True, each rank saves its shard of weights and optimizer states to a file. The checkpoint is a folder with as many files as the world size. If False, the full weights and optimizer states get assembled on rank 0 and saved to a single file.
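For example, the following sketch configures the strategy for a hypothetical cluster of 2 nodes with 4 GPUs each; the explicit sizes simply spell out what "auto" would infer for that shape.

    import lightning as L
    from lightning.pytorch.strategies import ModelParallelStrategy

    # Hypothetical cluster: 2 nodes x 4 GPUs. FSDP2 shards across the
    # data-parallel dimension; DTensor tensor parallelism spans each node.
    strategy = ModelParallelStrategy(
        data_parallel_size=2,              # "auto" would infer this from the number of nodes
        tensor_parallel_size=4,            # "auto" would infer this from the GPUs per node
        save_distributed_checkpoint=True,  # each rank writes its own shard
    )

    trainer = L.Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy=strategy)
    # trainer.fit(LitModel(), datamodule)  # the model applies its plan in `configure_model`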
- barrier(name=None)[source]
Synchronizes all processes, blocking them until the whole group enters this function.
- broadcast(obj, src=0)[source]
Broadcasts an object to all processes.
- lightning_module_state_dict()[source]
Collects the state dict of the model.
Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.
- optimizer_state(optimizer)[source]
Collects the state of the given optimizer.
Only returns a non-empty state dict on rank 0 if save_distributed_checkpoint=False.
- reduce(tensor, group=None, reduce_op='mean')[source]
Reduces the given tensor (e.g. across GPUs/processes).
- save_checkpoint(checkpoint, filepath, storage_options=None)[source]
Save model/training states as a checkpoint file through state-dump and file-write.
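The on-disk layout of the saved checkpoint depends on save_distributed_checkpoint. A sketch of what to expect (the path is hypothetical):

    # With save_distributed_checkpoint=True (the default), the given path becomes
    # a directory containing one shard file per rank, written in parallel.
    trainer.save_checkpoint("checkpoints/step-1000.ckpt")

    # With save_distributed_checkpoint=False, rank 0 assembles the full weights
    # and optimizer states and writes a single consolidated checkpoint file.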
- setup(trainer)[source]
Sets up the accelerator, plugins and initializes the optimizers (if needed).
- setup_environment()[source]
Set up any processes or distributed connections.
This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.
- Return type: None
- setup_optimizers(trainer)[source]
Creates optimizers and schedulers.
- teardown()[source]
This method is called to tear down the training process.
It is the right place to release memory and free other resources.
- Return type: None
- tensor_init_context(empty_init=None)[source]
Controls how tensors get created (device, dtype).
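A sketch of the context manager's effect. Inside a Trainer run the strategy enters this context around model creation for you, so calling it directly as shown here is only for illustration; the empty_init behavior described in the comment is an assumption of this sketch.

    import torch.nn as nn

    strategy = trainer.strategy  # assumed to be a ModelParallelStrategy instance
    with strategy.tensor_init_context(empty_init=True):
        # With empty_init=True, parameters are created without materializing
        # initial values, which is useful when the weights are loaded from a
        # checkpoint immediately after creation.
        layer = nn.Linear(4096, 4096)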
- property lightning_restore_optimizer: bool
Override to disable Lightning restoring optimizers/schedulers.
This is useful for strategies which manage restoring optimizers/schedulers.
- property restore_checkpoint_after_setup: bool
Override to delay restoring from checkpoint until after the setup phase has completed. This is useful when the strategy requires all the setup hooks to run before loading the checkpoint.
- Returns:
If True, restore the checkpoint after strategy setup.
- property root_device: device
Return the root device.