IPUStrategy

class pytorch_lightning.strategies.IPUStrategy(accelerator=None, device_iterations=1, autoreport=False, autoreport_dir=None, parallel_devices=None, cluster_environment=None, checkpoint_io=None, precision_plugin=None, training_opts=None, inference_opts=None)

Bases: pytorch_lightning.strategies.parallel.ParallelStrategy

Plugin for training on IPU devices.

Parameters:

device_iterations – Number of iterations to run on the device at once before returning to the host. This can be used as an optimization to speed up training. See https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html
autoreport – Enable auto-reporting for IPUs using PopVision. See https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html
autoreport_dir – Optional directory in which to store the autoReport output.
training_opts – Optional poptorch.Options that override the default options created for training.
inference_opts – Optional poptorch.Options that override the default options created for validation, testing, and predicting.
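
As a minimal sketch of how these parameters might be passed, the snippet below collects the keyword arguments and builds a Trainer only on request. The helper names `ipu_strategy_kwargs` and `build_trainer`, the `device_iterations >= 1` check, and the `report_dir/` path are illustrative assumptions and not part of the library. The Lightning imports are deferred because they need pytorch_lightning, Graphcore's poptorch, and IPU hardware.

```python
def ipu_strategy_kwargs(device_iterations=1, autoreport=False, autoreport_dir=None):
    """Collect keyword arguments for IPUStrategy (hypothetical helper)."""
    # Assumption: fewer than one iteration per host round trip makes no sense.
    if device_iterations < 1:
        raise ValueError("device_iterations must be >= 1")
    kwargs = {"device_iterations": device_iterations, "autoreport": autoreport}
    if autoreport_dir is not None:
        kwargs["autoreport_dir"] = autoreport_dir
    return kwargs


def build_trainer(num_ipus, **strategy_kwargs):
    """Create a Trainer that uses IPUStrategy (hypothetical helper)."""
    # Imported lazily: requires pytorch_lightning, poptorch and IPU hardware.
    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import IPUStrategy

    strategy = IPUStrategy(**strategy_kwargs)
    return Trainer(accelerator="ipu", devices=num_ipus, strategy=strategy)


# Run 32 iterations on the device per host round trip and write a report.
kwargs = ipu_strategy_kwargs(
    device_iterations=32, autoreport=True, autoreport_dir="report_dir/"
)
# On an IPU machine: trainer = build_trainer(8, **kwargs)
```

Keeping the argument handling separate from the construction lets the configuration be checked on a machine without IPUs.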

all_gather(tensor, group=None, sync_grads=False)

Perform an all_gather on all processes.

Return type: Tensor
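
To illustrate what an all_gather means in general, here is a pure-Python simulation in which each rank owns one value and ends up with the full, rank-ordered list. The function `simulate_all_gather` is a hypothetical teaching aid; it does not describe how IPUStrategy implements this method.

```python
from typing import List, Sequence


def simulate_all_gather(per_rank_values: Sequence[float]) -> List[List[float]]:
    """Return what each rank would hold after an all_gather.

    per_rank_values[i] is the value owned by rank i before the call.
    """
    gathered = list(per_rank_values)
    # Every rank receives its own copy of the complete, rank-ordered list.
    return [list(gathered) for _ in per_rank_values]


results = simulate_all_gather([0.5, 1.5, 2.5])
for rank, held in enumerate(results):
    print(f"rank {rank} holds {held}")
```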

barrier(name=None)

Synchronizes all processes by blocking each one until the whole group enters this function.

Parameters: name (Optional[str]) – an optional name to pass into barrier.
Return type: None
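
The blocking behaviour described above can be sketched with the standard library's threading.Barrier, using threads as stand-ins for processes. The group size and the `worker` function are assumptions for illustration; this is an analogy for barrier semantics, not the strategy's implementation.

```python
import threading

NUM_WORKERS = 4  # hypothetical group size
events = []  # list.append is thread-safe in CPython
barrier = threading.Barrier(NUM_WORKERS)


def worker(rank: int) -> None:
    events.append(("arrived", rank))
    # Blocks until all NUM_WORKERS threads have called wait().
    barrier.wait()
    events.append(("released", rank))


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(NUM_WORKERS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# No thread is released before the last one has arrived, so every
# "arrived" entry precedes every "released" entry in `events`.
```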

batch_to_device(batch, device=None, dataloader_idx=0)

Moves the batch to the correct device.

The returned batch is of the same type as the input batch, just having all tensors on the correct device.

Parameters:

batch (Any) – The batch of samples to move to the correct device
device (Optional[device]) – The target device
dataloader_idx (int) – The index of the dataloader to which the batch belongs.

Return type: Any
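
The type-preserving behaviour can be sketched as a recursive walk over the batch. To keep the example runnable without PyTorch, `FakeTensor` is a stand-in for torch.Tensor, and `move_batch` and the device label `"device-0"` are hypothetical names. The sketch handles plain dicts, lists, and tuples only and is not the library's implementation.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""

    data: tuple
    device: str = "cpu"

    def to(self, device: str) -> "FakeTensor":
        # Like Tensor.to, return a new object rather than mutating in place.
        return replace(self, device=device)


def move_batch(batch: Any, device: str) -> Any:
    """Move every FakeTensor in a nested batch, keeping container types."""
    if isinstance(batch, FakeTensor):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_batch(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        # Rebuild the same container type (namedtuples are not handled here).
        return type(batch)(move_batch(item, device) for item in batch)
    return batch  # non-tensor leaves (ints, strings, ...) pass through unchanged


batch = {"x": FakeTensor((1, 2)), "meta": ["id-0", (FakeTensor((3,)), 7)]}
moved = move_batch(batch, "device-0")
```

Here `moved` has the same dict, list, and tuple layout as `batch`, and the original batch is left untouched.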

broadcast(obj, src=0)

Broadcasts an object to all processes.

Parameters:

obj (TypeVar(TBroadcast)) – the object to broadcast
src (int) – source rank

Return type: TypeVar(TBroadcast)

model_to_device()

Moves the model to the correct device.

Return type: None

on_predict_end()

Called when predict ends.

Return type: None

on_predict_start()

Called when predict begins.

Return type: None

on_test_end()

Called when test ends.

Return type: None

on_test_start()

Called when test begins.

Return type: None

on_train_batch_start(batch, batch_idx)

Called in the training loop before anything happens for that batch.

Return type: None

on_train_end()

Called when train ends.

Return type: None

on_train_start()

Called when train begins.

Return type: None

on_validation_end()

Called when validation ends.

Return type: None

on_validation_start()

Called when validation begins.

Return type: None

predict_step(*args, **kwargs)

The actual predict step.

See LightningModule.predict_step() for more details.

Return type: Union[Tensor, Dict[str, Any]]

reduce(tensor, *args, **kwargs)

Reduces the given tensor (e.g. across GPUs/processes).

Parameters:

tensor (Union[Tensor, Any]) – the tensor to sync and reduce
group – the process group to reduce
reduce_op – the reduction operation. Defaults to 'mean'. Can also be the string 'sum' or a ReduceOp.

Return type: