
Fault-tolerant Training (basic)

Audience: Users who want to run in the cloud or in a cluster environment.


What is fault-tolerant training?

When developing models in the cloud or in cluster environments, a software or hardware failure (i.e., a fault) may force you to restart training from scratch. Fault-tolerant training in Lightning is meant to let a run recover from such failures instead.

With fault-tolerant training, when Trainer.fit() fails in the middle of an epoch, during either training or validation, Lightning restarts exactly where it failed and restores everything, down to the batch it was on, even if the dataset was shuffled.
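To make the idea of resuming on the exact batch of a shuffled dataset concrete, here is a minimal, framework-free sketch. It is not Lightning's implementation and uses no Lightning API. The names `shuffled_batches`, `run`, `checkpoint`, `log`, and `fail_at` are hypothetical and exist only for this illustration. It assumes the shuffle order can be regenerated from a seed and the epoch number, so a checkpoint only needs to record which epoch and batch come next.

```python
import random


def shuffled_batches(n_items, batch_size, seed, epoch):
    """Return the batches for one epoch; the same (seed, epoch) gives the same order."""
    order = list(range(n_items))
    random.Random(seed * 1000 + epoch).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_items, batch_size)]


def run(n_items, batch_size, seed, n_epochs, checkpoint, log, fail_at=None):
    """Toy training loop (illustrative only).

    `checkpoint` is a dict holding the next (epoch, batch) to process.
    `log` collects every processed batch so two runs can be compared.
    `fail_at` is an optional (epoch, batch) pair at which a fault is simulated.
    """
    start_epoch = checkpoint.get("epoch", 0)
    start_batch = checkpoint.get("batch", 0)
    for epoch in range(start_epoch, n_epochs):
        batches = shuffled_batches(n_items, batch_size, seed, epoch)
        # Skip already-processed batches only in the epoch we are resuming.
        first = start_batch if epoch == start_epoch else 0
        for b in range(first, len(batches)):
            if fail_at == (epoch, b):
                raise RuntimeError("simulated fault")
            log.append((epoch, b, batches[b]))
            # Record progress after each batch so a restart knows where to resume.
            checkpoint["epoch"], checkpoint["batch"] = epoch, b + 1


# Reference: one uninterrupted run.
reference = []
run(10, 3, seed=0, n_epochs=2, checkpoint={}, log=reference)

# Interrupted run: fails in the middle of the second epoch, then resumes.
checkpoint, log = {}, []
try:
    run(10, 3, seed=0, n_epochs=2, checkpoint=checkpoint, log=log, fail_at=(1, 2))
except RuntimeError:
    pass
run(10, 3, seed=0, n_epochs=2, checkpoint=checkpoint, log=log)

print(log == reference)
```

The resumed run processes the same batches in the same order as the uninterrupted one, because the shuffle is regenerated rather than stored. A real system would also need to restore model, optimizer, and other state, which this sketch leaves out.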

Warning

Fault-tolerant Training is currently an experimental feature within Lightning.


© Copyright (c) 2018-2023, Lightning AI et al...