
Fault-tolerant Training (expert)

Audience: Experts looking to enable and handle their own fault-tolerance.

Pre-requisites: Users must have first read Fault-tolerant Training (basic)


Enable fault-tolerant behavior anywhere

To enable fault tolerance on your own cloud or cluster environment, set the PL_FAULT_TOLERANT_TRAINING environment variable:

PL_FAULT_TOLERANT_TRAINING=1 python script.py

Although Lightning will now be fault-tolerant, you’ll have to handle the restart logic yourself, including making sure that a failed training run is automatically restarted.
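One way to handle restarts is a small supervisor process that relaunches the training script whenever it exits with an error. The following is a minimal sketch, not something Lightning provides: the helper name run_with_restarts and its parameters max_restarts and backoff_seconds are hypothetical, and it assumes that a non-zero exit code means the run failed and should be retried. It uses only the Python standard library and does not show how a restarted run locates its checkpoint.

```python
import os
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, backoff_seconds=0.0):
    """Run `cmd` with PL_FAULT_TOLERANT_TRAINING=1, restarting on failure.

    Hypothetical helper (not part of Lightning). Returns a tuple
    (exit_code, attempts), where exit_code is that of the last attempt.
    """
    # Copy the current environment and enable fault-tolerant training.
    env = dict(os.environ, PL_FAULT_TOLERANT_TRAINING="1")
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.run(cmd, env=env).returncode
        # Stop on success, or once the restart budget is used up.
        if exit_code == 0 or attempts > max_restarts:
            return exit_code, attempts
        # Wait before relaunching so transient failures can clear.
        time.sleep(backoff_seconds)


# Example call (assumes script.py is in the working directory):
# run_with_restarts([sys.executable, "script.py"], max_restarts=3, backoff_seconds=5.0)
```

On a managed cluster, the scheduler's own requeue or restart policy may be a better fit than a wrapper like this. The sketch only illustrates the retry loop you would otherwise need to provide.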


© Copyright (c) 2018-2023, Lightning AI et al...
