Glossary

2D Parallelism: Combine Tensor Parallelism with FSDP (2D Parallel) to train efficiently on 100s of GPUs
Accelerators: Accelerators connect the Trainer to hardware to train faster
Callback: Add self-contained extra functionality during training execution
Checkpointing: Save and load progress with checkpoints
Cluster: Run on your own group of servers
Cloud checkpoint: Save your models to cloud filesystems
Compile: Use torch.compile to speed up models on modern hardware
Console Logging: Capture more visible logs
Debugging: Fix errors in your code
DeepSpeed: Distribute models with billions of parameters across hundreds of GPUs
Distributed Checkpoints: Save and load very large models efficiently with distributed checkpoints
Early stopping: Stop the training when no improvement is observed
Experiment manager (Logger): Tools for tracking and visualizing artifacts and logs
Finetuning: Technique for training pretrained models
FSDP: Distribute models with billions of parameters across hundreds of GPUs
GPU: Graphics Processing Unit for faster training
Half precision: Using different numerical formats to save memory and run faster
HPU: Habana Gaudi AI Processor Unit for faster training
Inference: Making predictions by applying a trained model to unlabeled examples
Lightning CLI: A Command-line Interface (CLI) to interact with Lightning code via a terminal
LightningDataModule: A shareable, reusable class that encapsulates all the steps needed to process data
LightningModule: A base class organizing your neural network module
Log: Outputs or results used for visualization and tracking
Metrics: A statistic used to measure performance or other objectives we want to optimize
Model: The set of parameters and structure for a system to make predictions
Model Parallelism: A way to scale training that splits a model between multiple devices
Plugins: Custom trainer integrations such as custom precision, checkpointing or cluster environment implementations
Progress bar: Output printed to the terminal to visualize the progression of training
Production: Using ML models in real world systems
Prediction: Computing a model's output
Pretrained models: Models that have already been trained for a particular task
Profiler: Tool to identify bottlenecks and performance of different parts of a model
Pruning: A technique to eliminate some of the model weights to reduce the model size and decrease inference requirements
Quantization: A technique to accelerate the model inference speed and decrease the memory load while still maintaining the model accuracy
Remote filesystem and FSSPEC: Accessing files from cloud storage providers
Strategy: Ways the trainer controls the model distribution across training, evaluation, and prediction
Strategy registry: A class that holds information about training strategies and allows adding new custom strategies
Style guide: Best practices to improve readability and reproducibility
SWA: Stochastic Weight Averaging (SWA) can make your models generalize better
SLURM: Simple Linux Utility for Resource Management, or simply Slurm, is a free and open-source job scheduler for Linux clusters
Tensor Parallelism: Parallelize the computation of model layers across multiple GPUs, reducing memory usage and communication overhead
Transfer learning: Using pre-trained models to improve learning
Trainer: The class that automates and customizes model training (a minimal usage sketch follows this list)
Torch distributed: Setup for running on distributed environments
Warnings: Disable false-positive warnings emitted by Lightning
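
To show how a few of these terms relate in practice, here is a minimal sketch (not part of the glossary itself, and assuming the current "lightning" package layout): a LightningModule organizes the model and training step, a Callback (EarlyStopping) adds self-contained behavior, self.log records a metric, and the Trainer automates the loop. The LitAutoEncoder class is purely illustrative:

    import torch
    from torch import nn
    import lightning as L
    from lightning.pytorch.callbacks import EarlyStopping


    class LitAutoEncoder(L.LightningModule):
        """LightningModule: organizes the model, training step, and optimizer."""

        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
            self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

        def training_step(self, batch, batch_idx):
            x, _ = batch
            x = x.view(x.size(0), -1)
            loss = nn.functional.mse_loss(self.decoder(self.encoder(x)), x)
            # Log: track the training loss so loggers and callbacks can see it
            self.log("train_loss", loss, on_step=False, on_epoch=True)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    # Trainer: automates the training loop; EarlyStopping is a Callback that
    # stops training when the monitored metric stops improving; accelerator="auto"
    # picks the available hardware (GPU, CPU, ...).
    trainer = L.Trainer(
        max_epochs=10,
        accelerator="auto",
        callbacks=[EarlyStopping(monitor="train_loss", mode="min", check_on_train_epoch_end=True)],
    )
    # trainer.fit(LitAutoEncoder(), train_dataloaders=...)  # supply your own DataLoader here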