Lightning 1.8: Colossal-AI, Secrets for Apps, and more

We’re excited to announce the release of Lightning 1.8 ⚡️ (release notes).

Lightning v1.8 is the culmination of work from 52 contributors who added features, fixed bugs, and improved documentation across more than 550 commits since 1.7.0.

Highlights

  • Colossal-AI strategy
  • Secrets for Lightning Apps
  • CLI Commands for Lightning Apps
  • Auto-wrapping for FSDP

These four highlights of the Lightning 1.8 release focus on making your machine learning development faster, easier, and more powerful than ever before. As models grow larger, more complex, and more resource-hungry to train, you need to be able to train them on increasingly flexible hardware setups without sacrificing speed or performance.

 

. . .

 

Colossal-AI

Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:
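(A minimal sketch, assuming the `colossalai` package is installed; `MyModel` stands in for your own LightningModule.)

```python
import pytorch_lightning as pl

# Select the Colossal-AI strategy by its registered name. Note that the
# strategy expects a Colossal-AI optimizer (e.g. HybridAdam) to be returned
# from configure_optimizers.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    precision=16,
    strategy="colossalai",
)
trainer.fit(MyModel())  # MyModel: your own LightningModule
```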

You can find Colossal-AI’s benchmarks with Lightning on GPT-2 here.

Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:

  • Data Parallelism
  • Pipeline Parallelism
  • 1D, 2D, 2.5D, 3D Tensor Parallelism
  • Sequence Parallelism
  • Zero Redundancy Optimization

Learn how to install and use Colossal-AI effectively with Lightning here.

NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.

 

. . .

 

Secrets for Lightning Apps

Introducing encrypted secrets, a feature requested by Lightning App users 🎉!

Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.

  1. Add a secret to your Lightning account.
  2. Read the secret in your app through an environment variable.
  3. Pass the secret to your app run from the command line (both steps are sketched below).
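A minimal sketch of steps 2 and 3, where `API_KEY` and `my-api-key` are placeholder names for the environment variable and the secret:

```python
import os

# Step 2: inside the app, read the secret from the environment variable it
# is bound to when the app is launched.
api_key = os.environ.get("API_KEY")
```

```bash
# Step 3: bind the secret to the environment variable when launching the app
# (flag syntax: --secret <environment-variable>=<secret-name>).
lightning run app app.py --cloud --secret API_KEY=my-api-key
```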

These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.

NOTE: This feature is marked as experimental. Stay tuned for more updates in the future.

 

. . .

 

CLI Commands for Lightning Apps

Introducing CLI commands for apps (#13602)!

As a Lightning App builder, if you want to easily create a CLI interface for users to interact with your app, then this is for you.

Here is an example where users can dynamically create notebooks from the CLI.

All you need to do is implement the configure_commands hook on the LightningFlow:
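(A minimal sketch, using a hypothetical `create-notebook` command that appends a name to the flow’s state.)

```python
from lightning import LightningApp, LightningFlow


class Flow(LightningFlow):
    def __init__(self):
        super().__init__()
        self.notebook_names = []

    def create_notebook(self, name: str):
        # Runs when a user invokes the configured CLI command.
        self.notebook_names.append(name)

    def configure_commands(self):
        # Map CLI command names to flow methods (or ClientCommand objects).
        return [{"create-notebook": self.create_notebook}]

    def run(self):
        pass


app = LightningApp(Flow())
```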

Once the app is running with lightning run app app.py, you can connect to the app with the following command:
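(Sketch for a locally running app; for a cloud app, you would pass its name instead of `localhost`.)

```bash
lightning connect localhost
```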

and run the command that was configured:
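(Using the hypothetical `create-notebook` command from the sketch above; the method’s arguments become CLI flags.)

```bash
lightning create-notebook --name=my-notebook
```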

NOTE: This feature is marked as experimental. Stay tuned for more updates in the future.

 

. . .

 

Auto-wrapping for FSDP Strategy

In Lightning v1.7, we introduced an integration for PyTorch’s Fully Sharded Data Parallel (FSDP) in the form of our FSDP strategy, which allows you to train huge models with billions of parameters across hundreds of GPUs and machines:
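(A minimal sketch; `MyModel` stands in for your own LightningModule, and `fsdp_native` was the registered name of the native FSDP strategy at the time.)

```python
import pytorch_lightning as pl

# Shard model parameters, gradients, and optimizer states across GPUs.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    precision=16,
    strategy="fsdp_native",
)
trainer.fit(MyModel())  # MyModel: your own LightningModule
```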

We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).

Here are some examples:

Case 1: Model is so large that it does not fit into CPU memory.
Construct your layers in the configure_sharded_model hook and wrap the large ones you want to shard across GPUs:
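(A minimal sketch following that pattern; the layer sizes are illustrative.)

```python
import torch
import pytorch_lightning as pl
from torch.distributed.fsdp.wrap import wrap


class MassiveModel(pl.LightningModule):
    def configure_sharded_model(self):
        # Layers are created here rather than in __init__, so they never have
        # to fit into memory all at once; wrap() marks the large blocks to be
        # sharded across GPUs by FSDP.
        self.block = wrap(torch.nn.Linear(4096, 4096))
```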

Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP’s policy:
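(A minimal sketch; `training_step` and `configure_optimizers` are omitted for brevity.)

```python
import torch
import pytorch_lightning as pl


class BigModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Define layers as usual; with auto-wrapping, the FSDP strategy wraps
        # them for you using PyTorch's size-based auto-wrap policy.
        self.block = torch.nn.Sequential(
            torch.nn.Linear(4096, 4096),
            torch.nn.ReLU(),
            torch.nn.Linear(4096, 4096),
        )

    # training_step / configure_optimizers as usual ...


# No special wrapping code is needed anymore:
trainer = pl.Trainer(accelerator="gpu", devices=4, precision=16, strategy="fsdp_native")
```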

Case 3: Model fits into GPU memory. No action required, use any strategy you want.

Note: if you want to manually wrap layers for more control, you can still do that!

Read more about FSDP and how layer wrapping works in our docs.

 

. . .

 

New Tuner Callbacks

In this release, we focused on Tuner improvements and introduced two new callbacks that can help you customize the batch size finder and learning rate finder as per your use case.

Batch Size Finder (#11089)

  1. You can customize the BatchSizeFinder callback to run at different epochs. This is useful when fine-tuning models, since you can’t always use the same batch size after unfreezing the backbone.
  2. You can run the batch size finder for validate/test/predict (see the sketch after this list).
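A minimal sketch of both points, following the pattern in the docs (`FineTuneBatchSizeFinder` and the `milestones` argument are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import BatchSizeFinder


class FineTuneBatchSizeFinder(BatchSizeFinder):
    def __init__(self, milestones, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.milestones = milestones

    def on_fit_start(self, *args, **kwargs):
        return  # skip the single default run at the start of fit

    def on_train_epoch_start(self, trainer, pl_module):
        # Re-run the batch size finder at selected epochs, e.g. right after
        # unfreezing the backbone.
        if trainer.current_epoch in self.milestones:
            self.scale_batch_size(trainer, pl_module)


trainer = Trainer(callbacks=[FineTuneBatchSizeFinder(milestones=(0, 10))])

# The plain callback can also be attached for evaluation runs:
# trainer = Trainer(callbacks=[BatchSizeFinder()])
# trainer.validate(model)  # or trainer.test(...) / trainer.predict(...)
```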

 

. . .

 

Learning Rate Finder (#13802)

You can now customize the LearningRateFinder callback to run at different intervals, which is useful when fine-tuning models, for example.
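A minimal sketch, following the pattern in the docs (`FineTuneLearningRateFinder` and `milestones` are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateFinder


class FineTuneLearningRateFinder(LearningRateFinder):
    def __init__(self, milestones, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.milestones = milestones

    def on_fit_start(self, *args, **kwargs):
        return  # skip the single default run at the start of fit

    def on_train_epoch_start(self, trainer, pl_module):
        # Re-run the learning rate finder at the given epochs, e.g. after
        # unfreezing part of the model.
        if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
            self.lr_find(trainer, pl_module)


trainer = Trainer(callbacks=[FineTuneLearningRateFinder(milestones=(5, 10))])
```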

 

. . .

 

LightningCLI Improvements

Even though the LightningCLI class is designed to help implement command line tools, there are cases where it is more desirable to run it directly from Python. In Lightning 1.8, you can now do this (#14596):
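(A minimal sketch; `MyModel` and `MyDataModule` stand in for your own classes, and the key addition in 1.8 is the `args` parameter.)

```python
from pytorch_lightning.cli import LightningCLI


def cli_main(args=None):
    # args=None falls back to sys.argv, so the same entry point works both as
    # a console script and as a plain Python function.
    cli = LightningCLI(MyModel, MyDataModule, args=args)
```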

Anywhere in your program, you can now call the CLI directly:
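(The arguments shown are illustrative.)

```python
# e.g. from a test, a notebook, or another script:
cli_main(["fit", "--trainer.max_epochs=1", "--trainer.limit_train_batches=2"])
```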

Learn about all features of the LightningCLI!

 

. . .

 

Improvements to SLURM Support

Multi-node training on a SLURM cluster has been supported since the inception of the Lightning Trainer and has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we’ve added two quality-of-life improvements:

  • The preemption/termination signal is now configurable (#14626); see the sketch after this list.
  • Automatic requeuing of jobs now also works for array jobs (#15040)! Array jobs are a convenient way to group/launch several scripts at once. When the SLURM scheduler interrupts your jobs, Lightning will save a checkpoint, resubmit a new job, and, once the scheduler allocates resources, the Trainer will resume from where it left off.
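A minimal sketch of the configurable signal (the `requeue_signal` argument follows #14626; SIGHUP is just an example):

```python
import signal

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Listen for SIGHUP instead of the default SIGUSR1 when SLURM preempts or
# terminates the job.
trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])
```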

Read more about our SLURM integration here.