Lightning 1.7: Apple Silicon, Multi-GPU and more

We’re excited to announce the release of PyTorch Lightning 1.7 ⚡️ (release notes!).

PyTorch Lightning 1.7 is the culmination of work from 106 contributors who have worked on features, bug fixes, and documentation, for a total of more than 492 commits since 1.6.0.


  • Support for Apple Silicon
  • Native FSDP
  • Newly-enabled support for multi-GPU in notebooks
  • Collaborative training

In addition to a host of bug fixes as well as feature upgrades and implementations, these four highlights embody the latest and greatest aspects of PyTorch Lightning. As models get larger, more complex, and require more resources to train, we all need the ability to train ever-expanding models with more flexible requirements for hardware without sacrificing speed and performance.

Our mission has always been to make machine learning faster, easier, and more accessible, and these four key points of PyTorch Lightning 1.7 reflect that goal. Whether it’s new training strategies or novel ways to interact with your projects, Lightning enables you to build faster for less money.

. . .

Apple Silicon Support

What it is: Accelerated GPU training on Apple M1/M2 machines
Why we built it: Apple’s Metal Performance Shaders (MPS) framework helps you more easily extract data from images, run neural networks, and more.

For those using PyTorch 1.12 on M1 or M2 Apple machines, we have created the MPSAccelerator. MPSAccelerator enables accelerated GPU training on Apple’s Metal Performance Shaders (MPS) as a backend process.


Support for this accelerator is currently marked as experimental in PyTorch. Because many operators are still missing, you may run into a few rough edges.
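
As a minimal sketch of selecting this accelerator, assuming PyTorch 1.12 and Lightning 1.7 are installed on an M1/M2 machine (the helper name build_mps_trainer is ours, for illustration only):

```python
# Keyword arguments that select the MPS accelerator on a single Apple GPU.
MPS_TRAINER_KWARGS = {"accelerator": "mps", "devices": 1}


def build_mps_trainer():
    """Create a Trainer that runs on Apple's MPS backend."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning import Trainer

    return Trainer(**MPS_TRAINER_KWARGS)
```

Calling build_mps_trainer() on a machine without MPS support is expected to fail, so treat this as a starting point rather than a portable default.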

. . .

Native Fully Sharded Data Parallel Strategy

What it is: Support for FSDP directly within PyTorch
Why we built it: Native support makes training large models easier and saves you time.

PyTorch 1.12 also added native support for Fully Sharded Data Parallel (FSDP). Previously, Lightning enabled this by using the fairscale project. You can now choose between the two implementations.


Support for this strategy is marked as beta in PyTorch.
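
The choice between the two implementations comes down to the strategy name passed to the Trainer. The sketch below assumes the names registered in Lightning 1.7 ("fsdp_native" for the PyTorch implementation, "fsdp" for fairscale); the helper functions and the device count of 4 are illustrative assumptions:

```python
def fsdp_strategy_name(native=True):
    """Return the Lightning 1.7 strategy name for the chosen FSDP implementation."""
    # "fsdp_native" uses PyTorch's own FSDP (requires PyTorch 1.12+);
    # "fsdp" keeps using the fairscale implementation.
    return "fsdp_native" if native else "fsdp"


def build_fsdp_trainer(native=True, devices=4):
    """Create a GPU Trainer with the selected FSDP strategy."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning import Trainer

    return Trainer(
        accelerator="gpu",
        devices=devices,
        strategy=fsdp_strategy_name(native),
    )
```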

. . .

A Collaborative Training strategy using Hivemind

What it is: Easily train across multiple machines.
Why we built it: Collaborative training removes the need for, and cost of, training across multiple expensive GPUs.

Collaborative Training solves the need for top-tier multi-GPU servers by allowing you to train across unreliable machines such as local ones or even preemptible cloud compute across the Internet.

Under the hood, we use Hivemind, which provides decentralized training across the Internet.

For more information, check out the docs.

. . .

Distributed support in Jupyter Notebooks

What it is: Scale to multiple devices, even when prototyping in Jupyter.
Why we built it: Distributed training means faster training — now available in Jupyter Notebooks.

So far, the only multi-GPU strategy supported in Jupyter notebooks (including Grid.ai, Google Colab, and Kaggle, for example) has been the Data-Parallel (DP) strategy (strategy="dp"). DP, however, has several limitations that often obstruct users’ workflows. It can be slow, it’s incompatible with TorchMetrics, it doesn’t persist state changes on replicas, and it’s difficult to use with non-primitive input and output structures.

In this release, we’ve added support for Distributed Data Parallel in Jupyter notebooks using the fork mechanism to address these shortcomings. This is only available for macOS and Linux (sorry Windows!).


This feature is experimental.

This is how you use multi-device in notebooks now:
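
A minimal sketch of such a configuration, assuming a machine with two GPUs. The helper names are ours; "ddp_notebook" is the strategy name we understand Lightning 1.7 to register for the fork-based mechanism, and passing it is optional:

```python
def notebook_trainer_kwargs(devices=2, explicit_strategy=False):
    """Build Trainer keyword arguments for multi-GPU training in a notebook."""
    kwargs = {"accelerator": "gpu", "devices": devices}
    if explicit_strategy:
        # Name the fork-based notebook strategy instead of relying on auto-detection.
        kwargs["strategy"] = "ddp_notebook"
    return kwargs


def build_notebook_trainer(**options):
    """Create the Trainer inside a notebook cell."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning import Trainer

    return Trainer(**notebook_trainer_kwargs(**options))
```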

By default, the Trainer detects the interactive environment and selects the right strategy for you. Learn more in the full documentation.

. . .

Other New Features

Versioning of “last” checkpoints

If a run is configured to save to the same directory as a previous run and ModelCheckpoint(save_last=True) is enabled, the “last” checkpoint is now versioned with a simple -v1 suffix to avoid overwriting the existing “last” checkpoint. This mimics the behavior for checkpoints that monitor a metric.

Automatically reload the “last” checkpoint

In certain scenarios, like when running in a cloud spot instance with fault-tolerant training enabled, it is useful to load the latest available checkpoint. It is now possible to pass the string ckpt_path="last" in order to load the latest available checkpoint from the set of existing checkpoints.
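
For illustration, a small wrapper around this call might look as follows. The function name is hypothetical; trainer, model, and datamodule stand for your own Trainer, LightningModule, and optional LightningDataModule:

```python
def fit_from_last_checkpoint(trainer, model, datamodule=None):
    """Run trainer.fit, asking Lightning to load the latest available checkpoint."""
    # ckpt_path="last" asks the Trainer to pick the most recent checkpoint
    # from the set of existing checkpoints.
    trainer.fit(model, datamodule=datamodule, ckpt_path="last")
```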


Validation every N batches across epochs

In some cases, for example iteration-based training, it is useful to run validation every N training batches without being limited by the epoch boundary. Now, you can enable validation based on the total number of training batches.

For example, given 5 epochs of 10 batches, setting N=25 would run validation in the 3rd and 5th epoch.
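
The arithmetic behind this example can be checked with a few lines of plain Python. The function below is only an illustration of the counting, not Lightning code; the Trainer arguments in the dictionary are the ones we understand to enable this behavior in 1.7:

```python
# Trainer arguments we understand to enable validation every 25 total batches.
VALIDATION_KWARGS = {"val_check_interval": 25, "check_val_every_n_epoch": None}


def validation_epochs(num_epochs, batches_per_epoch, every_n_batches):
    """Return the 1-based epochs in which validation runs."""
    total_batches = num_epochs * batches_per_epoch
    epochs = []
    # Validation triggers after global batch N, 2N, 3N, ...
    for global_batch in range(every_n_batches, total_batches + 1, every_n_batches):
        # Global batch k (1-based) falls in epoch ceil(k / batches_per_epoch).
        epochs.append((global_batch - 1) // batches_per_epoch + 1)
    return epochs


print(validation_epochs(num_epochs=5, batches_per_epoch=10, every_n_batches=25))  # [3, 5]
```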

CPU stats monitoring

Lightning provides the DeviceStatsMonitor callback to monitor the stats of the hardware currently used. However, users often also want to monitor the stats of other hardware. In this release, we have added an option to additionally monitor CPU stats:

The CPU stats are gathered using the psutil package.
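
A minimal sketch of enabling this option, assuming Lightning 1.7 and psutil are installed (the helper name is ours):

```python
# Option passed to DeviceStatsMonitor to also collect CPU stats.
MONITOR_KWARGS = {"cpu_stats": True}


def build_trainer_with_cpu_stats():
    """Create a Trainer that logs accelerator stats plus CPU stats."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import DeviceStatsMonitor

    return Trainer(callbacks=[DeviceStatsMonitor(**MONITOR_KWARGS)])
```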

Automatic distributed samplers

It is now possible to use custom samplers in a distributed environment without the need to set replace_sampler_ddp=False and wrap your sampler manually with the DistributedSampler.

Inference mode support

PyTorch 1.9 introduced torch.inference_mode, which is a faster alternative to torch.no_grad. Lightning will now use inference_mode wherever possible during evaluation.

Support for warn-level determinism

In PyTorch 1.11, operations that do not have a deterministic implementation can be set to throw a warning instead of an error when run in deterministic mode. This is now supported by our Trainer:
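
A minimal sketch, assuming Lightning 1.7 with PyTorch 1.11 or newer (the helper name is ours):

```python
# deterministic="warn" requests deterministic algorithms but only warns
# when an operation has no deterministic implementation.
DETERMINISM_KWARGS = {"deterministic": "warn"}


def build_warn_level_trainer():
    """Create a Trainer with warn-level determinism."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning import Trainer

    return Trainer(**DETERMINISM_KWARGS)
```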


LightningCLI improvements

After the latest updates to jsonargparse, the library supporting the LightningCLI, there’s now complete support for shorthand notation. This includes automatic support for shorthand notation to all arguments, not just the ones that are part of the registries, plus support inside configuration files.

A header with the version that generated the config is now included.

All subclasses for a given base class can be specified by name, so there’s no need to explicitly register them. The only requirement is that the module where the subclass is defined is imported prior to parsing.

The new version renders the registries and the auto_registry flag, introduced in 1.6.0, unnecessary, so we have deprecated them.

Support was also added for list appending; for example, to add a callback to an existing list that might be already configured:


Callback registration through entry points

Entry Points are an advanced feature in Python’s setuptools that allow packages to expose metadata to other packages. In Lightning, we allow an arbitrary package to include callbacks that the Lightning Trainer can automatically use when installed, without you having to manually add them to the Trainer. This is useful in production environments where it is common to provide specialized monitoring and logging callbacks globally for every application.

A setup.py file for a callbacks plugin package could look something like this:
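
As a hedged sketch: the package name, the factories module, and the my_custom_callbacks_factory function below are all hypothetical, and the entry point group name is the one we understand Lightning to look up. In a real setup.py, the setup() call would sit at module level rather than inside a function:

```python
# Entry point group that Lightning is assumed to scan for callback factories.
ENTRY_POINTS = {
    "pytorch_lightning.callbacks_factory": [
        # Format: [any name]=[module path]:[function name]
        "monitor_callbacks=factories:my_custom_callbacks_factory",
    ]
}


def run_setup():
    """Register the (hypothetical) callbacks plugin package."""
    from setuptools import setup

    setup(
        name="my-callbacks-plugin",  # hypothetical package name
        version="0.0.1",
        install_requires=["pytorch-lightning"],
        entry_points=ENTRY_POINTS,
    )
```

Here, the factory function is expected to return a list of callbacks, which the Trainer then picks up once the package is installed.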

Read more about callback entry points in our documentation.

Rank-zero only EarlyStopping messages

Our EarlyStopping callback implementation, by default, logs the stopping messages on every rank when it’s run in a distributed environment. This was done in case the monitored values were not synchronized. However, some users found this verbose. To avoid this, you can now set a flag:
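
A minimal sketch of setting the flag; the monitored metric name "val_loss" is an assumption, so use whichever metric your model logs:

```python
# log_rank_zero_only=True restricts the stopping messages to rank 0.
EARLY_STOPPING_KWARGS = {"monitor": "val_loss", "log_rank_zero_only": True}


def build_early_stopping():
    """Create an EarlyStopping callback that logs on rank zero only."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning.callbacks import EarlyStopping

    return EarlyStopping(**EARLY_STOPPING_KWARGS)
```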


A base Checkpoint class for extra customization

If you want to customize checkpointing without all the extra functionality that the ModelCheckpoint callback provides, this release adds an empty Checkpoint class for easier inheritance. All internal code checks against the Checkpoint class to ensure everything works properly for custom classes.

Validation now runs in overfitting mode

Setting overfit_batches=N now enables validation and runs N validation batches.


Device Stats Monitoring support for HPUs

The DeviceStatsMonitor callback can now be used to automatically monitor and log device stats during the training stage with Habana devices.


New Hooks


Hyperparameters from a LightningDataModule are now saved to checkpoints and reloaded when training is resumed. And just as you use LightningModule.load_from_checkpoint to load a model from a checkpoint filepath, you can now load a LightningDataModule using the same method.

. . .

Experimental Features


ServableModule and its Servable Module Validator Callback

When serving models in production, it is generally good practice to ensure that the model can be served and optimized before starting training, to avoid wasting money.

To do so, you can import a ServableModule (an nn.Module) and add it as an extra base class to your base model as follows:

To make your model servable, you would need to implement three hooks:

  • configure_payload: Describe the format of the payload (data sent to the server).
  • configure_serialization: Describe the functions used to convert the payload to tensors (de-serialization) and tensors to payload (serialization).
  • serve_step: The method used to transform the input tensors to a dictionary of prediction tensors.

Finally, add the ServableModuleValidator callback to the Trainer to validate that the model is servable in on_train_start. This uses a FastAPI server.

Have a look at the full example here.

Asynchronous Checkpointing

You can now save checkpoints asynchronously using the AsyncCheckpointIO plugin without blocking your training process. To enable this, you can pass an AsyncCheckpointIO plugin to the Trainer.
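
A minimal sketch of passing the plugin, assuming Lightning 1.7 (the helper name is ours, and no other Trainer arguments are set):

```python
def build_async_checkpoint_trainer():
    """Create a Trainer that saves checkpoints without blocking training."""
    # Imported lazily so this module loads even where Lightning is not installed.
    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.io import AsyncCheckpointIO

    # The plugin hands checkpoint writes off to a background thread.
    return Trainer(plugins=[AsyncCheckpointIO()])
```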

Have a look at the full example here.


We’re very excited about this new release, and we hope you enjoy it. Stay tuned for upcoming posts where we will dive deeper into some of the key features of the new release. If you have any feedback, we’d love to hear from you on the Community Slack!