Creating Runs & Attaching Datastores

Overview of Runs

When you’re ready to train your models at scale, you can use Runs. A Run is a collection of experiments.

Runs allow you to scale your machine learning code to hundreds of GPUs and model configurations without changing a single line of it. Grid Runs support all major machine learning frameworks and provide full hyperparameter sweeps, native logging, artifacts, and Spot Instances out of the box.

Runs are “serverless”, which means that you only pay for the time your scripts are actually running. When running on your own infrastructure, this results in massive cost savings.

Grid Runs respect the use of .ignore files, which are used to tell a program which files it should ignore during execution. Grid gives preference to the .gridignore file. In the absence of a .gridignore file, Grid will concatenate the .gitignore and .dockerignore files to determine which files should be ignored. When creating a Run, you do not have to provide these files to the CLI or UI – they are simply expected to reside in the project root directory.
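
For illustration, here is what a .gridignore in the project root might contain. The syntax follows .gitignore conventions, and these entries are only hypothetical examples of files you might want excluded:

# hypothetical .gridignore: files and folders Grid should skip
.git/
__pycache__/
*.ckpt
notebooks/scratch/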

Note: The examples used in this tutorial assume you have already installed and set up Grid. If you haven’t done this already, please visit The First Time You Use Grid to learn more.

How to Create Runs

Runs are customizable and provide serverless compute. Here, we cover all available methods to customize Runs for any specific use case. The examples in this tutorial cover the following:

  1. Creating vanilla Runs
  2. Creating Runs with script dependencies
    1. Handling requirements
    2. Runs with specified requirements.txt
    3. Runs with specified environment.yml
  3. Attaching Datastores to Runs
  4. Interruptible Runs

Creating Vanilla Runs

A “vanilla” Run is a basic Run that only executes a script. The grid-tutorials repo, cloned below, is used in the following example.

git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name hello hello.py

The command above passes a script named hello.py to the Run. The script simply prints ‘hello_world’.
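
The hello.py script itself lives in the grid-tutorials repo; a minimal equivalent would look something like this:

# minimal sketch of a "vanilla" script; the repo's hello.py may differ slightly
print("hello_world")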

For instructions on how to view logs, check out viewing logs produced by Runs.

Creating Runs with Script Dependencies

If you’ve taken a look at the grid-tutorials repo, you may have noticed three things:

  1. It has a requirements.txt in the root directory
  2. There is a directory called “pip”
  3. There is a directory called “conda”

Let’s quickly discuss how Grid handles requirements before touching on each of these topics.

Handling Requirements

Any time you create a Run, Grid attempts to automatically resolve as many dependencies as it can for you. Nested requirements (for example, a requirements.txt that pulls in another file via -r other-requirements.txt) are not currently supported.

We do, however, recommend that your projects have a requirements.txt file in the root.

Runs with Specified requirements.txt

Runs allow you to specify which requirements.txt you want to use for package installation. This is especially useful when your requirements.txt does not live at the project root, or when you have more than one requirements.txt file.

In these cases, you can use the below example as a template for specifying which requirements.txt file should be used for package installation.

git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name specified-requirements-pip --dependency_file ./pip/requirements.txt hello.py

You may have noticed that we did something different here than in the previous example: we used the --dependency_file flag. This flag tells Grid which file should be used for package installation in the Run.

Runs with Specified environment.yml

Runs allow you to specify the environment.yml you want to use for package installation. This is the only way to get Runs to use the Conda package manager without using a config file. 

When running on a non-Linux machine, we recommend exporting your environment with conda env export --from-history before creating a Run, as mentioned in the official Conda documentation. A plain conda env export records dependencies that are specific to your operating system, whereas --from-history only records the packages you explicitly requested, so the resulting file stays portable.

You can use the example below as a template for specifying which environment.yml file should be used for package installation:

git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name specified-requirements-conda --dependency_file ./conda/environment.yml hello.py
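
For reference, an environment.yml passed to --dependency_file is a standard Conda environment file. The contents below are only a hypothetical example:

# hypothetical environment.yml; package names and versions are placeholders
name: grid-example
channels:
  - defaults
dependencies:
  - python=3.9
  - pip
  - pip:
      - pytorch-lightning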

Attaching Datastores to Runs

To speed up training iteration time, you can store your data in a Grid Datastore. Datastores are high-performance, low-latency, versioned datasets. If you have large-scale data, Datastores can resolve blockers in your workflow by eliminating the need to download the large dataset every time your script runs. 

If you haven’t done so already, create a Datastore from the cifar5 dataset using the following commands:

# download
curl https://pl-flash-data.s3.amazonaws.com/cifar5.zip -o cifar5.zip
# unzip
unzip cifar5.zip
grid datastore create cifar5/ --name cifar5

Now let’s mount this Datastore to a Run:

git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name attaching-datastore --datastore_name cifar5 --datastore_version 1 datastore.py --data_dir /datastores/cifar5/1

This code passes a script named datastore.py to the Run. The script prints the top-level contents of the Datastore. You should see the following output in your stdout logs:

['test', 'train']
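
The datastore.py script is part of the grid-tutorials repo; given the flags and output above, a minimal sketch of what it might do is:

# hedged sketch of datastore.py: list the mounted Datastore's top-level contents
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", required=True,
                    help="mount path of the Datastore, e.g. /datastores/cifar5/1")
args = parser.parse_args()

# For the cifar5 Datastore this prints ['test', 'train']
print(sorted(os.listdir(args.data_dir)))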

Interruptible Runs

Interruptible Runs, powered by spot instances, are 50-90% cheaper than on-demand instances, but they can be interrupted at any time if the cloud provider reclaims the machine. Here is how you launch a Run with spot instances:

grid run --use_spot train.py

What happens to your models if the Run gets interrupted? 

Grid keeps all the artifacts that you saved during training, including logs, checkpoints, and other files. This means that if your training script periodically saves checkpoint files with all the state needed to resume training, you can restart the Grid Run from where it was interrupted:

grid run --use_spot train.py --checkpoint_path "https://grid.ai/url/to/ckpt"
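
As an illustration only (the train.py used above is not reproduced in this tutorial, and the model, optimizer, and checkpoint path are placeholders), the periodic-checkpointing logic in plain PyTorch might look like this:

# hedged sketch of resumable checkpointing in plain PyTorch
import argparse
import os

import torch
from torch import nn

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint_path", default=None,
                    help="checkpoint to resume from (treated as a local path here)")
args = parser.parse_args()

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder optimizer
start_epoch = 0

# Restore every piece of state needed to resume after an interruption
if args.checkpoint_path and os.path.exists(args.checkpoint_path):
    ckpt = torch.load(args.checkpoint_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 10):
    x, y = torch.randn(32, 10), torch.randn(32, 1)        # dummy batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically save everything required to pick up where we left off
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch},
               "checkpoint.ckpt")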

Writing the logic for checkpointing and correctly resuming the training loop, however, can be difficult and time-consuming.

PyTorch Lightning removes the need to write all this boilerplate code. In fact, if you implement your training script with PyTorch Lightning, you don’t have to change a single line of code to use interruptible Runs in Grid. All you have to do is add the --auto_resume flag to the grid run command to make your experiments fault-tolerant:

grid run --use_spot --auto_resume train.py
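
For reference, a hypothetical PyTorch Lightning script like the sketch below (the model and data are placeholders; the real train.py is not shown here) needs no extra checkpointing code to benefit from --auto_resume:

# hedged sketch of a PyTorch Lightning training script with no manual checkpointing code
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)       # placeholder model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))  # dummy data
    trainer = pl.Trainer(max_epochs=5)
    trainer.fit(LitRegressor(), DataLoader(dataset, batch_size=32))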

If this Run gets interrupted, PyTorch Lightning will save a fault-tolerant checkpoint automatically; Grid will collect it, provision a new machine, restart the Run for you, and let PyTorch Lightning restore the training state from where it left off. Mind-blowing! Learn more about auto-resuming experiments in Grid or the fault-tolerance feature in PyTorch Lightning.

And that’s it! You can check out other Grid tutorials, or browse the Grid Docs to learn more about anything not covered in this tutorial.

As always, Happy Grid-ing!