Overview of Runs
When you’re ready to train your models at scale, you can use Runs. A Run is a collection of experiments.
Runs allow you to scale your machine learning code to hundreds of GPUs and model configurations without changing a single line of code. Grid Runs support all major machine learning frameworks and enable full hyperparameter sweeps, native logging, artifacts, and Spot Instances out of the box.
Runs are “serverless”, which means that you only pay for the time your scripts are actually running. When running on your own infrastructure, this results in massive cost savings.
Grid Runs respect the use of .ignore files, which tell Grid which files to leave out of a Run. Grid gives preference to the .gridignore file. In the absence of a .gridignore file, Grid will concatenate the .gitignore and .dockerignore files to determine which files should be ignored. When creating a Run, you do not have to provide these files to the CLI or UI; they are simply expected to reside in the project root directory.
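For illustration, a hypothetical .gridignore might look like the following; the patterns are example entries only, not required ones:

# .gridignore -- example patterns only
data/
*.ckpt
.venv/
__pycache__/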
Note: The examples used in this tutorial assume you have already installed and set up Grid. If you haven’t done this already, please visit The First Time You Use Grid to learn more.
How to Create Runs
Runs are customizable and provide serverless compute. Here, we cover all available methods to customize Runs for any specific use case. The examples in this tutorial cover the following:
- Creating vanilla Runs
- Creating Runs with script dependencies
- Handling requirements
- Runs with specified requirements.txt
- Runs with specified environment.yml
- Attaching Datastores to Runs
- Interruptible Runs
Creating Vanilla Runs
A “vanilla” Run is a basic Run that only executes a script. The hello.py script from the grid-tutorials repo will be used in the following example.
git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name hello hello.py
The command above passes a script named hello.py to the Run. The script prints ‘hello_world’.
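For reference, a script like hello.py could be as simple as the minimal sketch below; the actual script in the grid-tutorials repo may differ:

# hello.py -- minimal sketch; the script in grid-tutorials may differ
print("hello_world")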
For instructions on how to view logs, check out viewing logs produced by Runs.
Creating Runs with Script Dependencies
If you’ve taken a look at the grid-tutorials repo, you may have noticed three things:
- It has a requirements.txt in the root directory
- There is a directory called “pip”
- There is a directory called “conda”
Let’s quickly discuss how Grid handles requirements before touching on each of these topics.
Handling Requirements
Any time you create a Run, Grid attempts to resolve as many dependencies as it can automatically for you. Nested requirements are not currently supported.
We do, however, recommend that your projects have a requirements.txt file in the root directory.
Runs with Specified requirements.txt
Runs allow you to specify which requirements.txt file you want to use for package installation. This is especially useful when your requirements.txt does not reside at the project root, or when you have more than one requirements.txt file. In these cases, you can use the example below as a template for specifying which requirements.txt file should be used for package installation.
git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name specified-requirements-pip --dependency_file ./pip/requirements.txt hello.py
You may have noticed that we did something different here than in the previous example: we used the --dependency_file flag. This flag tells Grid which file should be used for package installation in the Run.
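For illustration, a pip requirements file simply pins the packages your script needs; the entries below are examples only, not the actual contents of pip/requirements.txt:

# example requirements.txt entries only
torch>=1.10
pytorch-lightning>=1.5
torchvision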
Runs with Specified environment.yml
Runs allow you to specify the environment.yml file you want to use for package installation. This is the only way to get Runs to use the Conda package manager without using a config file.
When running on a non-Linux machine, we recommend using conda env export --from-history before creating a Run, as mentioned in the official Conda documentation. Without this flag, conda env export outputs dependencies built specifically for your operating system, which may not resolve on the machine that executes the Run.
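For example, you could regenerate a portable environment file right before creating the Run (the output filename is up to you):

# export only the packages you explicitly requested, without OS-specific builds
conda env export --from-history > environment.yml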
You can use the example below as a template for specifying which environment.yml file should be used for package installation:
git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name specified-requirements-conda --dependency_file ./conda/environment.yml hello.py
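For reference, a Conda environment file generally has the following shape; the entries here are illustrative only and not the actual contents of conda/environment.yml:

# example environment.yml entries only
name: grid-example
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pytorch
  - pip
  - pip:
      - pytorch-lightning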
Attaching Datastores to Runs
To speed up training iteration time, you can store your data in a Grid Datastore. Datastores are high-performance, low-latency, versioned datasets. If you have large-scale data, Datastores can resolve blockers in your workflow by eliminating the need to download the large dataset every time your script runs.
If you haven’t done so already, create a Datastore from the cifar5 dataset using the following commands:
# download
curl https://pl-flash-data.s3.amazonaws.com/cifar5.zip -o cifar5.zip
# unzip
unzip cifar5.zip
grid datastore create cifar5/ --name cifar5
Now let’s mount this Datastore to a Run:
git clone https://github.com/PyTorchLightning/grid-tutorials.git
cd features-intro/runs
grid run --name attaching-datastore --datastore_name cifar5 --datastore_version 1 datastore.py --data_dir /datastores/cifar5/1
This command passes a script named datastore.py to the Run. The script prints the contents of the Datastore’s root directory. You should see the following output in your stdout logs:
['test', 'train']
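As a rough sketch, a script like datastore.py could produce that output along these lines; the actual script in the grid-tutorials repo may differ:

# datastore.py -- minimal sketch; the script in grid-tutorials may differ
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", type=str, required=True)
args = parser.parse_args()

# list the top-level contents of the mounted Datastore, e.g. ['test', 'train']
print(os.listdir(args.data_dir))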
Interruptible Runs
Interruptible Runs are powered by spot instances, which are 50-90% cheaper than on-demand instances but can be interrupted at any time when the cloud provider reclaims the machine. Here is how you launch a Run with spot instances:
grid run --use_spot train.py
What happens to your models if the Run gets interrupted?
Grid keeps all the artifacts that you saved during training, including logs, checkpoints and other files. This means that if you write your training script such that it periodically saves checkpoint files with all the states needed to resume your training, you can restart the Grid Run from where it was interrupted:
grid run --use_spot train.py --checkpoint_path "https://grid.ai/url/to/ckpt"
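To make this concrete, a training script might accept the checkpoint path and restore its state before continuing. Here is a minimal, hypothetical sketch of that pattern in plain PyTorch; the names are illustrative only, and a remote URL like the one above would need to be downloaded to a local path first:

# train.py -- illustrative resume logic only; details depend on your project
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint_path", type=str, default=None)
args = parser.parse_args()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

if args.checkpoint_path:
    # restore model, optimizer, and progress from the saved checkpoint
    ckpt = torch.load(args.checkpoint_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 10):
    ...  # training loop
    # periodically save everything needed to resume
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
        "checkpoint.ckpt",
    )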
Writing the logic for checkpointing and resuming the training loop correctly, however, can be difficult and time-consuming.
PyTorch Lightning removes the need to write all this boilerplate code. In fact, if you implement your training script with PyTorch Lightning, you will not have to change a single line of code to use interruptible Runs in Grid. All you have to do is add the --auto_resume flag to the grid run command to make your experiments fault-tolerant:
grid run --use_spot --auto_resume train.py
If this Run gets interrupted, PyTorch Lightning will save a fault-tolerant checkpoint automatically. Grid will collect it, provision a new machine, restart the Run for you, and let PyTorch Lightning restore the training state where it left off. Mind-blowing! Learn more about auto-resuming experiments in Grid or the fault-tolerance feature in PyTorch Lightning.
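For context, here is a minimal sketch of what a Lightning-based train.py can look like; the model and data are placeholders rather than the actual tutorial code, and no manual checkpoint-resume logic is required:

# train.py -- minimal PyTorch Lightning sketch; no manual resume logic needed
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
trainer = pl.Trainer(max_epochs=5)
trainer.fit(LitModel(), DataLoader(dataset, batch_size=8))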
And that’s it! You can check out other Grid tutorials, or browse the Grid Docs to learn more about anything not covered in this tutorial.
As always, Happy Grid-ing!