
Train on the cloud (intermediate)

Audience: User looking to run many models at once


What is a sweep?

A sweep is the term given to running the same model multiple times with different hyperparameters to find the combination that performs best (according to your definition of performance).

Let’s say I have a Python script that trains a Lightning model to classify images. We run this file like so:

grid run file.py --batch_size 8

With such a model, I would be interested in knowing how it performs with different batch sizes. In this case, I’m going to train many versions of this model.

# run 4 models in parallel
grid run file.py --batch_size 8
grid run file.py --batch_size 16
grid run file.py --batch_size 32
grid run file.py --batch_size 64

Now I can see how my model performs at each batch size and, based on time and cost, pick my “best” model:

Training speed vs cost

Batch size    Classification accuracy    Training time    Cost
8             0.80                       5 minutes        $0.15
16            0.85                       10 minutes       $0.30
32            0.90                       30 minutes       $0.50
64            0.95                       60 minutes       $1.01
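
Picking the “best” model from the results listed above can be sketched in a few lines of Python. This is only an illustration: the pick_best helper and the time and cost limits are hypothetical, and the numbers are copied from the results above.

```python
# Results copied from the table above:
# (batch size, accuracy, training time in minutes, cost in USD).
results = [
    (8, 0.80, 5, 0.15),
    (16, 0.85, 10, 0.30),
    (32, 0.90, 30, 0.50),
    (64, 0.95, 60, 1.01),
]


def pick_best(results, max_minutes, max_cost):
    """Return the most accurate run that fits within both limits, or None."""
    affordable = [r for r in results if r[2] <= max_minutes and r[3] <= max_cost]
    return max(affordable, key=lambda r: r[1]) if affordable else None


# Hypothetical limits: at most 30 minutes and $0.50 per run.
best = pick_best(results, max_minutes=30, max_cost=0.50)
print(best)  # (32, 0.9, 30, 0.5)
```

With looser limits the larger batch size wins; with tighter ones the helper returns None, meaning no run fits the budget.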


Start a Sweep

First, recall that in the previous tutorial we ran a single model using this command:

grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5

Now we’re going to run that same model 4 different times, each with a different batch size:

grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 8
grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 16
grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 32
grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 64

Grid has a special syntax, based on Python, that gives you shortcuts for sweeps. The shortcut for the above commands is:

grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size "[8, 16, 32, 64]"

Syntax Shortcuts

List

grid run file.py --batch_size "[8, 16, 32, 64]"

equivalent to:

grid run file.py --batch_size 8
grid run file.py --batch_size 16
grid run file.py --batch_size 32
grid run file.py --batch_size 64
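
To make the expansion concrete, here is a minimal Python sketch of how a list expression could be turned into one command per value. It is not Grid's implementation: expand_list is a hypothetical helper, and it assumes the quoted expression is a plain Python literal.

```python
import ast


def expand_list(script, flag, spec):
    """Expand a list expression such as "[8, 16]" into one command per value."""
    values = ast.literal_eval(spec)  # parses Python literals only, never runs code
    return [f"grid run {script} {flag} {value}" for value in values]


for command in expand_list("file.py", "--batch_size", "[8, 16, 32, 64]"):
    print(command)
```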

Range

grid run file.py --batch_size "range(1, 10, 2)"

equivalent to:

grid run file.py --batch_size 1
grid run file.py --batch_size 3
grid run file.py --batch_size 5
grid run file.py --batch_size 7
grid run file.py --batch_size 9
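
The values follow the rules of Python's built-in range(start, stop, step): counting starts at start, advances by step, and stops before reaching stop. A quick check in plain Python, assuming Grid applies the same semantics:

```python
# range(1, 10, 2): start at 1, step by 2, stop before 10.
values = list(range(1, 10, 2))
print(values)  # [1, 3, 5, 7, 9]

# One command per generated value.
commands = [f"grid run file.py --batch_size {value}" for value in values]
for command in commands:
    print(command)
```

Because the stop value is exclusive, range(1, 9, 2) would end at 7 rather than 9.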

String list

grid run file.py --model_backbone "['resnet18', 'transformer', 'resnet50']"

equivalent to:

grid run file.py --model_backbone 'resnet18'
grid run file.py --model_backbone 'transformer'
grid run file.py --model_backbone 'resnet50'
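
Take care with the commas in a string list. In Python, two adjacent string literals are joined into one, so a missing comma silently merges two values instead of raising an error. The sketch below shows this in plain Python; whether Grid parses the expression in exactly the same way is an assumption.

```python
import ast

# Every item is separated by a comma: three values.
with_commas = ast.literal_eval("['resnet18', 'transformer', 'resnet50']")
print(with_commas)  # ['resnet18', 'transformer', 'resnet50']

# Comma missing after 'resnet18': Python concatenates the adjacent literals.
missing_comma = ast.literal_eval("['resnet18' 'transformer', 'resnet50']")
print(missing_comma)  # ['resnet18transformer', 'resnet50']
```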

Sampling

grid run file.py --learning_rate "uniform(1e-5, 1e-1, 3)"

equivalent to the following (the values are sampled at random, so yours will differ):

grid run file.py --learning_rate 0.03977392
grid run file.py --learning_rate 0.04835479
grid run file.py --learning_rate 0.05200016
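
A rough Python sketch of uniform sampling is shown below. It assumes the three arguments mean (low, high, number of samples); sample_uniform and its seed parameter are hypothetical and only make the illustration repeatable, they are not part of Grid.

```python
import random


def sample_uniform(low, high, count, seed=None):
    """Draw `count` values uniformly at random between `low` and `high`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [rng.uniform(low, high) for _ in range(count)]


learning_rates = sample_uniform(1e-5, 1e-1, 3, seed=0)
for lr in learning_rates:
    print(f"grid run file.py --learning_rate {lr:.8f}")
```

Each call without a seed produces different values, which is why the three learning rates shown above are only an example.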

Sweep strategies

Models often have dozens of hyperparameters. We usually don’t run every combination because doing so would be prohibitively expensive. Grid supports two strategies:




Next Steps

Here are the recommended next steps depending on your workflow.