Single-GPU benchmarking in Thunder
After reading this section you'll understand what benchmarks are available in Thunder, how to run them, and how to create one yourself.
Introduction
In Thunder there are two ways of benchmarking the compiler:
One is by running a synthetic end-to-end training of a model from LitGPT; and
the second is by microbenchmarking specific snippets of code.
Before starting, you need to install Thunder and the devel packages with:
pip install -r requirements/devel.txt
pip install -e .
LitGPT benchmarks
The easiest way to run a single benchmark in Thunder is by running an instance of the LitGPT end-to-end training script, which can be found in thunder/benchmarks/benchmark_litgpt.py.
To run a benchmark all we need is the following command:
python thunder/benchmarks/benchmark_litgpt.py --model_name <model name> --compile "thunder"
All the command line options can be queried by passing --help as an argument, or can be seen here. However, the most important options for single-GPU benchmarks are:
--compile: specifies the compile mode for the run
--model_name: LitGPT model name; a list of these can be found here
--n_layers: specifies the number of layers in the model, which is useful to test a reduced version of the model when the original does not fit into memory
--nsys_enabled: inserts markers to profile the run with NVIDIA Nsight Systems
The output from this end-to-end benchmark will look something like this:
Time to instantiate model: 0.12 seconds.
iter 0: loss 10.5000, iter time: 73490.96ms, t: 4096
...
iter 44: loss 4.6250, iter time: 385.25ms, t: 4096
Model name: Llama-2-7b-hf
Seq Length: 4096
Micro BS: 1
Global BS: 1
Number of Layers: 32
Number of parameters: 6.74B
Distributed Mode: none
Compiler: thunder
Low Precision Mode: none
Average iter time: 383.11 ms
Memory used: 64.22 GB
Tokens/s: 10690.01
Tokens/s/GPU: 10690.01
TFLOP/s: 492.65
Note
Beware the memory footprint of certain models! In this example, running Llama-2-7b-hf on an H100 with the default Thunder compile option requires upwards of ~65GB of memory. Pro tip: you can always play with the --n_layers option to run reduced versions of the model that fit in memory.
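For instance, an illustrative reduced-size run (the layer count below is arbitrary, chosen only to shrink the memory footprint) would look like:
python thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --n_layers 8 --compile thunder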
Compile options
With the --compile option, you can test:
torch.compile by specifying inductor,
torch eager mode by specifying eager, or
Thunder by specifying thunder.
To customize Thunder executors in addition to nvFuser, you can append any combination of the following to the string with an underscore:
inductor_cat
cudnn
transformerengine
As an example, if you want to use cudnn as an executor, your command will look something like:
python thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile thunder_cudnn
and if you are testing torch.compile, then it will look something like this:
python thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile inductor
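The executor suffixes can also be combined. As a sketch, a run that tries to enable both the inductor_cat and cudnn executors might look like the following (an illustrative combination; check --help for the exact strings your version accepts):
python thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile thunder_inductor_cat_cudnn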
pytest benchmarks
If instead of running an e2e training benchmark you want to be more specific, Thunder has you covered with the pytest-based benchmarks (more specifically, pytest-benchmark).
These benchmarks are defined in two parts: the implementation is in thunder/benchmarks/__init__.py and the hook for pytest is in thunder/benchmarks/targets.py.
In the next section you’ll see more of the details, but for now let’s start by listing all the available benchmarks with:
pytest thunder/benchmarks/targets.py --collect-only
To run all the available benchmarks, it’s as simple as calling:
pytest thunder/benchmarks/targets.py
However, more realistically you’d want to filter and run just specific benchmarks. To do so, you can use the filter syntax along with the -k option:
pytest thunder/benchmarks/targets.py -k 'nanogpt_gpt2 and not torch.compile and not xl and not inference' --benchmark-group-by='param:compute_type'
This example will select the benchmarks, run them, and print the results grouped by compute type (forward and backward in this case) thanks to the --benchmark-group-by flag.
The output will look something like this (it's pretty wide, so it looks a bit weird on narrow windows):
------------------------------------------------------------------- benchmark 'compute_type=ComputeType.TRAINING_BACKWARD': 2 tests ---------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_nanogpt_gpt2[backward-torch] 11.1503 (1.0) 11.7122 (1.0) 11.2785 (1.0) 0.0973 (1.65) 11.2674 (1.0) 0.1069 (1.12) 16;4 88.6641 (1.0) 93 1
test_nanogpt_gpt2[backward-thunder] 11.4634 (1.03) 11.7805 (1.01) 11.6194 (1.03) 0.0590 (1.0) 11.6087 (1.03) 0.0952 (1.0) 28;0 86.0632 (0.97) 91 1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_FORWARD': 2 tests -----------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_nanogpt_gpt2[forward-torch] 5.0307 (1.0) 5.5468 (1.0) 5.1072 (1.0) 0.0901 (1.0) 5.0885 (1.0) 0.0402 (1.0) 11;15 195.8038 (1.0) 228 1
test_nanogpt_gpt2[forward-thunder] 7.5619 (1.50) 8.0979 (1.46) 7.6878 (1.51) 0.1358 (1.51) 7.6421 (1.50) 0.0602 (1.50) 15;15 130.0763 (0.66) 133 1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
================================================================ 4 passed, 598 deselected in 113.92s (0:01:53) ====================================================================================
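Since --benchmark-group-by also accepts pytest parameters via the param: prefix, you could, for instance, group the results by the executor parameter used in the parametrization shown later in this guide (an illustrative invocation):
pytest thunder/benchmarks/targets.py -k 'nanogpt_gpt2 and not xl and not inference' --benchmark-group-by='param:executor'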
Comparing pytest runs
Another tool at your disposal is the comparison offered by pytest-benchmark:
pytest thunder/benchmarks/targets.py --benchmark-autosave -k "thunder]"
[... your changes ...]
pytest thunder/benchmarks/targets.py --benchmark-autosave -k "thunder]"
pytest-benchmark compare 0001 0002 --group-by='name'
By using --benchmark-autosave, pytest will save the results so that you can read or compare them later.
Writing your own benchmark
Now that you’ve seen how the benchmarks work, it’s time to add your own benchmark to Thunder by:
Creating a class that is a subclass of thunder.benchmarks.Benchmark and defining its methods;
Declaring a function with a name starting with test_ that uses the class created in the previous step; and
Parametrizing the function with all the options needed.
Let’s take a deeper dive for each point.
Creating a benchmarking class
As stated before, you need to create a class that inherits from thunder.benchmarks.Benchmark, as follows:
from thunder.benchmarks import Benchmark, BenchmarkArg
class FooBenchmark(Benchmark):
@classmethod
@property
def name(cls) -> str:
return "foo_bench"
@classmethod
@property
def description(cls) -> str:
return "Benchmark for foo function"
Note
The name should be short, distinct, and a valid filename, like "nanogpt" or "llama-block", and
the description should be a short sentence describing the benchmark, like "NanoGPT's LayerNorm module forward".
The next step is to declare a list of accepted arguments for this benchmark as a property of the class, plus a class method that returns those arguments:
_args = (
BenchmarkArg(name="device", description="A string representing the device. Default is 'cuda'."),
BenchmarkArg(name="dtype", description="The dtype of the tensors. Default is thunder.float32."),
)
@classmethod
@property
def args(cls) -> tuple[BenchmarkArg, ...]:
return cls._args
Now that the arguments are set up, the __init__() method must be implemented:
def __init__(self, device="cuda", dtype=thunder.float32):
    super().__init__()
    self.device: str = device
    self.dtype: dtypes.dtype = dtype
Note
__init__() should call super() and it can accept additional optional parameters (like parameters with default values or kwargs) other than the BenchmarkArg parameters, but these must come after the benchmark arg parameters.
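For example, a hypothetical extra flag such as requires_grad (which the postprocess_for_backward() example further below reads as self.requires_grad) could be accepted after the benchmark arguments:
def __init__(self, device="cuda", dtype=thunder.float32, requires_grad: bool = True):
    # requires_grad is an extra optional parameter and comes after the BenchmarkArg parameters
    super().__init__()
    self.device: str = device
    self.dtype = dtype
    self.requires_grad: bool = requires_grad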
Next, you'll want to create the data for your benchmark. To do so, you must implement a make_batch() method that prepares a valid input for the benchmark, possibly modified by the initialization arguments:
def make_batch(self) -> tuple[list, dict]:
make = partial(make_tensor, device=self.device, dtype=self.dtype)
return (make(10, 10),), {}
Now comes the best part: the fn() method, which should return the callable that will be benchmarked. The returned callable should accept the output of make_batch():
def fn(self) -> Callable:
def foo(a):
return a + a
return foo
If your benchmark doesn't need any further steps, you'd be done here. However, consider the case where you want to benchmark a model; then your fn() method would look something like:
def fn(self) -> Callable:
class FooNetwork(torch.nn.Module):
def __init__(self):
super().__init__()
self.layer = torch.nn.Linear(10, 10)
def forward(self, x):
return self.layer(x)
foo = FooNetwork().to(device=self.device, dtype=self.dtype).requires_grad_()
return foo
Now this is only half of the test: what about the backward pass? In this case, you'll need to implement a postprocess_for_backward() method to take care of that:
def postprocess_for_backward(self, out: torch.Tensor) -> torch.Tensor | None:
    # Check if backward is needed at all
    if not self.requires_grad:
        return
    targets = make_tensor_like(out)  # fake targets
    loss = torch.nn.functional.mse_loss(out, targets)
    return loss
Note
This method will be given the output of fn(), and if it returns a torch.Tensor t that requires grad then the benchmark will call t.backward(torch.randn_like(t)). By default, postprocess_for_backward() returns the output of fn(), or the first element of the output of fn() if fn() returns a Sequence.
Declaring a test function and its parametrization
Now that your benchmarking class is ready, it needs somewhere to be called from. To do that, let's write a test_-prefixed function in thunder/benchmarks/targets.py that will use the newly created FooBenchmark class:
def test_foo(benchmark):
bench: Benchmark = FooBenchmark(device="cuda", dtype=thunder.bfloat16)
args, kwargs = bench.make_batch()
benchmark(bench.fn(), *args, **kwargs)
Great! You are ready to benchmark foo()! But what if you want to test it with different Thunder executors? This is where parametrization comes in. To parametrize the function, all that's needed is the @pytest.mark.parametrize decorator, as follows:
@pytest.mark.parametrize(
"executor",
(
torch_executor,
torch_compile_executor,
thunder_executor,
),
ids=("torch", "torch.compile", "thunder"),
)
def test_foo(benchmark, executor):
bench: Benchmark = FooBenchmark(device="cuda", dtype=thunder.bfloat16)
args, kwargs = bench.make_batch()
fn = executor(bench.fn())
benchmark(fn, *args, **kwargs)
Here you go, now you are ready to start benchmarking! For more information about the parametrization syntax, you can take a look here.
Benchmarking forward and backward separately
As seen earlier, it's possible to write benchmarks for models and not just standalone functions. What if you want to benchmark the forward and backward passes separately? It's possible by tweaking the test_ function you just declared in thunder/benchmarks/targets.py like so:
#[...previous parametrization omitted here...]
@parametrize_compute_type
def test_foo(benchmark, executor, compute_type: ComputeType):
bench: Benchmark = FooBenchmark(device="cuda", dtype=thunder.bfloat16)
args, kwargs = bench.make_batch()
fn = executor(bench.fn())
benchmark_for_compute_type(compute_type, benchmark, fn, *args, **kwargs)
And it's as simple as that! Just add the @parametrize_compute_type decorator after your parametrization, add the compute_type argument, and use benchmark_for_compute_type to call the benchmark function.
Isolate benchmarks to avoid OutOfMemory errors
When running multiple benchmarks in sequence, pytest does not always do a good job of cleaning up, and benchmarks that work when called standalone can sometimes fail anyway. The main problem we observed is that memory is not entirely freed before the next benchmark runs, which is where the --isolate-benchmarks option comes to the rescue. It separates the benchmark runs, creating a sub-process for each benchmark configuration and running them one after the other. Logs of failures will be saved in the failed_benchmarks_logs folder, and benchmark results will be saved as JSON in the benchmarks_reports folder unless the THUNDER_BENCH_DIR environment variable is specified.
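For example, an isolated run of the nanogpt benchmarks filtered earlier (the filter and the output directory below are illustrative) could look like:
THUNDER_BENCH_DIR=/tmp/thunder_reports pytest thunder/benchmarks/targets.py -k 'nanogpt_gpt2' --isolate-benchmarks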