Validate and test a model (intermediate)

During and after training, we need a way to evaluate our models to make sure they are not overfitting and that they generalize well to unseen, real-world data. There are generally two stages of evaluation: validation and testing. To some degree they serve the same purpose, making sure the model works on real data, but they differ in practice.

Validation is usually done during training, traditionally after each training epoch. It can be used for hyperparameter optimization or tracking model performance during training. It’s a part of the training process.

Testing is usually done once we are satisfied with the training and only with the best model selected from the validation metrics.

Let’s see how these can be performed with Lightning.

Testing

Lightning allows the user to test their models with any compatible test dataloaders. This can be done before or after training and is completely independent of the fit() call. The logic used here is defined in test_step().

Testing is performed using the Trainer object’s .test() method.

Trainer.test(model=None, dataloaders=None, ckpt_path=None, verbose=True, datamodule=None, weights_only=None)

Perform one evaluation epoch over the test set. It’s separated from fit to make sure you never run on your test set until you want to.

Parameters:

model (Optional[LightningModule]) – The model to test.
dataloaders (Union[Any, LightningDataModule, None]) – An iterable or collection of iterables specifying test samples. Alternatively, a LightningDataModule that defines the test_dataloader hook.
ckpt_path (Union[str, Path, None]) – Either "best", "last", "hpc", "registry" or path to the checkpoint you wish to test. If None and the model instance was passed, use the current weights. Otherwise, the best model checkpoint from the previous trainer.fit call will be loaded if a checkpoint callback is configured.
verbose (bool) – If True, prints the test results.
datamodule (Optional[LightningDataModule]) – A LightningDataModule that defines the test_dataloader hook.
weights_only (Optional[bool]) – Defaults to None. If True, restricts loading to state_dicts of plain torch.Tensor and other primitive types. If loading a checkpoint from a trusted source that contains an nn.Module, use weights_only=False. If loading checkpoint from an untrusted source, we recommend using weights_only=True. For more information, please refer to the PyTorch Developer Notes on Serialization Semantics.

For more information about multiple dataloaders, see this section.

Return type:

list[Mapping[str, float]]

Returns:

List of dictionaries with metrics logged during the test phase, e.g., in model- or callback hooks like test_step() etc. The length of the list corresponds to the number of test dataloaders used.

Raises:

TypeError – If no model is passed and there was no LightningModule passed in the previous run. If model passed is not LightningModule or torch._dynamo.OptimizedModule.
MisconfigurationException – If both dataloaders and datamodule are passed. Pass only one of these.
RuntimeError – If a compiled model is passed and the strategy is not supported.

Test after Fit

To run the test set after training completes, use this method.

# run full training
trainer.fit(model)

# (1) load the best checkpoint automatically (Lightning tracks this for you during .fit())
trainer.test(ckpt_path="best")

# (2) load the last available checkpoint (only works if `ModelCheckpoint(save_last=True)`)
trainer.test(ckpt_path="last")

# (3) test using a specific checkpoint
trainer.test(ckpt_path="/path/to/my_checkpoint.ckpt")

# (4) test with an explicit model (will use this model and not load a checkpoint)
trainer.test(model)

Warning

It is recommended to test with Trainer(devices=1), since distributed strategies such as DDP use DistributedSampler internally, which replicates some samples so that all devices receive the same batch size when the inputs are uneven. Replicated samples are evaluated more than once, so single-device testing ensures that each sample contributes to the metrics exactly once, which matters when benchmarking for research papers.
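
To make the replication concrete, the following is a minimal pure-Python sketch of the padding arithmetic a distributed sampler applies when the dataset size is not divisible by the number of devices. It imitates, under our assumptions, the non-shuffled, drop_last=False behaviour of torch.utils.data.DistributedSampler; the helper name shard_indices is hypothetical, and nothing here calls PyTorch or Lightning.

```python
import math


def shard_indices(num_samples: int, num_replicas: int) -> list[list[int]]:
    """Split sample indices across replicas, padding so every replica gets the same count.

    Hypothetical helper for illustration only; not part of Lightning or PyTorch.
    """
    per_replica = math.ceil(num_samples / num_replicas)
    total_size = per_replica * num_replicas

    indices = list(range(num_samples))
    padding = total_size - len(indices)
    # Repeat samples from the start of the dataset until the total is divisible.
    while padding > 0:
        extra = indices[:padding]
        indices += extra
        padding -= len(extra)

    # Replica `rank` takes every num_replicas-th index, starting at `rank`.
    return [indices[rank:total_size:num_replicas] for rank in range(num_replicas)]


# 10 samples spread over 4 devices: 12 evaluations are needed for equal shards.
shards = shard_indices(num_samples=10, num_replicas=4)
evaluated = [i for shard in shards for i in shard]
duplicates = sorted(i for i in set(evaluated) if evaluated.count(i) > 1)
```

In this sketch, 10 samples on 4 devices lead to 12 evaluations, with samples 0 and 1 each evaluated twice, whereas a single device sees every sample exactly once.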

Test Multiple Models

You can run the test set on multiple models using the same trainer instance.

model1 = LitModel()
model2 = GANModel()

trainer = Trainer()
trainer.test(model1)
trainer.test(model2)

Test Pre-Trained Model

To run the test set on a pre-trained model, use this method.

model = MyLightningModule.load_from_checkpoint(
    checkpoint_path="/path/to/pytorch_checkpoint.ckpt",
    hparams_file="/path/to/experiment/version/hparams.yaml",
    map_location=None,
)

# init trainer with whatever options
trainer = Trainer(...)

# test (pass in the model)
trainer.test(model)

In this case, the options you pass to the Trainer will be used when running the test set (e.g., 16-bit precision, DDP, etc.).

Test with Additional DataLoaders

You can still run inference on a test dataset even if the test_dataloader() method hasn’t been defined in your LightningModule instance. This is the case when your test data is not available at the time your model is declared.

# setup your data loader
test_dataloader = DataLoader(...)

# test (pass in the loader)
trainer.test(dataloaders=test_dataloader)

You can pass in either a single dataloader or a list of them. This optional named parameter can be used in conjunction with any of the above use cases. You can also pass in a datamodule that overrides the test_dataloader method.

class MyDataModule(L.LightningDataModule):
    ...

    def test_dataloader(self):
        return DataLoader(...)


# setup your datamodule
dm = MyDataModule(...)

# test (pass in datamodule)
trainer.test(datamodule=dm)

Test with Multiple DataLoaders

When you need to evaluate your model on multiple test datasets simultaneously (e.g., different domains, conditions, or evaluation scenarios), PyTorch Lightning supports multiple test dataloaders out of the box.

To use multiple test dataloaders, simply return a list of dataloaders from your test_dataloader() method:

class LitModel(L.LightningModule):
    def test_dataloader(self):
        return [
            DataLoader(clean_test_dataset, batch_size=32),
            DataLoader(noisy_test_dataset, batch_size=32),
            DataLoader(adversarial_test_dataset, batch_size=32),
        ]

When using multiple test dataloaders, your test_step method must include a dataloader_idx parameter:

def test_step(self, batch, batch_idx, dataloader_idx: int = 0):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)

    # Use dataloader_idx to handle different test scenarios
    return {'test_loss': loss}

Logging Metrics Per Dataloader

Lightning provides automatic support for logging metrics per dataloader:

def test_step(self, batch, batch_idx, dataloader_idx: int = 0):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (y_hat.argmax(dim=1) == y).float().mean()

    # Lightning automatically adds "/dataloader_idx_X" suffix
    self.log('test_loss', loss, add_dataloader_idx=True)
    self.log('test_acc', acc, add_dataloader_idx=True)

    return loss

This will create metrics like test_loss/dataloader_idx_0, test_loss/dataloader_idx_1, etc.

For more meaningful metric names, you can define your own naming scheme; in that case, you must make sure that each metric name is unique across dataloaders.

def test_step(self, batch, batch_idx, dataloader_idx: int = 0):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (y_hat.argmax(dim=1) == y).float().mean()

    # Define meaningful names for each dataloader
    dataloader_names = {0: "clean", 1: "noisy", 2: "adversarial"}
    dataset_name = dataloader_names.get(dataloader_idx, f"dataset_{dataloader_idx}")

    # Log with custom names; disable the automatic "/dataloader_idx_X" suffix
    self.log(f'test_loss_{dataset_name}', loss, add_dataloader_idx=False)
    self.log(f'test_acc_{dataset_name}', acc, add_dataloader_idx=False)
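
Because the automatic suffix is disabled, uniqueness becomes your responsibility. As a rough illustration, the pure-Python sketch below builds metric keys with the same fallback pattern as above and checks them for collisions before any run. The helpers metric_name and find_collisions are hypothetical names introduced for this example; they are not part of Lightning.

```python
def metric_name(base: str, dataloader_idx: int, names: dict[int, str]) -> str:
    """Build a metric key, falling back to a generic name for unmapped dataloaders."""
    dataset_name = names.get(dataloader_idx, f"dataset_{dataloader_idx}")
    return f"{base}_{dataset_name}"


def find_collisions(keys: list[str]) -> set[str]:
    """Return the keys that occur more than once."""
    seen, collisions = set(), set()
    for key in keys:
        if key in seen:
            collisions.add(key)
        seen.add(key)
    return collisions


good_names = {0: "clean", 1: "noisy", 2: "adversarial"}
bad_names = {0: "clean", 1: "clean"}  # two dataloaders share a name by mistake

good_keys = [metric_name("test_loss", idx, good_names) for idx in range(4)]
bad_keys = [metric_name("test_loss", idx, bad_names) for idx in range(2)]

good_collisions = find_collisions(good_keys)  # empty: every key is distinct
bad_collisions = find_collisions(bad_keys)    # contains the duplicated key
```

A check like this could run once at setup time, so that a duplicated name is caught before metrics from two dataloaders end up under the same key.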

Processing Entire Datasets Per Dataloader

To perform calculations on the entire test dataset for each dataloader (e.g., computing overall metrics, creating visualizations), accumulate results during test_step and process them in on_test_epoch_end:

import lightning as L
import torch
import torch.nn.functional as F


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Store outputs per dataloader
        self.test_outputs = {}

    def test_step(self, batch, batch_idx, dataloader_idx: int = 0):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)

        # Initialize and store results
        if dataloader_idx not in self.test_outputs:
            self.test_outputs[dataloader_idx] = {'predictions': [], 'targets': []}
        self.test_outputs[dataloader_idx]['predictions'].append(y_hat)
        self.test_outputs[dataloader_idx]['targets'].append(y)
        return loss

    def on_test_epoch_end(self):
        for dataloader_idx, outputs in self.test_outputs.items():
            # Concatenate all predictions and targets for this dataloader
            all_predictions = torch.cat(outputs['predictions'], dim=0)
            all_targets = torch.cat(outputs['targets'], dim=0)

            # Calculate metrics on the entire dataset, log and create visualizations
            overall_accuracy = (all_predictions.argmax(dim=1) == all_targets).float().mean()
            self.log(f'test_overall_acc_dataloader_{dataloader_idx}', overall_accuracy)
            # _save_results is a user-defined helper (not shown here)
            self._save_results(all_predictions, all_targets, dataloader_idx)

        self.test_outputs.clear()
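
One reason to accumulate outputs rather than combine per-batch numbers by hand is that a plain, unweighted mean of per-batch accuracies differs from the accuracy over the whole dataset whenever batch sizes are uneven (for example, a smaller final batch). The sketch below shows only this arithmetic with made-up labels in pure Python; it makes no claim about how self.log reduces values internally.

```python
def accuracy(predictions: list[int], targets: list[int]) -> float:
    """Fraction of predictions that match their targets."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Two hypothetical batches of (predicted labels, true labels); the last one is smaller.
batches = [
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # 4 of 4 correct
    ([1], [0]),                    # 0 of 1 correct
]

# Unweighted mean of per-batch accuracies: each batch counts equally.
per_batch = [accuracy(p, t) for p, t in batches]
naive_mean = sum(per_batch) / len(per_batch)

# Accuracy over the concatenated dataset: each sample counts equally.
all_predictions = [label for p, _ in batches for label in p]
all_targets = [label for _, t in batches for label in t]
overall = accuracy(all_predictions, all_targets)
```

Here the unweighted mean is 0.5 while the dataset-level accuracy is 0.8, which is the quantity the concatenation in on_test_epoch_end computes.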

Note

When using multiple test dataloaders, trainer.test() returns a list of results, one for each dataloader:

results = trainer.test(model)
print(f"Results from {len(results)} test dataloaders:")
for i, result in enumerate(results):
    print(f"Dataloader {i}: {result}")
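
If the default "/dataloader_idx_X" suffix is in use, it can be convenient to strip it so that the same metric name can be compared across dataloaders. The following pure-Python sketch assumes the returned list has the shape described above; the numbers are invented for illustration and strip_dataloader_suffix is a hypothetical helper, not a Lightning function.

```python
from collections.abc import Mapping

SUFFIX = "/dataloader_idx_"


def strip_dataloader_suffix(results: list[Mapping[str, float]]) -> list[dict[str, float]]:
    """Drop the '/dataloader_idx_X' suffix from every metric key."""
    cleaned = []
    for metrics in results:
        cleaned.append({name.split(SUFFIX)[0]: value for name, value in metrics.items()})
    return cleaned


# Made-up values shaped like the list returned for two test dataloaders.
results = [
    {"test_loss/dataloader_idx_0": 0.25, "test_acc/dataloader_idx_0": 0.5},
    {"test_loss/dataloader_idx_1": 0.75, "test_acc/dataloader_idx_1": 0.25},
]

per_loader = strip_dataloader_suffix(results)
# Index of the dataloader with the highest test loss.
worst = max(range(len(per_loader)), key=lambda i: per_loader[i]["test_loss"])
```

Keys without the suffix pass through unchanged, so the helper also works with custom metric names.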

Validation

Lightning allows the user to validate their models with any compatible validation dataloaders. This can be done before or after training. The validation logic is defined in validation_step().

Apart from this, .validate() has the same API as .test(); the two rely on validation_step() and test_step(), respectively.

Note

The .validate() method uses the same validation logic that runs during the fit() call.

Warning

When using trainer.validate(), it is recommended to use Trainer(devices=1), since distributed strategies such as DDP use DistributedSampler internally, which replicates some samples so that all devices receive the same batch size when the inputs are uneven. Replicated samples are evaluated more than once, so single-device validation ensures that each sample contributes to the metrics exactly once, which matters when benchmarking for research papers.

Trainer.validate(model=None, dataloaders=None, ckpt_path=None, verbose=True, datamodule=None, weights_only=None)

Perform one evaluation epoch over the validation set.

Parameters:

model (Optional[LightningModule]) – The model to validate.
dataloaders (Union[Any, LightningDataModule, None]) – An iterable or collection of iterables specifying validation samples. Alternatively, a LightningDataModule that defines the val_dataloader hook.
ckpt_path (Union[str, Path, None]) – Either "best", "last", "hpc", "registry" or path to the checkpoint you wish to validate. If None and the model instance was passed, use the current weights. Otherwise, the best model checkpoint from the previous trainer.fit call will be loaded if a checkpoint callback is configured.
verbose (bool) – If True, prints the validation results.
datamodule (Optional[LightningDataModule]) – A LightningDataModule that defines the val_dataloader hook.
weights_only (Optional[bool]) – Defaults to None. If True, restricts loading to state_dicts of plain torch.Tensor and other primitive types. If loading a checkpoint from a trusted source that contains an nn.Module, use weights_only=False. If loading checkpoint from an untrusted source, we recommend using weights_only=True. For more information, please refer to the PyTorch Developer Notes on Serialization Semantics.

For more information about multiple dataloaders, see this section.

Return type:

list[Mapping[str, float]]

Returns:

List of dictionaries with metrics logged during the validation phase, e.g., in model- or callback hooks like validation_step() etc. The length of the list corresponds to the number of validation dataloaders used.

Raises:

TypeError – If no model is passed and there was no LightningModule passed in the previous run. If model passed is not LightningModule or torch._dynamo.OptimizedModule.
MisconfigurationException – If both dataloaders and datamodule are passed. Pass only one of these.
RuntimeError – If a compiled model is passed and the strategy is not supported.