Does not run validation step after epoch when running with all data

jmr · April 27, 2023, 11:14am

Hi,

I’ve got the following module:

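import collections
import pathlib
from typing import Any, List, Tuple

import pytorch_lightning as pl  # assumed; could equally be `import lightning.pytorch as pl`
import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader

# torch_dtype and EvaluationDataset are defined elsewhere in the project (not shown here).

class EvaluationModel(pl.LightningModule):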
    def __init__(
            self,
            train_data: List[pathlib.Path],
            val_data: List[pathlib.Path],
            batch_size=1024,
            learning_rate=1e-3,
            hidden_layers=10,
            hidden_layer_width=256
    ):
        super().__init__()
        self.train_data = train_data
        self.val_data = val_data
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.save_hyperparameters()

        layers: List[Tuple[str, Any]] = [
            ('linear-entry', nn.Linear((12 * 8 * 8) + 3, hidden_layer_width, dtype=torch_dtype, bias=False)),
            ('activation-entry', nn.ReLU())
        ]

        for i in range(hidden_layers):
            layers.append(
                (f'linear-{i}', nn.Linear(hidden_layer_width, hidden_layer_width, dtype=torch_dtype))
            )
            layers.append(
                (f'activation-{i}', nn.ReLU())
            )

        layers.append(('linear', nn.Linear(hidden_layer_width, 1, dtype=torch_dtype)))
        self.seq = nn.Sequential(collections.OrderedDict(layers))

    def forward(self, board, features):
        x = torch.cat([
            torch.flatten(board, 1),
            features
        ], 1)
        return self.seq(x)

    def training_step(self, batch, batch_idx):
        y = batch['score']
        y_hat = self.forward(batch['board'], batch['features'])
        loss = F.l1_loss(y_hat, y)
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        y = batch['score']
        y_hat = self.forward(batch['board'], batch['features'])
        loss = F.l1_loss(y_hat, y)
        self.log("val_loss", loss, prog_bar=True)
        return loss

    def train_dataloader(self) -> DataLoader:
        dataset = EvaluationDataset(self.train_data)
        return DataLoader(dataset, batch_size=self.batch_size, num_workers=32, pin_memory=True, drop_last=True)

    def val_dataloader(self) -> DataLoader:
        dataset = EvaluationDataset(self.val_data)
        return DataLoader(dataset, batch_size=self.batch_size, num_workers=32, drop_last=True)

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.learning_rate)
        scheduler = ReduceLROnPlateau(optimizer, mode="min", verbose=True)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "epoch",
                "frequency": 1,
                "monitor": "val_loss",
                "strict": True,
            }
        }

and the following trainer setup:

    lr = 1e-3
    model = EvaluationModel(
        train_paths, val_paths,
        batch_size=1024 * 4,
        learning_rate=lr,
        hidden_layers=6,
        hidden_layer_width=2048
    )
    callbacks = [
        StochasticWeightAveraging(swa_lrs=lr, device=None),
        EarlyStopping(monitor="val_loss", verbose=True, check_on_train_epoch_end=False),
        LearningRateMonitor(logging_interval='epoch', log_momentum=True),
        ModelCheckpoint(
            filename='epoch={epoch}-step={step}-val_loss={val_loss:.3f}-train_loss={train_loss_epoch:.3f}',
            save_top_k=-1,
        )
    ]
    accumulate_grad_batches = 7
    tb_logger = TensorBoardLogger(save_dir="logs_train3_2/")
    trainer = pl.Trainer(
        accelerator="gpu",
        max_epochs=2000,
        callbacks=callbacks,
        accumulate_grad_batches=accumulate_grad_batches,
        #precision='16-mixed',
        logger=tb_logger,
        # limit_train_batches=4,
        # limit_val_batches=4,
        # log_every_n_steps=1,
    )

    trainer.fit(
        model,
        # ckpt_path=list(pathlib.Path(r"logs_train3/lightning_logs/version_0/checkpoints").rglob("*.ckpt"))[0]
    )

Seems that if I un-comment the following lines:

        # limit_train_batches=4,
        # limit_val_batches=4,
        # log_every_n_steps=1,

everything works as expected, i.e., after each training epoch it runs a validation epoch, and the early stopping callback behaves correctly.

However, if I comment these out, it doesn’t seem to run validation after epoch 0.
So far, with check_on_train_epoch_end=False in EarlyStopping, it hasn’t failed on epoch 0 and is still running (epoch 1 now, but no val_loss metric in the progress bar).
If I remove check_on_train_epoch_end=False from EarlyStopping, it fails with an exception saying it can’t find the val_loss metric once epoch 0 has finished.

Am I missing something? Why is it not running validation after epoch 0 when running with all the batches?

jmr · April 27, 2023, 11:44am

Seems it moved on to epoch 2 without having run validation.

jmr · April 27, 2023, 7:23pm

I suspect I am hitting:

github.com/Lightning-AI/lightning

IterableDataset with wrong length causes validation loop to be skipped.

opened 12:48PM - 01 Nov 21 UTC

closed 02:46AM - 09 Jan 22 UTC

jopo666

bug help wanted won't fix

## 🐛 Bug

When `IterableDataset` has a wrong length defined, specifically a higher than the actual number of iterations, the validation epoch is skipped.

### To Reproduce

```
import os

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(IterableDataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def sample_queue(self, indices):
        for index in indices:
            yield self.data[index]

    def __len__(self):
        return self.len * 2

    def __iter__(self):
        indices = list(range(len(self.data)))
        # Get worker info.
        worker_info = get_worker_info()
        if worker_info is None:
            # Only a single process, so it gets all the data.
            return self.sample_queue(indices)
        else:
            # Divide indices to workers.
            worker_indices = indices[worker_info.id::worker_info.num_workers]
            return self.sample_queue(worker_indices)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=2,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()
```

### Expected behavior

Run validations loops even with the wrong length.

### Environment

```
Versions
Collecting environment information...
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.33

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-38-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla V100-PCIE-16GB
Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.10.0+cu113
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.1+cu113
```

awaelchli · April 28, 2023, 6:12pm

@jmr Are you working with an iterable-style dataset here?

jmr · April 28, 2023, 6:42pm

I am. I have a few hundred GiB of training data chunked into 4 GiB files, and the dataset reads one file and then moves on to the next.

Also, I know exactly how many samples there are of that data, and how many batches that will end up being, as each sample is of the same size, and sum of sizes of all files / sample size gives me that number.
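
The arithmetic described above can be sketched as follows. This is a minimal illustration, assuming the files hold nothing but fixed-size records (no headers or padding); `count_samples` and `count_batches` are hypothetical helper names, not part of PyTorch or Lightning.

```python
import os


def count_samples(paths, sample_size_bytes):
    """Total number of samples across all files, assuming fixed-size records."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    if total_bytes % sample_size_bytes != 0:
        raise ValueError("total size is not a whole number of samples")
    return total_bytes // sample_size_bytes


def count_batches(num_samples, batch_size, drop_last=True):
    """Batches per epoch when a single reader consumes every sample in order."""
    if drop_last:
        return num_samples // batch_size
    return -(-num_samples // batch_size)  # ceiling division
```

Note that `count_batches` only holds for a single reader; with several DataLoader workers the real number of batches can differ.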

Seems that if I don’t declare __len__ on my iterable dataset it works, but then I get no progress reports, which is disappointing (given I know exactly how much data there is).
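
One possible way a correct sample count can still overstate the number of batches, offered as an assumption and not something confirmed in this thread: for an iterable dataset that defines a length, `len(DataLoader)` is an estimate derived from `len(dataset)` and `batch_size`, while with `num_workers > 0` each worker batches its own shard and, with `drop_last=True`, drops its own incomplete last batch. The sketch below only does the arithmetic; `declared_batches` and `yielded_batches` are hypothetical names, and an even split across workers is assumed.

```python
def declared_batches(total_samples, batch_size, drop_last=True):
    """Estimate reported by len(DataLoader) for an iterable dataset with a length."""
    if drop_last:
        return total_samples // batch_size
    return -(-total_samples // batch_size)  # ceiling division


def yielded_batches(samples_per_worker, batch_size, drop_last=True):
    """Batches actually produced when every worker batches its own shard."""
    if drop_last:
        return sum(n // batch_size for n in samples_per_worker)
    return sum(-(-n // batch_size) for n in samples_per_worker)


total, batch_size, workers = 100, 16, 4
shards = [total // workers] * workers  # 25 samples per worker (assumed even split)

print(declared_batches(total, batch_size))  # 6
print(yielded_batches(shards, batch_size))  # 4
```

If the two numbers differ for the real file layout, the declared length is higher than what the loader yields, which is the situation described in the linked issue.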

Is there a better way to do this?
Perhaps use normal Dataset for each 4GiB chunk and then some sort of ChainedDataset or something like that? Or will that still pre-read all of it into memory?
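
On the memory question: a map-style dataset does not have to load a chunk up front. A minimal sketch, assuming every sample is a fixed-size record as described above; `ChunkedRecordDataset` is a hypothetical class that returns raw bytes and leaves decoding into tensors out. It implements only `__len__` and `__getitem__`, which is the protocol a map-style dataset needs. PyTorch's `torch.utils.data.ConcatDataset` does similar index bookkeeping across several map-style datasets, and `ChainDataset` chains iterable-style ones.

```python
import bisect
import os


class ChunkedRecordDataset:
    """Map-style access over fixed-size records spread across many files.

    Only the requested record is read from disk; nothing is preloaded.
    """

    def __init__(self, paths, record_size):
        self.paths = list(paths)
        self.record_size = record_size
        # cumulative[i] holds the number of records in files 0..i inclusive.
        self.cumulative = []
        total = 0
        for path in self.paths:
            total += os.path.getsize(path) // record_size
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the file that contains this global index.
        file_idx = bisect.bisect_right(self.cumulative, index)
        start = self.cumulative[file_idx - 1] if file_idx > 0 else 0
        offset = (index - start) * self.record_size
        with open(self.paths[file_idx], "rb") as f:
            f.seek(offset)
            return f.read(self.record_size)
```

Because the length is exact and batching happens over indices, progress reporting works without a hand-declared length. The trade-off is random access on disk instead of sequential reads, which may or may not be acceptable for a few hundred GiB.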

jmr · May 1, 2023, 5:24pm

Seems that setting check_val_every_n_epoch=None on the trainer causes validation to run each epoch and allows having Iterable datasets that declare length.
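
For reference, a fragment of the trainer setup from the first post with that one argument added. Everything except `check_val_every_n_epoch=None` is copied from the original snippet (`pl`, `callbacks`, `accumulate_grad_batches` and `tb_logger` are defined there), and the effect is what was observed in this thread, not something verified independently.

```python
trainer = pl.Trainer(
    accelerator="gpu",
    max_epochs=2000,
    callbacks=callbacks,
    accumulate_grad_batches=accumulate_grad_batches,
    logger=tb_logger,
    # Reported above to make validation run each epoch even though the
    # iterable dataset declares a length.
    check_val_every_n_epoch=None,
)
```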
