Steps vs Iterations in Training

I’m a little confused about the difference between steps and iterations.

I’m training a large model on a big dataset, and I want to arbitrarily define an epoch as training on 2048 batches. This will allow me to get frequent checkpoints that I can test.

I wrote this code to achieve that:

checkpoint_callback = pl.callbacks.ModelCheckpoint(...)

trainer = pl.Trainer(limit_train_batches=2048,
                     precision=16,
                     callbacks=[checkpoint_callback])
trainer.fit(model, train_dataloaders=train_loader.loader, val_dataloaders=val_loader.loader)

The dataloader uses a batch size of 768, so one epoch is 2048 * 768 ≈ 1.5M images. I estimated this would take about 3 hours to train one epoch.
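For anyone checking the arithmetic, here's the per-epoch image count (the 3-hour estimate then just comes from my own measured throughput, which is specific to my hardware):

```python
# Sanity-check the per-epoch image count for a capped "epoch".
batches_per_epoch = 2048   # value passed as limit_train_batches
batch_size = 768           # dataloader batch size

images_per_epoch = batches_per_epoch * batch_size
print(images_per_epoch)    # 1572864, i.e. ~1.5M images
```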

However, the progress bar was counting iterations, and the epoch ran far longer than expected (~9 hours). It completed 23871 iterations before ending the epoch, and the next epoch had a different iteration count (~31K)!

When I view the checkpoint, the filename says steps=2048, but I can't see any logical relationship between steps and iterations.

If someone could explain the difference and why I’m seeing this behaviour I would be really grateful.

In case anyone else comes across this: the issue appeared to be that no limit_val_batches was passed to the trainer, and our validation dataset was huge.

Our code spent more time evaluating than training.
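Working backwards from the numbers in the question makes this concrete (this assumes the progress bar counts both training and validation batches within an epoch, which matches Lightning's behaviour; "steps" in the checkpoint filename counts only optimizer/training steps):

```python
total_iterations = 23871   # what the progress bar showed for one "epoch"
train_batches = 2048       # limit_train_batches, the only batches that count as "steps"

val_batches = total_iterations - train_batches
print(val_batches)         # 21823 validation batches, ~10x the training work
```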

Once I added limit_val_batches=64, the steps issue was resolved.
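For completeness, this is roughly what the fixed trainer setup looks like. This is a sketch, not our exact code: limit_train_batches, limit_val_batches, precision, and callbacks are standard Trainer arguments, while model, train_loader, and val_loader are placeholders for our own objects.

```python
import pytorch_lightning as pl

checkpoint_callback = pl.callbacks.ModelCheckpoint(...)

trainer = pl.Trainer(limit_train_batches=2048,  # cap training at 2048 batches per "epoch"
                     limit_val_batches=64,      # cap validation so it no longer dominates
                     precision=16,
                     callbacks=[checkpoint_callback])
trainer.fit(model, train_dataloaders=train_loader.loader, val_dataloaders=val_loader.loader)
```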