Steps vs Iterations in Training

I’m a little confused about the difference between steps and iterations.

I’m training a large model on a big dataset, and I want to arbitrarily define an epoch as 2048 training batches. That way I get frequent checkpoints that I can evaluate.

I wrote this code to achieve that:

import pytorch_lightning as pl

# Keep only the two best checkpoints, ranked by validation loss
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    save_top_k=2,
    monitor="val_loss",
    mode="min",
)

trainer = pl.Trainer(limit_train_batches=2048,  # cap each epoch at 2048 training batches
                     accelerator='gpu',
                     devices=4,
                     strategy='dp',
                     max_epochs=100,
                     log_every_n_steps=5,
                     callbacks=[checkpoint_callback],
                     precision=16)

trainer.fit(model, train_dataloaders=train_loader.loader, val_dataloaders=val_loader.loader)

The dataloader uses a batch size of 768, so one epoch is 2048 * 768 ≈ 1.5M images. I estimated one epoch would take about 3 hours to train.
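As a quick sanity check of that arithmetic:

batches_per_epoch = 2048  # value passed to limit_train_batches
batch_size = 768          # dataloader batch size
print(batches_per_epoch * batch_size)  # 1572864, i.e. roughly 1.5M images per epoch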

However, the progress bar counted iterations rather than steps, and the epoch ran far longer than expected (~9 hours), completing 23871 iterations before it ended. The next epoch ran for a different number of iterations (~31K)!

When I view the checkpoint, the filename says steps=2048, but I can’t see any logical pattern relating steps to iterations.

If someone could explain the difference and why I’m seeing this behaviour I would be really grateful.

In case anyone comes across this: the issue turned out to be that no limit_val_batches was passed to the trainer, and our validation dataset was huge.

The progress bar counts every batch processed in the epoch, validation batches included, while the step counter in the checkpoint filename only counts optimizer (training) steps. Our code was spending far more time evaluating than training, which is why the iteration count dwarfed the 2048 steps.

Once I added limit_val_batches=64, the steps issue was resolved.
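
For reference, here is a minimal sketch of the corrected Trainer call (same callback and loaders as above; limit_val_batches is the only change):

trainer = pl.Trainer(limit_train_batches=2048,  # cap each epoch at 2048 training batches
                     limit_val_batches=64,      # cap each validation run at 64 batches
                     accelerator='gpu',
                     devices=4,
                     strategy='dp',
                     max_epochs=100,
                     log_every_n_steps=5,
                     callbacks=[checkpoint_callback],
                     precision=16)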