I’m a little confused about the difference between steps and iterations.
I’m training a large model on a big dataset, and I want to arbitrarily define an epoch as training on 2048 batches. This will allow me to get frequent checkpoints that I can test.
I wrote this code to achieve that:
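    import pytorch_lightning as pl

    # Keep the two checkpoints with the lowest validation loss.
    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        save_top_k=2,
        monitor="val_loss",
        mode="min",
    )

    # limit_train_batches=2048 is meant to cap each "epoch" at 2048 training batches.
    trainer = pl.Trainer(
        limit_train_batches=2048,
        accelerator='gpu',
        devices=4,
        strategy='dp',
        max_epochs=100,
        log_every_n_steps=5,
        callbacks=[checkpoint_callback],
        precision=16,
    )

    # model, train_loader and val_loader are defined elsewhere in my code.
    trainer.fit(
        model,
        train_dataloaders=train_loader.loader,
        val_dataloaders=val_loader.loader,
    )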
The dataloader uses a batch size of 768, so 2048 * 768 ~= 1.5M images. I estimated this would take 3 hours to train 1 epoch.
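For reference, here is a quick sketch of the arithmetic behind that estimate. The variable names are only illustrative, and the per-batch time is simply what my 3-hour estimate implies, not something I measured:

```python
# Values taken from the numbers above; nothing here is measured.
batches_per_epoch = 2048   # limit_train_batches
batch_size = 768           # dataloader batch size
estimated_hours = 3.0      # my estimate for one epoch

# Images seen in one "epoch" of 2048 batches.
images_per_epoch = batches_per_epoch * batch_size

# Per-batch time implied by the 3-hour estimate.
seconds_per_batch = estimated_hours * 3600 / batches_per_epoch

print(images_per_epoch)             # 1572864
print(round(seconds_per_batch, 2))  # 5.27
```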
However, the progress bar was counting iterations, and the epoch ran for a very long time (~9 hours). In the end it completed 23871 iterations before the epoch ended. The next epoch had a different number of iterations (31K)!
When I view the checkpoint, the filename says steps=2048, but I can’t see any logical pattern between steps and iterations.
If someone could explain the difference and why I’m seeing this behaviour I would be really grateful.