I’m a little confused about the difference between steps and iterations.
I’m training a large model on a big dataset, and I want to arbitrarily define an epoch as training on 2048 batches. This will allow me to get frequent checkpoints that I can test.
I wrote this code to achieve that:
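import pytorch_lightning as pl

# Keep the two checkpoints with the lowest validation loss.
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    save_top_k=2,
    monitor="val_loss",
    mode="min",
)

trainer = pl.Trainer(
    limit_train_batches=2048,  # cap each epoch at 2048 training batches
    accelerator='gpu',
    devices=4,
    strategy='dp',
    max_epochs=100,
    log_every_n_steps=5,
    callbacks=[checkpoint_callback],
    precision=16,
)

# model, train_loader and val_loader are defined elsewhere in my script.
trainer.fit(model, train_dataloaders=train_loader.loader, val_dataloaders=val_loader.loader)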
The dataloader uses a batch size of 768, so 2048 * 768 ≈ 1.57M images per epoch. I estimated that one epoch would take about 3 hours to train.
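For reference, the arithmetic behind that estimate can be sketched as below. The variable names are just for illustration, and the 3-hour figure is my own rough estimate rather than a measurement.

```python
# Values taken from the Trainer config and dataloader described above.
batches_per_epoch = 2048  # limit_train_batches
batch_size = 768          # dataloader batch size
estimated_hours = 3       # rough estimate for one epoch, not a measurement

# Number of images seen in one (arbitrarily defined) epoch.
images_per_epoch = batches_per_epoch * batch_size

# Throughput that the 3-hour estimate would imply, in images per second.
implied_images_per_sec = images_per_epoch / (estimated_hours * 3600)

print(f"images per epoch: {images_per_epoch:,}")  # images per epoch: 1,572,864
print(f"implied throughput: {implied_images_per_sec:.1f} images/s")
```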
However, the progress bar was counting iterations, and the epoch went on for a very long time (~9 hours). In the end it completed 23871 iterations before ending the epoch. The next epoch had a different number of iterations (31K)!
When I view the checkpoint, the filename says steps=2048, but I can’t see any logical pattern between steps and iterations.
If someone could explain the difference and why I’m seeing this behaviour I would be really grateful.