Is there a recommended way to save a model mid-epoch with Fabric when training on multiple nodes/devices (i.e. save every n training steps instead of only at the end of an epoch)? I'm currently trying the following:
```python
if fabric.global_rank == 0:
    if num_steps % 100 == 0:
        state = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "step": num_steps,
        }
        fabric.save(checkpoint_path, state)
```
However, training seems to hang right after the checkpoint is saved. Am I doing something wrong here?
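From the Fabric docs I suspect `fabric.save()` is a collective call that every process has to make, so the `global_rank == 0` guard may be what leaves the other ranks stuck. Below is a self-contained sketch of what I think the loop should look like instead. The toy model/data, the `strategy`/`devices` arguments, and the checkpoint path are just placeholders to make it runnable; is this the intended pattern?

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

# Toy stand-ins for my real model and data, only so the example runs end to end.
model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(4096, 32), torch.randn(4096, 1))

# Two CPU processes just to exercise the distributed code path.
fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
fabric.launch()

model, optimizer = fabric.setup(model, optimizer)
dataloader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=8))

checkpoint_path = "checkpoint.ckpt"
num_steps = 0

for epoch in range(2):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        fabric.backward(loss)
        optimizer.step()
        num_steps += 1

        # Mid-epoch checkpoint: every rank reaches fabric.save() here, on the
        # assumption that it is a collective call and Fabric decides internally
        # which process actually writes the file (no global_rank guard).
        if num_steps % 100 == 0:
            state = {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch,
                "step": num_steps,
            }
            fabric.save(checkpoint_path, state)
```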