Error loading a saved checkpoint

I saved a model during training, but I am facing issues when trying to load it back from the saved checkpoint. The model class is CSLRModel. When I call CSLRModel.load_from_checkpoint(ckpt_path, **kwargs), I get the following error:

return CSLRModel.load_from_checkpoint(ckpt_path, encoder_seq, config, config.num_classes_gloss)
  File "/data/envs/ohdev/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 137, in load_from_checkpoint
    return _load_from_checkpoint(
  File "/data/envs/ohdev/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 158, in _load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=map_location)
  File "/data/envs/ohdev/lib/python3.8/site-packages/lightning_lite/utilities/cloud_io.py", line 48, in _load
    return torch.load(f, map_location=map_location)
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/usr/lib/python3.8/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.8/pickle.py", line 1253, in load_binpersid
    self.append(self.persistent_load(pid))
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/serialization.py", line 1083, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/serialization.py", line 1058, in restore_location
    result = map_location(storage, location)
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/envs/ohdev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 246, in _forward_unimplemented
    raise NotImplementedError(f"Module [{type(self).__name__}] is missing the required \"forward\" function")
NotImplementedError: Module [ModuleList] is missing the required "forward" function

I am not sure what I am doing wrong here. The CSLRModel class does have a forward function, and I was able to use it to train and save checkpoints successfully. It is only when loading from a checkpoint that I run into this error.

Can you post your model definition? Otherwise we cannot really help you.

import numpy as np
import pytorch_lightning as pl
import torch.nn as nn

class CSLRModel(pl.LightningModule):
  def __init__(self, encoder_seq, config, num_classes_gloss):
    super().__init__()
    self.save_hyperparameters()
    self.epoch_no = 0
    ## path hardcoded. Resolve
    gloss_dict = np.load('/data/cslr_datasets/PHOENIX-2014/phoenix2014-release/phoenix-2014-multisigner/preprocess/gloss_dict.npy', allow_pickle=True).item()
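    ## Decode is a project-specific decoder class (definition not shown in this post)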
    self.decoder = Decode(gloss_dict, num_classes_gloss, 'beam')
    self.config = config
    self.encoder_seq = encoder_seq
    self.num_encoders = len(encoder_seq)
    self.num_classes_gloss = num_classes_gloss
    self.classifiers = nn.ModuleDict() ## to use features of an encoder to classify
    for enc in self.encoder_seq:
      if enc.use_ctc:
        self.classifiers[f'{enc.encoder_id}'] = nn.Linear(enc.out_size, self.num_classes_gloss)
    self.externals = {}
    self.initialize_losses(config.losses)

  def forward(self, x, len_x, label, len_label, is_training=True):
    for i, enc in enumerate(self.encoder_seq):
      x, len_x, internal_losses = enc(x, len_x)
      self.loss_value.update(internal_losses)
      
      if enc.use_ctc:
        logits = self.classifiers[f'{enc.encoder_id}'](x)
        self.externals[f'encoder{i+1}.logits'] = logits
        self.loss_value[f'{enc.encoder_id}.CTCLoss'] = self.loss_fn['CTC'](
            logits.transpose(0, 1).log_softmax(-1),
            label.cpu().int(),
            len_x.cpu().int(),
            len_label.cpu().int()).mean()
    
    return self.compute_external_losses(), logits, len_x
  

I am pasting only the __init__() and forward() functions of the model definition; including the rest would make the post too long.
Note: I was able to load the model from the checkpoint when I used self.save_hyperparameters() in __init__(). Is this call required?

Hey, it makes a ton of sense now.

Here is how load_from_checkpoint works internally:
1.) We instantiate the class (CSLRModel) with the necessary init arguments
2.) We load the state dict into the class instance

For 1.) we need to get the init arguments from somewhere. There are two options for this.

  • save_hyperparameters() just serializes the init arguments so that you don't need to do anything else (at least for easily serializable values; it is different if you pass an nn.Module)
  • You pass them in as keyword arguments to load_from_checkpoint

Since you didn't do either of these, it makes sense that you couldn't load the model. A sketch of both options follows below. Hope that makes it a bit clearer!
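For illustration, here is a minimal sketch of both options; MyModel, its init arguments, and the checkpoint path are all made up for this example:

import pytorch_lightning as pl
import torch.nn as nn

class MyModel(pl.LightningModule):
  def __init__(self, hidden_size, num_classes):
    super().__init__()
    ## Option 1: save_hyperparameters() stores the init arguments in the
    ## checkpoint, so load_from_checkpoint can re-create the model itself.
    self.save_hyperparameters()
    self.layer = nn.Linear(hidden_size, num_classes)

  def forward(self, x):
    return self.layer(x)

## Option 1: the init arguments are restored from the checkpoint automatically.
model = MyModel.load_from_checkpoint('example.ckpt')

## Option 2: supply the init arguments yourself, as keyword arguments; this is
## required when save_hyperparameters() was not called during training, and any
## kwargs given here override the stored hyperparameters.
model = MyModel.load_from_checkpoint('example.ckpt', hidden_size=128, num_classes=10)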

Cheers,
Justus

Yeah, I had read about these two options in the docs, and I tried both. My query was as follows:
1] Approach 1: Use self.save_hyperparameters() in the model's __init__() and pass only ckpt_path when loading at test time, i.e., CSLRModel.load_from_checkpoint(ckpt_path). This one works for me, but I would rather avoid it since the self.save_hyperparameters() call takes noticeably long to execute.

2] Approach 2: Do not use self.save_hyperparameters(), and load at test time as follows: CSLRModel.load_from_checkpoint(ckpt_path, encoder_seq, config, num_classes_gloss), i.e., pass both the saved checkpoint path and the hyperparameters. However, with this approach I get the above-mentioned error:

NotImplementedError: Module [ModuleList] is missing the required "forward" function
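Looking at the traceback, the positional call in Approach 2 is a likely culprit: in the Lightning version shown there, the signature is roughly load_from_checkpoint(checkpoint_path, map_location=None, hparams_file=None, strict=True, **kwargs), so a positional encoder_seq is taken as map_location, and torch.load then invokes the ModuleList as map_location(storage, location), which is exactly the missing-forward error in the traceback. Here is a minimal sketch of Approach 2 with keyword arguments instead (argument names taken from the __init__() posted above):

model = CSLRModel.load_from_checkpoint(
    ckpt_path,
    encoder_seq=encoder_seq,
    config=config,
    num_classes_gloss=num_classes_gloss,
)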