Saving a LightningModule without a Trainer

Hi all,
I have a fairly specific use case. I am experimenting with neural architecture search, and am trying to convert my code from vanilla PyTorch to PL. My algorithm goes something like this:

  • Generate a neural net’s Blueprint in the form of a graph,
  • Instantiate a Model class object by translating the Blueprint into a sequence of nn.Module components wrapped in a LightningModule (roughly sketched below).
  • Train the Model, using checkpointing to save either the best-seen model (with or without early stopping) or the model at the end of the last epoch, depending on the stage of the algorithm where this training takes place.
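
For reference, the Model wrapper looks roughly like this (heavily simplified; node.to_module() here is just a stand-in for the actual Blueprint-to-module translation):

import torch.nn as nn
import pytorch_lightning as pl

class Model(pl.LightningModule):
    def __init__(self, ntw, data_provider, hparams):
        super().__init__()
        self.save_hyperparameters(hparams)
        self.ntw = ntw
        self.data_provider = data_provider
        # translate each Blueprint node into an nn.Module (stand-in call)
        self.layers = nn.ModuleList(node.to_module() for node in ntw)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x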

I already implemented all of the above in PL, which allowed me to implement the DDP training strategy with multiple GPUs. Now, occasionally, I will take an existing Blueprint and apply random mutations: for instance, change a layer’s operation or its hyperparameters. When this happens to a Blueprint whose Model has already been trained, I want to retain the trained weights in the unmodified layers. For instance, if the model has four layers and only layer 2 is mutated, I want to keep the trained weights in layers 1, 3, and 4.

To do this, I instantiate the old and new Models, and compare their layers. If they are identically defined, I copy the weights over. The resulting model has pretrained weights in most layers, and random ones in the modified layers. I would now like to save this model to train it later, starting from these inherited weights.
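
For context, copy_parameters and random_init are along these lines (simplified sketch, assuming the layers live in an nn.ModuleList called layers):

import torch.nn as nn

def copy_parameters(old_model, new_model, layer_i):
    # copy all trained weights of one layer into the new model
    new_model.layers[layer_i].load_state_dict(
        old_model.layers[layer_i].state_dict()
    )

def random_init(model, layer_i):
    # re-initialize a mutated layer; the real init scheme depends on the layer type
    for p in model.layers[layer_i].parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
        else:
            nn.init.zeros_(p)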

This is where I start struggling, because as far as I can tell, saving a model’s weights with PL can only be done through a Trainer, which itself is not linked to a model until fit/test/predict is called. But as you can see, at that point, I haven’t yet trained the model. Also, with DDP, instantiating a Trainer and running a fit/test/predict step is quite slow, so I don’t want to do that for hundreds of models.

What I would like is something like this:

During training (this is taken from a custom Evaluator class):

checkpoint_cb = ModelCheckpoint(
    filename=ntw.filename,
    dirpath=to_path,
    mode="max",
    monitor="val_acc",
    every_n_epochs=1,
    save_last=not self.interim_checkpoints,
    save_weights_only=True,
    verbose=True,
)
checkpoint_cb.FILE_EXTENSION = ""
checkpoint_cb.CHECKPOINT_NAME_LAST = ntw.filename

callbacks = [checkpoint_cb]

if self.early_stop:
    early_stop_cb = EarlyStopping(
        monitor="val_acc",
        min_delta=self.hparams["thresh"],
        patience=self.hparams["patience"],
        mode="max",
        log_rank_zero_only=True,
    )
    callbacks.append(early_stop_cb)

if pretrained:
    model = Model.load_from_checkpoint(
        os.path.join(from_path, ntw.filename),
        ntw=ntw,
        data_provider=self.data_provider,
        hparams=self.hparams,
    )
else:
    model = Model(ntw, self.data_provider, hparams=self.hparams)
    model.random_init()

trainer = Trainer(
    default_root_dir=self.to_path,
    accelerator='gpu',
    strategy='ddp',
    callbacks=callbacks,
    max_epochs=max_epochs,
    gradient_clip_val=self.hparams["grad_norm_clip"],
    gradient_clip_algorithm="norm",
    logger=self.logger,
    check_val_every_n_epoch=1,
    num_sanity_val_steps=0,
    enable_model_summary=False,
)

trainer.fit(model)
[...]

And during mutation:

# old_model has already been trained:
old_model = Model.load_from_checkpoint(
        os.path.join(from_path, ntw.filename),
        ntw=old_ntw,
        data_provider=self.data_provider,
        hparams=self.hparams,
)
# Create a new Model from the mutated blueprint
new_model = Model(
        ntw=new_ntw,
        data_provider=self.data_provider,
        hparams=self.hparams,
)
for layer_i in range(len(new_ntw)):
    if new_ntw[layer_i] == old_ntw[layer_i]:
        copy_parameters(old_model, new_model, layer_i)
    else:
        random_init(new_model, layer_i)

< save new_model weights >

I cannot figure out how to perform the “save new_model weights” step in a format that is identical to that produced by ModelCheckpoint and compatible with LightningModule.load_from_checkpoint(). Possible solutions:

  • Since I don’t really care about the model’s or the trainer’s states, I could override ModelCheckpoint to only keep the PyTorch modules’ state_dict (rough sketch below), but how will that play with DDP?
  • Keep using PL’s mechanics, and extend them to the ‘save’ step by instantiating a Trainer, running a validation step (for instance) to attach a model to it, and then calling trainer.save_checkpoint(), but as I said above, I fear this will be slow if I specify DDP as the trainer’s strategy. And if I don’t, I’m not sure what will happen when I subsequently load this model to train it with DDP.
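
For the first option, what I have in mind is roughly this (untested, and I’m not sure exactly which keys load_from_checkpoint strictly requires; that probably depends on the PL version):

import os
import torch
import pytorch_lightning as pl

checkpoint = {
    "state_dict": new_model.state_dict(),
    # hparams and version keys included in case load_from_checkpoint expects them
    "hyper_parameters": dict(new_model.hparams),
    "pytorch-lightning_version": pl.__version__,
}
torch.save(checkpoint, os.path.join(to_path, ntw.filename))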

I would appreciate any ideas!
Thanks in advance

We at Darts (GitHub - unit8co/darts: a Python library for user-friendly forecasting and anomaly detection on time series) are also interested in this, specifically in this issue.

Is there a way to connect a model to the trainer without having to call fit/validate/test/predict, to be able to use Trainer.save_checkpoint()?

How about this (pseudo code):


# old_model has already been trained:
old_model = Model.load_from_checkpoint(
        os.path.join(from_path, ntw.filename),
        ntw=old_ntw,
        data_provider=self.data_provider,
        hparams=self.hparams,
)
# Create a new Model from the mutated blueprint
new_model = Model(
        ntw=new_ntw,
        data_provider=self.data_provider,
        hparams=self.hparams,
)
for layer_i in range(len(new_ntw)):
    if new_ntw[layer_i] == old_ntw[layer_i]:
        copy_parameters(old_model, new_model, layer_i)
    else:
        random_init(new_model, layer_i)


# load raw checkpoint of old file
mutated_checkpoint = torch.load(os.path.join(from_path, ntw.filename))
mutated_checkpoint["state_dict"] = new_model.state_dict()
# this will overwrite the old file; if that's not desired, add e.g. a suffix "_new" to the filename
torch.save(mutated_checkpoint, os.path.join(from_path, ntw.filename))
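
Since the old checkpoint was written with save_weights_only=True, there are no optimizer states in it that could go stale; only "state_dict" is swapped out, while the hyperparameters and the rest of the Lightning metadata stay intact, so Model.load_from_checkpoint should accept the mutated file just like the original one.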

A hacky way would be to do:

    trainer = Trainer(...)
    trainer.strategy.connect(model)
    trainer.save_checkpoint(filename)

This just attaches the model so that the save_checkpoint function can access the model and save it.
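
Putting it together, something along these lines should be enough for the save step (sketch only, reusing the names from the original post; no GPUs/DDP are needed just to write the file, and the exact Trainer arguments barely matter here):

    import os
    from pytorch_lightning import Trainer

    # lightweight Trainer used only for saving; fit/validate is never called
    trainer = Trainer(accelerator="cpu", devices=1, logger=False, enable_checkpointing=False)
    trainer.strategy.connect(new_model)
    trainer.save_checkpoint(os.path.join(to_path, ntw.filename), weights_only=True)

    # later, the file loads like any other Lightning checkpoint
    model = Model.load_from_checkpoint(
        os.path.join(to_path, ntw.filename),
        ntw=new_ntw,
        data_provider=data_provider,
        hparams=hparams,
    )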
