I’m a PL + wandb user, and I resume training from a checkpoint like this:
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# checkpoint format:
# 'exp/h31qedvw/checkpoints/model-01841-val_loss=0.31116.ckpt'
# -> the wandb run ID is the second path component ('h31qedvw' here)
wandb_id = args.checkpoint.split('/')[1]
logger = WandbLogger(project='wav2lip-syncnet', id=wandb_id, resume='must')
trainer = pl.Trainer(max_steps=hparams.nsteps,
                     devices=1, accelerator="gpu",
                     logger=logger, log_every_n_steps=1,
                     callbacks=[checkpoint_callback])
trainer.fit(model, ckpt_path=args.checkpoint)
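For reference, checkpoint_callback is a plain ModelCheckpoint that keeps the top-k checkpoints by val_loss, roughly like this (simplified; the save_top_k value is illustrative):

from pytorch_lightning.callbacks import ModelCheckpoint

# Keeps only the k best checkpoints by validation loss, which is why the
# checkpoint I resume from (epoch 1841) predates where the run stopped.
checkpoint_callback = ModelCheckpoint(
    dirpath='exp/h31qedvw/checkpoints',
    monitor='val_loss',
    mode='min',
    save_top_k=3,  # illustrative
)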
Resuming training seems to work, and I have manually modified some hparams like the lr.
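For example, one way to apply the lr override on resume (a sketch, not necessarily exactly what I did; the value is a placeholder). Since ckpt_path restores the optimizer state, the new lr is set directly on the optimizer's param groups rather than only on model.hparams:

class LrOverride(pl.Callback):
    # Runs after Lightning has restored the optimizer state from the
    # checkpoint, so the new lr is not clobbered by the restored param groups.
    def on_train_start(self, trainer, pl_module):
        for opt in trainer.optimizers:
            for group in opt.param_groups:
                group['lr'] = 1e-5  # placeholder value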
Note that I stopped my last run at epoch 1900, while the checkpoint I'm loading is from epoch 1841, because I save checkpoints by top-k metric. When I look at my logs on wandb, it seems I cannot overwrite the previous logs from epochs 1841-1900. Is there any way to do that?