Hi,
I am using the following code snippet to train UNet on the Intel Gaudi accelerator. I would like to execute some additional lines of code before and after each training/validation epoch, but I am not sure which file I need to modify. I believe I need to modify /usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py, but I am not sure exactly where to make the change.

from lightning.pytorch import Trainer

trainer = Trainer(
    logger=False,
    profiler=prof,
    precision="bf16-mixed" if args.amp else "32-true",
    devices=args.hpus if args.hpus else None,
    accelerator=HPUAccelerator() if args.hpus else None,
    benchmark=True,
    deterministic=False,
    min_epochs=args.min_epochs,
    max_epochs=args.max_epochs,
    sync_batchnorm=args.sync_batchnorm,
    gradient_clip_val=args.gradient_clip_val,
    callbacks=callbacks,
    num_sanity_val_steps=0,
    default_root_dir=args.results,
    enable_checkpointing=args.save_ckpt,
    strategy=(
        HPUParallelStrategy(
            parallel_devices=parallel_hpus,
            bucket_cap_mb=args.bucket_cap_mb,
            gradient_as_bucket_view=True,
            static_graph=True,
        )
        if args.hpus > 1
        else SingleHPUStrategy() if args.hpus == 1 else None
    ),
    limit_train_batches=1.0 if args.train_batches == 0 else args.train_batches,
    limit_val_batches=1.0 if args.test_batches == 0 else args.test_batches,
    limit_test_batches=1.0 if args.test_batches == 0 else args.test_batches,
)
trainer.fit(model, train_dataloaders=train_dl)

The main implementation can be found here.
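
For reference, one alternative to editing fit_loop.py that I am considering is a custom callback added to the callbacks list passed to the Trainer above. This is only a minimal sketch, assuming the standard Lightning Callback epoch hooks; the class name EpochHooks and its events list are hypothetical placeholders for the extra code I want to run, and I have not tested it on HPU.

```python
try:
    from lightning.pytorch.callbacks import Callback
except ImportError:
    # Fallback so the sketch can still be run without Lightning installed.
    Callback = object


class EpochHooks(Callback):
    """Hypothetical callback: runs extra code around each train/validation epoch."""

    def __init__(self):
        super().__init__()
        # Illustrative record of which hooks fired; replace with the real extra code.
        self.events = []

    def on_train_epoch_start(self, trainer, pl_module):
        # Called before each training epoch.
        self.events.append("train_start")

    def on_train_epoch_end(self, trainer, pl_module):
        # Called after each training epoch.
        self.events.append("train_end")

    def on_validation_epoch_start(self, trainer, pl_module):
        # Called before each validation epoch.
        self.events.append("val_start")

    def on_validation_epoch_end(self, trainer, pl_module):
        # Called after each validation epoch.
        self.events.append("val_end")
```

If this is a reasonable approach, I would presumably append EpochHooks() to callbacks before constructing the Trainer, which would avoid touching the installed Lightning package.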