Hello everyone,
I’m currently facing an issue while training a model on a TPU v3-8 on Google Cloud using Lightning. Although I believe I have passed the correct parameters to pl.Trainer, it appears that the TPUs might not be utilized as intended.
I’m seeing the following messages in the TensorFlow logs:
2023-07-24 20:06:33.344628: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2023-07-24 20:06:33.344710: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
My pl.Trainer configuration is as follows:
trainer = pl.Trainer(
    accelerator="tpu",
    tpu_cores=8,
    max_epochs=args.epochs,
    sync_batchnorm=args.sync_bn,
    default_root_dir=args.output_path,
    log_every_n_steps=1,
    num_sanity_val_steps=args.num_sanity_val_steps,
    callbacks=[LearningRateMonitor()],
    logger=wandb_logger,
)
trainer.fit(model, datamodule=data)
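For reference, I’m aware that newer Lightning releases (2.x) removed the tpu_cores argument in favor of devices, so in case the argument spelling matters here, this is the form I understand the newer API expects (just a sketch of the relevant arguments, not something I’ve confirmed fixes the issue):

import pytorch_lightning as pl  # in 2.x this can also be: import lightning.pytorch as pl

# Lightning >= 2.0 dropped tpu_cores; devices selects the 8 TPU cores instead.
trainer = pl.Trainer(
    accelerator="tpu",
    devices=8,
)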
However, CPU usage stays near 100% during training, which makes me suspect the work is running on the host CPU rather than on the TPU cores. I’d greatly appreciate any help in determining whether the TPUs are actually being used.
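For what it’s worth, this is the minimal check I had in mind to see whether tensors actually land on an XLA device (a sketch, assuming torch_xla is installed, as it should be on a TPU VM):

import torch
import torch_xla.core.xla_model as xm

# Ask torch_xla for the default XLA device; on a working TPU VM this
# should be an XLA device (e.g. xla:0), not the CPU.
device = xm.xla_device()
print("XLA device:", device)

# Move a small tensor there and confirm the placement sticks.
t = torch.randn(2, 2, device=device)
print("tensor device:", t.device)

If I understand correctly, self.device inside training_step should likewise report an xla device when the Trainer is really driving the TPU, but I’d appreciate confirmation that this is the right way to verify it.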
Thank you in advance for your help and suggestions!