Troubleshooting TPU Usage: Are TPUs Properly Running on Google Cloud?

Hello everyone,

I’m currently facing an issue while training a model on a TPU v3-8 on Google Cloud using Lightning. Although I believe I’ve set the correct parameters in the pl.Trainer, it appears the TPUs may not be utilized as intended.

I’ve encountered the following error logs from TensorFlow:

2023-07-24 20:06:33.344628: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2023-07-24 20:06:33.344710: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey

My pl.Trainer configuration is as follows:

trainer = pl.Trainer(
    accelerator="tpu",
    tpu_cores=8,
    max_epochs=args.epochs,
    sync_batchnorm=args.sync_bn,
    default_root_dir=args.output_path,
    log_every_n_steps=1,
    num_sanity_val_steps=args.num_sanity_val_steps,
    callbacks=[LearningRateMonitor()],
    logger=wandb_logger
)
trainer.fit(model, datamodule=data)

However, CPU usage remains at almost 100%, which suggests the TPUs may not be doing the work. I’d greatly appreciate any help in determining whether the TPUs are actually being utilized.

Thank you in advance for your help and suggestions!

@phillipecardenuto

Given the arguments you use, it seems this is Lightning ~1.5.
I think you will have better luck with a more recent version of Lightning; TPU support has steadily improved in later releases.
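For reference, in recent Lightning releases (2.x) the `tpu_cores` argument was removed and TPU devices are selected through `accelerator`/`devices` instead. A minimal sketch of the equivalent configuration (your other arguments carry over unchanged; untested here, since it needs a TPU VM):

```python
import pytorch_lightning as pl

# Lightning >= 2.0: tpu_cores is gone; select the TPU via
# accelerator/devices. devices=8 targets all cores of a v3-8,
# or use devices="auto" to let Lightning pick them up.
trainer = pl.Trainer(
    accelerator="tpu",
    devices=8,
    max_epochs=10,  # placeholder; use your args.epochs etc. here
)
```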

Regarding monitoring utilization, I’m not sure what the best way is. A quick search gave me this answer: How to monitor the TPU utilization and memory usage when training? · Issue #803 · pytorch/xla · GitHub
You might also find a dashboard of your VM’s metrics in the Google Cloud console.
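One quick sanity check that work is actually landing on the XLA devices is the built-in metrics report from torch_xla. A sketch, assuming you’re on a TPU VM with torch_xla installed (not runnable without one):

```python
# Print the XLA metrics report after (or during) a training run.
# If the TPU is really executing graphs, you should see counters
# such as CompileTime and ExecuteTime; if they are absent, the
# model is most likely still running on CPU.
import torch_xla.debug.metrics as met

print(met.metrics_report())
```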