Troubleshooting TPU Usage: Are TPUs Properly Running on Google Cloud?

Hello everyone,

I’m currently facing an issue while training a model on a TPU v3-8 on Google Cloud using Lightning. Although I believe I’ve set the correct parameters in the pl.Trainer, it appears the TPUs may not be utilized as intended.

I’ve encountered the following error logs from TensorFlow:

2023-07-24 20:06:33.344628: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2023-07-24 20:06:33.344710: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey

My pl.Trainer configuration is as follows:

trainer = pl.Trainer(
    accelerator="tpu",
    tpu_cores=8,
    max_epochs=args.epochs,
    sync_batchnorm=args.sync_bn,
    default_root_dir=args.output_path,
    log_every_n_steps=1,
    num_sanity_val_steps=args.num_sanity_val_steps,
    callbacks=[LearningRateMonitor()],
    logger=wandb_logger
)
trainer.fit(model, datamodule=data)

However, CPU usage remains at almost 100%, which suggests the TPUs may not be doing the work. I’d greatly appreciate any help in determining whether the TPUs are actually being utilized.

Thank you in advance for your help and suggestions!

@phillipecardenuto

Given the arguments you use, it seems this is Lightning ~1.5.
I think you will have better luck with a more recent version of Lightning; TPU support has steadily improved in later releases.
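For reference, in recent Lightning releases (2.x) the `tpu_cores` argument was removed and TPU devices are selected through `accelerator`/`devices` instead. A minimal sketch of the equivalent configuration (your other arguments carry over unchanged; untested here, since it needs a TPU VM):

```python
import pytorch_lightning as pl

# Lightning >= 2.0: tpu_cores is gone; select the TPU via
# accelerator/devices. devices=8 targets all cores of a v3-8,
# or use devices="auto" to let Lightning pick them up.
trainer = pl.Trainer(
    accelerator="tpu",
    devices=8,
    max_epochs=10,  # placeholder; use your args.epochs etc. here
)
```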

Regarding monitoring utilization, I’m not sure what the best way is. A quick search gave me this answer: How to monitor the TPU utilization and memory usage when training? · Issue #803 · pytorch/xla · GitHub
You might also find a dashboard of your VM’s metrics in the Google Cloud console.
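One quick sanity check that work is actually landing on the XLA devices is the built-in metrics report from torch_xla. A sketch, assuming you’re on a TPU VM with torch_xla installed (not runnable without one):

```python
# Print the XLA metrics report after (or during) a training run.
# If the TPU is really executing graphs, you should see counters
# such as CompileTime and ExecuteTime; if they are absent, the
# model is most likely still running on CPU.
import torch_xla.debug.metrics as met

print(met.metrics_report())
```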