Training is slow on GPU

I built a Temporal Fusion Transformer model with PyTorch Forecasting, following the tutorial here:
https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/stallion.html

I used my own data, a time series with 62k samples, and set training to run on the GPU by passing accelerator="gpu" to pl.Trainer. The issue is that training is quite slow given that the dataset is not that large.

I first ran training on my laptop GPU (a GTX 1650 Ti) and then on an A100 40GB, and got only about a 2x speedup. An A100 is many times faster than my laptop GPU, so the uplift should be much larger than 2x. NVIDIA drivers, cuDNN and the rest are installed (the A100 is on Google Cloud, which comes with all of that preinstalled). GPU utilisation is low (10-15%), but I can see that the data has been loaded into GPU memory.

Things I tried:

  • Tried small batch sizes (32) and large ones (8192)
  • Double checked training is done on the GPU (e.g. with the quick check after this list)
  • Set num_workers to 8 in the dataloaders.
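
For reference, a quick way to confirm that the model and the batches actually end up on the GPU (a minimal sketch; it assumes the tft model and train_dataloader defined further down):

import torch

# Assumes `tft` and `train_dataloader` from the snippets below.
print(torch.cuda.get_device_name(0))    # which GPU PyTorch sees
print(next(tft.parameters()).device)    # "cpu" before fit; Lightning moves it to cuda:0 during training

# One batch from the dataloader: x is a dict of tensors (they stay on the CPU
# until Lightning transfers them to the GPU each step).
x, y = next(iter(train_dataloader))
print({k: v.device for k, v in x.items() if torch.is_tensor(v)})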

Is there some other bottleneck in my model? Below are the results from the profiler and snippets of my model configuration.

Dataloaders

batch_size = 128  # example value; I tried everything from 32 up to 8192
train_dataloader = training.to_dataloader(
    train=True,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
val_dataloader = validation.to_dataloader(
    train=False,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
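
One quick check for a data-loading bottleneck (a sketch using the dataloader above, nothing model-specific): time iterating the training dataloader on its own. If this alone eats a large share of the epoch time, the GPU is being starved by CPU-side sampling and collation rather than by the model.

import time

# Time pure data loading: iterate the training dataloader without touching the model.
start = time.perf_counter()
n_batches = 0
for x, y in train_dataloader:
    n_batches += 1
    if n_batches == 100:  # ~100 batches is enough for a rough per-batch estimate
        break
elapsed = time.perf_counter() - start
print(f"{n_batches} batches in {elapsed:.1f} s -> {elapsed / n_batches * 1000:.1f} ms/batch")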

Model Configuration

early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"
)

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=max_epochs,
    accelerator="gpu",
    devices=-1,
    gradient_clip_val=0.1,
    limit_train_batches=1.0,  # 1.0 = train on the full training set every epoch
    callbacks=[lr_logger, early_stop_callback],
    enable_model_summary=True,
    profiler="simple",
)
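
For completeness, a mixed-precision variant of the Trainer above that I could still try (a sketch; the exact precision value depends on the Lightning version, e.g. precision=16 on older releases):

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=max_epochs,
    accelerator="gpu",
    devices=-1,
    precision="16-mixed",  # mixed precision; older Lightning versions use precision=16
    gradient_clip_val=0.1,
    limit_train_batches=1.0,
    callbacks=[lr_logger, early_stop_callback],
    enable_model_summary=True,
    profiler="simple",
)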

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.003,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    reduce_on_plateau_patience=4,
)
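
For scale, the network this produces is tiny (the stallion tutorial prints the parameter count the same way); with hidden_size=16 it is typically only a few tens of thousands of parameters, which may be too little work per step to keep a fast GPU busy:

# Size of the configured model; a very small network can spend most of a step
# in Python/CPU overhead rather than on the GPU.
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")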

trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)
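
Besides profiler="simple", this is a rough sketch of profiling a few forward passes directly with torch.profiler, to see whether the time goes into CUDA kernels or into CPU-side work (it assumes a single target, so every value in the batch dict x is a tensor):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile ~10 forward passes outside of Lightning.
device = torch.device("cuda")
tft.to(device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, (x, y) in enumerate(train_dataloader):
        x = {k: v.to(device) for k, v in x.items()}  # move the batch dict to the GPU
        with torch.no_grad():
            tft(x)
        if i == 10:
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))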

Profiler (Only the most intensive processes)

Hey! Sorry for the late reply. Was this issue solved? You can try increasing the batch size if you still have free GPU memory available. Also check this thread about GPU utilization on the PyTorch forum.
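
For example, a quick way to check how much GPU memory is still free before raising the batch size (a sketch; assumes a reasonably recent PyTorch):

import torch

# Free vs. total memory on the current GPU, in GiB.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 2**30:.1f} GiB / total: {total / 2**30:.1f} GiB")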