I built a Temporal Fusion Transformer model with PyTorch Forecasting, following the guide here:
https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/stallion.html
I used my own data, a time series with 62k samples, and set training to run on the GPU by specifying accelerator="gpu" in pl.Trainer. The issue is that training is quite slow considering this dataset is not that large.
I first ran training on my laptop GPU (GTX 1650 Ti), then on an A100 40GB, and got only a ~2x speedup. An A100 is many times faster than my laptop GPU, so the uplift should be much larger than 2x. I have the NVIDIA drivers, cuDNN, and the rest installed (the A100 is on Google Cloud, which comes with all of that preinstalled). GPU utilisation is low (10-15%), but I can see that the data has been loaded into GPU memory.
Things I tried:
- Tried small batch sizes (32) and large ones (8192)
- Double checked training is done on the GPU
- Set num_workers to 8 in the dataloader.
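For the second point, this is roughly the check I used (with a stand-in nn.Linear here for brevity; the same check works on the real TFT model): every parameter should report a cuda device when training is actually on the GPU.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module; substitute the fitted TFT model.
model = nn.Linear(8, 4)
if torch.cuda.is_available():
    model = model.cuda()

# Should print e.g. "cuda:0" when the model lives on the GPU.
print(next(model.parameters()).device)
```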
Is there some other bottleneck in my model? Below are the results from the profiler
and snippets of my model configuration.
Dataloaders
```python
batch_size = batch_size

train_dataloader = training.to_dataloader(
    train=True,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
val_dataloader = validation.to_dataloader(
    train=False,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
```
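To check whether the input pipeline is the bottleneck, I also considered timing the dataloader on its own, without the model. This is a sketch (the synthetic TensorDataset is a placeholder; in practice I would pass train_dataloader): if iterating batches alone is nearly as slow as a training epoch, the data loading is the limiting factor rather than the GPU.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_batches(dataloader, n_batches=50):
    """Average seconds per batch over the first n_batches, model excluded."""
    start = time.perf_counter()
    count = 0
    for _batch in dataloader:
        count += 1
        if count >= n_batches:
            break
    return (time.perf_counter() - start) / max(count, 1)


# Synthetic placeholder data; swap in the real train_dataloader to test it.
data = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(data, batch_size=32, num_workers=0)
print(f"{time_batches(loader):.6f} s/batch")
```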
Model Configuration
```python
early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"
)

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=max_epochs,
    accelerator="gpu",
    devices=-1,
    gradient_clip_val=0.1,
    limit_train_batches=1.0,  # comment in for training, running validation every 30 batches
    callbacks=[lr_logger, early_stop_callback],
    enable_model_summary=True,
    profiler="simple",
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.003,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    reduce_on_plateau_patience=4,
)

trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)
```
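One thing I wondered about: with hidden_size=16 the model is tiny, and (as far as I understand) the TFT contains recurrent parts that launch many small kernels, so a big GPU like an A100 might be kernel-launch bound rather than compute bound, which would explain the low utilisation. A hypothetical micro-benchmark of a small LSTM at a similar scale, isolating raw forward+backward throughput from the data pipeline:

```python
import time

import torch


def time_train_step(model, x, steps=20):
    """Average seconds per forward+backward step of a recurrent module."""
    start = time.perf_counter()
    for _ in range(steps):
        out, _ = model(x)
        out.sum().backward()
    if x.is_cuda:
        # CUDA ops are asynchronous; wait for them before reading the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps


device = "cuda" if torch.cuda.is_available() else "cpu"
# Small LSTM, roughly the scale of hidden_size=16 in the config above.
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
batch = torch.randn(64, 30, 8, device=device)
print(f"{time_train_step(lstm, batch):.5f} s/step on {device}")
```

If the step time barely improves between the two GPUs, the model is too small to saturate the A100 in the first place.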
Profiler (Only the most intensive processes)