I built a Temporal Fusion Transformer model with PyTorch Forecasting, following the guide here:
https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/stallion.html
I used my own data, a time series with 62k samples, and set training to run on the GPU by specifying accelerator="gpu" in pl.Trainer. The issue is that training is quite slow considering this dataset is not that large.
I first ran training on my laptop GPU (GTX 1650 Ti), then on an A100 40GB, and got only a ~2x speedup. An A100 is many times faster than my laptop GPU, so the uplift should be much larger than 2x. I have the NVIDIA drivers, cuDNN, and the rest installed (the A100 is on Google Cloud, which comes with all of that preinstalled). GPU utilisation is low (10-15%), but I can see that the data has been loaded into GPU memory.
Things I tried:
- Tried small batch sizes (32) and large ones (8192)
- Double checked training is done on the GPU
- Set num_workers to 8 in the dataloader.
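For the second point, this is roughly the check I used (with a stand-in nn.Linear here for brevity; the same check works on the real TFT model): every parameter should report a cuda device when training is actually on the GPU.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module; substitute the fitted TFT model.
model = nn.Linear(8, 4)
if torch.cuda.is_available():
    model = model.cuda()

# Should print e.g. "cuda:0" when the model lives on the GPU.
print(next(model.parameters()).device)
```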
Is there some other bottleneck in my model? Below are the results from the profiler
and snippets of my model configuration.
Dataloaders
```python
batch_size = batch_size

train_dataloader = training.to_dataloader(
    train=True,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
val_dataloader = validation.to_dataloader(
    train=False,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
```
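To check whether the input pipeline is the bottleneck, I also considered timing the dataloader on its own, without the model. This is a sketch (the synthetic TensorDataset is a placeholder; in practice I would pass train_dataloader): if iterating batches alone is nearly as slow as a training epoch, the data loading is the limiting factor rather than the GPU.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_batches(dataloader, n_batches=50):
    """Average seconds per batch over the first n_batches, model excluded."""
    start = time.perf_counter()
    count = 0
    for _batch in dataloader:
        count += 1
        if count >= n_batches:
            break
    return (time.perf_counter() - start) / max(count, 1)


# Synthetic placeholder data; swap in the real train_dataloader to test it.
data = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(data, batch_size=32, num_workers=0)
print(f"{time_batches(loader):.6f} s/batch")
```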
Model Configuration
```python
early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"
)

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=max_epochs,
    accelerator="gpu",
    devices=-1,
    gradient_clip_val=0.1,
    limit_train_batches=1.0,  # comment in for training, running validation every 30 batches
    callbacks=[lr_logger, early_stop_callback],
    enable_model_summary=True,
    profiler="simple",
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.003,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    reduce_on_plateau_patience=4,
)

trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)
```
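One thing I wondered about: with hidden_size=16 the model is tiny, and (as far as I understand) the TFT contains recurrent parts that launch many small kernels, so a big GPU like an A100 might be kernel-launch bound rather than compute bound, which would explain the low utilisation. A hypothetical micro-benchmark of a small LSTM at a similar scale, isolating raw forward+backward throughput from the data pipeline:

```python
import time

import torch


def time_train_step(model, x, steps=20):
    """Average seconds per forward+backward step of a recurrent module."""
    start = time.perf_counter()
    for _ in range(steps):
        out, _ = model(x)
        out.sum().backward()
    if x.is_cuda:
        # CUDA ops are asynchronous; wait for them before reading the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps


device = "cuda" if torch.cuda.is_available() else "cpu"
# Small LSTM, roughly the scale of hidden_size=16 in the config above.
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
batch = torch.randn(64, 30, 8, device=device)
print(f"{time_train_step(lstm, batch):.5f} s/step on {device}")
```

If the step time barely improves between the two GPUs, the model is too small to saturate the A100 in the first place.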
Profiler (Only the most intensive processes)