Hi all!
First, thank you for the amazing framework and blog.
I am training falcon-7b on a custom dataset with the following hyperparameters:
- batch_size = 2
- aggregate_batch = 4
- epochs = 10
- train set size = 8585
- eval set size = 200
I am running on 2× A100 40 GB GPUs with adapterV2. According to this guide, I should be able to iterate through 52k datapoints in under 1.26 hours. With the parameters above, however, each GPU is maxed out (peak memory ~37 GB), and training takes more than 3 hours to iterate over 10k datapoints.
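For reference, here is a rough sketch of how those numbers combine into an effective batch size and step count. The variable names (`micro_batch_size`, `gradient_accumulation`, etc.) are just my own labels for the settings listed above, not the actual variables in the training script:

```python
# Illustrative only: my own labels for the settings above, not the script's variables.
micro_batch_size = 2        # batch_size per device
gradient_accumulation = 4   # aggregate_batch
num_devices = 2             # 2x A100 40GB
epochs = 10
train_set_size = 8585
eval_set_size = 200

# Effective batch size per optimizer step:
effective_batch = micro_batch_size * gradient_accumulation * num_devices  # = 16

# Optimizer steps per epoch and for the whole run:
steps_per_epoch = train_set_size // effective_batch  # ~536
total_steps = steps_per_epoch * epochs               # ~5360
print(effective_batch, steps_per_epoch, total_steps)
```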
Could you help me improve training efficiency?
For completeness, let me add that some code changes were required, such as explicitly casting to bfloat16 whenever a dtype mismatch exception was raised in the attention module.
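The cast I added looks roughly like the snippet below. This is only a sketch of the kind of change, not the exact attention code; the tensor names are placeholders:

```python
import torch

# Illustrative only: the kind of explicit cast I inserted wherever the
# attention module raised a dtype mismatch (tensor names are placeholders,
# not the actual variables in the model code).
def attention_matmul(scores: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    if scores.dtype != value.dtype:
        # Cast both operands to bfloat16 before the matmul to avoid the
        # "expected scalar type ..." mismatch error.
        scores = scores.to(torch.bfloat16)
        value = value.to(torch.bfloat16)
    return scores @ value
```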
System:
Ubuntu: 22.04
Nvidia drivers: 530.30.02
CUDA version: 12.1