Improving poor training efficiency on A100 40 GB

Hi all!
First, thank you for the amazing framework and blog.

I am fine-tuning falcon-7b with adapterV2 on a custom dataset, on 2× A100 40 GB, with the following hyperparameters:

  • batch_size = 2
  • aggregate_batch = 4
  • epochs = 10
  • train set size = 8585
  • eval set size = 200

According to this guide, I should be able to iterate through 52k datapoints in under 1.26 hours. With the parameters above, however, each GPU is maxed out (peak memory ~37 GB), and training takes more than 3 hours to iterate over 10k datapoints (see the quick comparison below).
Could you help me improve training efficiency?
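
For reference, here is the rough arithmetic behind the slowdown I'm seeing. This is just a back-of-the-envelope sketch, assuming `aggregate_batch` means gradient-accumulation steps and that the guide's 52k figure refers to samples processed (both are my assumptions):

```python
# Rough throughput comparison (my numbers; assumptions noted above).
guide_samples, guide_hours = 52_000, 1.26      # figure from the guide
my_samples, my_hours = 10_000, 3.0             # what I observe

guide_rate = guide_samples / (guide_hours * 3600)   # ~11.5 samples/s
my_rate = my_samples / (my_hours * 3600)            # ~0.9 samples/s

print(f"guide: {guide_rate:.1f} samples/s, mine: {my_rate:.2f} samples/s")
print(f"slowdown: ~{guide_rate / my_rate:.0f}x")

# Effective batch size with my settings (2 GPUs), if my reading of
# aggregate_batch as accumulation steps is correct:
micro_batch, accumulation, devices = 2, 4, 2
print("effective batch size:", micro_batch * accumulation * devices)  # 16
```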

For the sake of completeness, let me add that some code changes were required, such as explicitly casting to bfloat16 wherever a dtype-mismatch exception was raised in the attention module; a minimal sketch of what I did follows.
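
This is only an illustrative sketch of the kind of cast I added, not the framework's actual attention code; the function and variable names here are hypothetical:

```python
import torch

def scaled_dot_product(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Explicitly cast operands to bfloat16 to avoid dtype-mismatch errors
    # (e.g. "expected BFloat16 but found Float") raised inside attention.
    q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```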

System:
Ubuntu: 22.04
Nvidia drivers: 530.30.02
CUDA version: 12.1