Hi all!
First, thank you for the amazing framework and blog.
I am training falcon-7b on a custom dataset with the following hyperparameters:
- batch_size = 2
- aggregate_batch = 4
- epochs = 10
- train set size = 8585
- eval set size = 200
I am running on 2× A100 40 GB GPUs with adapterV2. According to this guide, I should be able to iterate through 52k datapoints in under 1.26 hours. With the parameters above, however, each GPU is maxed out (peak memory ~37 GB), and training takes more than 3 hours to iterate over 10k datapoints.
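For reference, here is a rough sketch of how those numbers combine into an effective batch size and step count. The variable names (`micro_batch_size`, `gradient_accumulation`, etc.) are just my own labels for the settings listed above, not the actual variables in the training script:

```python
# Illustrative only: my own labels for the settings above, not the script's variables.
micro_batch_size = 2        # batch_size per device
gradient_accumulation = 4   # aggregate_batch
num_devices = 2             # 2x A100 40GB
epochs = 10
train_set_size = 8585
eval_set_size = 200

# Effective batch size per optimizer step:
effective_batch = micro_batch_size * gradient_accumulation * num_devices  # = 16

# Optimizer steps per epoch and for the whole run:
steps_per_epoch = train_set_size // effective_batch  # ~536
total_steps = steps_per_epoch * epochs               # ~5360
print(effective_batch, steps_per_epoch, total_steps)
```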
Could you help me improve training efficiency?
For completeness, let me add that some code changes were required, such as explicitly casting to bfloat16 whenever a dtype mismatch exception was raised in the attention module.
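The cast I added looks roughly like the snippet below. This is only a sketch of the kind of change, not the exact attention code; the tensor names are placeholders:

```python
import torch

# Illustrative only: the kind of explicit cast I inserted wherever the
# attention module raised a dtype mismatch (tensor names are placeholders,
# not the actual variables in the model code).
def attention_matmul(scores: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    if scores.dtype != value.dtype:
        # Cast both operands to bfloat16 before the matmul to avoid the
        # "expected scalar type ..." mismatch error.
        scores = scores.to(torch.bfloat16)
        value = value.to(torch.bfloat16)
    return scores @ value
```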
System:
Ubuntu: 22.04
Nvidia drivers: 530.30.02
CUDA version: 12.1