The training splits onto one GPU

Hello!

I have a problem: when I started training with --nproc_per_node 4, I saw 4 processes on the first GPU and 1 process on each of the other three GPUs. It still works that way, but when I set --nproc_per_node 8, it crashes with CUDA out of memory.

Here is the command from the terminal for the run that crashed:

CUDA_VISIBLE_DEVICES=1,3,9,10,11,12,13,14 python -m torch.distributed.launch --nproc_per_node 8 nllb-train-example.py
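
For reference, here is a quick way to see from inside the script which GPU each spawned process binds to (illustrative snippet only, not part of my actual training script; in recent PyTorch versions the launcher exports LOCAL_RANK, older versions pass a --local_rank argument instead):

import os
import torch

# torch.distributed.launch starts one process per --nproc_per_node and gives
# each of them its own local rank; every process should use only that GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
print(f"local rank {local_rank} -> cuda:{local_rank} "
      f"({torch.cuda.get_device_name(local_rank)})")

Note that device indices inside the script are relative to CUDA_VISIBLE_DEVICES, so cuda:0 here corresponds to physical GPU 1 in the nvidia-smi output below.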

Here is the report from nvidia-smi (from the run that used GPUs 1, 3, 9, and 10):
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|=============================================================================|
|    1   N/A  N/A   1480487      C   …envs/env_train/bin/python      27171MiB  |
|    1   N/A  N/A   1480488      C   …envs/env_train/bin/python        923MiB  |
|    1   N/A  N/A   1480489      C   …envs/env_train/bin/python        923MiB  |
|    1   N/A  N/A   1480490      C   …envs/env_train/bin/python        923MiB  |
|    3   N/A  N/A   1480488      C   …envs/env_train/bin/python      27071MiB  |
|    9   N/A  N/A   1480489      C   …envs/env_train/bin/python      27113MiB  |
|   10   N/A  N/A   1480490      C   …envs/env_train/bin/python      27077MiB  |
+-----------------------------------------------------------------------------+

I would really appreciate any help with this problem!

I solved this problem by simply removing .to('cuda') from the line model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-1.3B").
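
In case it helps anyone else: in my case the explicit .to('cuda') was putting the model on cuda:0 in every spawned process, so all ranks were allocating memory on the first visible GPU before the per-rank placement happened. Below is a minimal sketch of the change, assuming the script uses the Hugging Face Trainer (which moves the model to each rank's own device); the explicit placement at the end is only an illustration for scripts that do not use the Trainer:

import os
import torch
from transformers import AutoModelForSeq2SeqLM

# Load on CPU; do NOT append .to('cuda') here, otherwise every launched
# process puts a copy of the model on cuda:0 (physical GPU 1 with my
# CUDA_VISIBLE_DEVICES setting).
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-1.3B")

# Only needed for a hand-written training loop (the Trainer does this itself):
# move the model to the GPU assigned to this process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
model = model.to(f"cuda:{local_rank}")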
