Hey everyone,
I am trying to train a model on our lab's GPU server, but I am running into a strange issue: I get a CUDA OOM error when I try to train the model with this trainer configuration:
trainer = pl.Trainer(
    max_epochs=10,
    gpus=[2, 3],
    accelerator="ddp",
    precision=16,
    callbacks=callbacks,
    progress_bar_refresh_rate=20,
    deterministic=True,
    prepare_data_per_node=False)
This also happens if I set gpus=2 and auto_select_gpus=True. The server has 10 GPUs (pretty powerful ones, too), and, checking with nvidia-smi, there are GPUs that are not in use (and the ones I select manually are free).
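For completeness, the alternative configuration I tried looks roughly like this (everything else unchanged from the trainer above):

# Same trainer as above, but letting Lightning pick 2 free GPUs on its own
trainer = pl.Trainer(
    max_epochs=10,
    gpus=2,
    auto_select_gpus=True,
    accelerator="ddp",
    precision=16,
    callbacks=callbacks,
    progress_bar_refresh_rate=20,
    deterministic=True,
    prepare_data_per_node=False)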
The issue also happens with other models (for instance the GAN in PL Bolts). In particular, it happens when running the script that can be found here, with the following CLI arguments:
python main.py --gpus 2 --accelerator ddp --auto_select_gpus --data_dir "data"
I think the exception happens during the DDP setup, and the output of my script (stack trace included) is as follows:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: you passed in a val_dataloader but have no validation_step. Skipping validation loop
warnings.warn(*args, **kwargs)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /home/edoardo.debenedetti/.data/cifar-10-python.tar.gz
100%|█████████████████████████████████████████████████████████████████████████████████████▊| 170172416/170498071 [00:07<00:00, 28156544.60it/s]Extracting /home/edoardo.debenedetti/.data/cifar-10-python.tar.gz to /home/edoardo.debenedetti/.data
Files already downloaded and verified
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "gans_mia_unlearning/architectures/gan.py", line 213, in <module>
dm, model, trainer = cli_main()
File "gans_mia_unlearning/architectures/gan.py", line 208, in cli_main
trainer.fit(model, dm)
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
170500096it [00:14, 11915539.63it/s]
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/gans_mia_unlearning/architectures/gan.py", line 213, in <module>
dm, model, trainer = cli_main()
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/gans_mia_unlearning/architectures/gan.py", line 208, in cli_main
trainer.fit(model, dm)
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
On the other hand, if I use DDP the regular PyTorch way (as in PyTorch's GAN guide here), I get no such exception.
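Roughly, the plain PyTorch DDP setup I tested looks like this (a simplified sketch of the tutorial code, not my exact training script; the port number and the GPU offset are just placeholders):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Pin each process to one of the free GPUs (2 and 3 in my case)
    # before initializing the process group.
    torch.cuda.set_device(rank + 2)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)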
Also, I tried the Boring Model (https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py) and I have the same issue. Moreover, DP works fine.
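For reference, the DP run that works is essentially the same trainer with the backend switched (again just a sketch, with the same callbacks as above):

# Same settings, only the backend changed from DDP to DP
trainer = pl.Trainer(
    max_epochs=10,
    gpus=[2, 3],
    accelerator="dp",
    precision=16,
    callbacks=callbacks,
    progress_bar_refresh_rate=20,
    deterministic=True)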
Do you think this is a problem on my side or with the workstation configuration?
Thanks in advance!