Hello everyone. Initially I trained my model in a single-GPU environment, and it was working perfectly fine. Now I have increased the GPUs to 2 and the number of nodes to 2 (strategy "ddp"), following all the instructions from this page:
https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#replace-sampler-ddp
But I am getting the following issue (this is logged in SLURM's error log):
Downloading: "https://…" to /root/.cache/torch/hub/checkpoints/…
Downloading: "https://…" to /root/.cache/torch/hub/checkpoints/…
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
This is logged in SLURM's output log:
NOTE! Installing ujson may make loading annotations faster.
NOTE! Installing ujson may make loading annotations faster.
After this, the process just hangs: it doesn't give any error and doesn't even terminate (until I kill it). Can someone please suggest what could be wrong here? I have read all the threads about multi-GPU errors, but no one has raised this issue, and I can't figure out what is wrong in my code.
Some information regarding my model:
The code downloads a checkpoint (from the internet) and loads it into the model.
I use torch.utils.data.DistributedSampler.
I download the data in the setup() function of my PyTorch Lightning DataModule (a rough sketch of this is just below).
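To make this concrete, here is a minimal sketch of how the DataModule and the model are organised. It is simplified, and names such as MyDataset, build_backbone and CHECKPOINT_URL are hypothetical placeholders rather than my actual code:

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, DistributedSampler

class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_dir, batch_size):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def setup(self, stage=None):
        # The dataset is downloaded here, inside setup()
        self.train_set = MyDataset(self.data_dir, split="train", download=True)  # MyDataset is a placeholder
        self.val_set = MyDataset(self.data_dir, split="val", download=True)

    def train_dataloader(self):
        # Manual DistributedSampler, since I pass replace_sampler_ddp=False to the Trainer
        sampler = DistributedSampler(self.train_set, shuffle=True)
        return DataLoader(self.train_set, batch_size=self.batch_size, sampler=sampler)

    def val_dataloader(self):
        sampler = DistributedSampler(self.val_set, shuffle=False)
        return DataLoader(self.val_set, batch_size=self.batch_size, sampler=sampler)

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = build_backbone()  # build_backbone is a placeholder
        # Pretrained weights are downloaded from the internet when the model is built;
        # this is what produces the "Downloading: https://..." lines in the error log
        state_dict = torch.hub.load_state_dict_from_url(CHECKPOINT_URL)  # CHECKPOINT_URL is a placeholder
        self.backbone.load_state_dict(state_dict)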
I am initialising the Trainer like this:
Trainer(gradient_clip_val=args.clip_max_norm, max_epochs=args.epochs,
        gpus=args.gpus, strategy="ddp", replace_sampler_ddp=False,
        num_nodes=args.num_nodes, default_root_dir=args.output_path,
        logger=TensorBoardLogger(save_dir=args.output_path, name=args.name))
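and then training is started roughly like this (continuing the placeholder names from the sketch above; args.data_path and args.batch_size are placeholders as well):

trainer = Trainer(...)  # the call shown above
model = MyModel()
datamodule = MyDataModule(args.data_path, args.batch_size)
trainer.fit(model, datamodule=datamodule)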
Please let me know if any further information is needed. Thank you all in advance!