Hi, I am trying to train an implementation of SimCLR using Lightning on a cluster where I have access to two GPUs, but neither is picked up by Lightning when the job is submitted over SLURM.
I’ve pasted the error message below:
(simclr) [nsk367@mycluster src]$ cat slurm-10012580.out
Traceback (most recent call last):
File "train_simclr.py", line 246, in <module>
cli_main()
File "train_simclr.py", line 240, in cli_main
trainer = pl.Trainer.from_argparse_args(args)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 122, in from_argparse_args
return argparse_utils.from_argparse_args(cls, args, **kwargs)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse_utils.py", line 50, in from_argparse_args
return cls(**trainer_kwargs)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 328, in __init__
self.accelerator_connector.on_trainer_init(
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 111, in on_trainer_init
self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 76, in parse_gpu_ids
gpus = _sanitize_gpu_ids(gpus)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 144, in _sanitize_gpu_ids
raise MisconfigurationException(f"""
pytorch_lightning.utilities.exceptions.MisconfigurationException:
You requested GPUs: [0, 1]
But your machine only has: []
I am running torch 1.6.0 and lightning 1.0.2.
The SLURM script requests two GPUs, and I pass an additional argument:
python train_simclr.py --gpus 2
This is what leads to the error above. Happy to share more information if it helps.
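For reference, the submission script looks roughly like this (job name, node/task counts, time limit, and the environment activation line are simplified placeholders; the relevant part is the two-GPU request):

#!/bin/bash
#SBATCH --job-name=simclr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --time=24:00:00

# activate the environment, then launch training on the 2 requested GPUs
source activate simclr
python train_simclr.py --gpus 2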
Thank you!
EDIT:
At the top of the .py file, I run
import torch
print(torch.cuda.device_count())
and that prints 0, so I'm not sure the issue is entirely with Lightning. I have run jobs on this cluster before and never hit this error until using this version of PyTorch / Lightning, so I'm mentioning it in case anyone still knows the root cause.
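For what it is worth, here is a slightly expanded version of that check (only standard os / torch calls, nothing specific to my code) that I can run inside the job if more output would help:

import os
import torch

# what the SLURM allocation actually exposes to this process
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
# whether this torch build was compiled with CUDA support at all
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())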