Hi, I am trying to train an implementation of SimCLR using Lightning on a cluster where I have access to two GPUs, but neither is picked up by Lightning when the job is submitted over SLURM.
I’ve pasted the error message below:
(simclr) [nsk367@mycluster src]$ cat slurm-10012580.out
Traceback (most recent call last):
File "train_simclr.py", line 246, in <module>
cli_main()
File "train_simclr.py", line 240, in cli_main
trainer = pl.Trainer.from_argparse_args(args)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 122, in from_argparse_args
return argparse_utils.from_argparse_args(cls, args, **kwargs)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse_utils.py", line 50, in from_argparse_args
return cls(**trainer_kwargs)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 328, in __init__
self.accelerator_connector.on_trainer_init(
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 111, in on_trainer_init
self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 76, in parse_gpu_ids
gpus = _sanitize_gpu_ids(gpus)
File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 144, in _sanitize_gpu_ids
raise MisconfigurationException(f"""
pytorch_lightning.utilities.exceptions.MisconfigurationException:
You requested GPUs: [0, 1]
But your machine only has: []
I am running torch 1.6.0 and lightning 1.0.2.
The SLURM script requests two GPUs, and I pass an additional argument:
python train_simclr.py --gpus 2
This is what leads to the error above. Happy to share more information if it helps.
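For reference, the submission script looks roughly like this (job name, node/task counts, time limit, and the environment activation line are simplified placeholders; the relevant part is the two-GPU request):

#!/bin/bash
#SBATCH --job-name=simclr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --time=24:00:00

# activate the environment, then launch training on the 2 requested GPUs
source activate simclr
python train_simclr.py --gpus 2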
Thank you!
EDIT:
At the top of the .py file, I run
import torch
print(torch.cuda.device_count())
and that prints 0, so I'm not sure the issue is entirely with Lightning. I have run jobs on this cluster before and never hit this error until using this version of PyTorch / Lightning, so I'm mentioning it in case anyone still knows the root cause.
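For what it is worth, here is a slightly expanded version of that check (only standard os / torch calls, nothing specific to my code) that I can run inside the job if more output would help:

import os
import torch

# what the SLURM allocation actually exposes to this process
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
# whether this torch build was compiled with CUDA support at all
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())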