I’m using PyTorch Lightning to tune hyperparameters, but my training function has issues when run via a SLURM script. The code works fine when I have an interactive shell running, where I execute the commands (see Slurm Workflow below) on each node manually. But when it’s run with a bash script, I consistently get the following errors:
```
Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
Exception raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:84 (most recent call first):
```
For this error, I tried restarting my tuning run.
Further down in the log is this socket error:
```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:15841 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:15841 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
```
Each Trainer has 8 GPUs to work with, which I verify by printing `torch.cuda.device_count()` before calling `trainer.fit(model, dm)`. I had to use `ddp_spawn` because `trainer.fit` is called from another process.
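For context on error 101: "invalid device ordinal" is typically raised when a process asks for a device index that `CUDA_VISIBLE_DEVICES` does not expose, which can happen when SLURM sets a narrower mask for a job step than the node's physical GPU count. A minimal stdlib sketch of the kind of check I mean (`visible_device_count` is a hypothetical helper, not a torch API):

```python
import os

def visible_device_count(env=None):
    """Hypothetical helper: count devices exposed by CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all physical GPUs visible),
    otherwise the number of comma-separated entries in the mask.
    """
    env = os.environ if env is None else env
    mask = env.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return None
    return len([x for x in mask.split(",") if x.strip()])

# With all 8 GPUs exposed, this matches what torch.cuda.device_count() reports.
print(visible_device_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7"}))  # 8
```

If the mask a spawned worker inherits is narrower than the index it tries to use, the ordinal error would follow.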
My trainer setup:
```python
trainer = Trainer(
    max_epochs=int(budget),
    accelerator="auto",
    num_nodes=1,
    devices="auto",
    strategy="ddp_spawn",
    max_time="00:1:00:00",  # give each run a time limit
    num_sanity_val_steps=1,
)
```
SLURM Env flags:
```shell
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export TOKENIZERS_PARALLELISM=False
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
```
Slurm Workflow:
- set flags, navigate to the project folder, and activate the Python environment
- obtain the list of allocated nodes
- start the first node in the list as a Master for the Hyperparameter Tuning Lib
- start the remaining nodes as workers.
- tear down and clean up
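To make the workflow concrete, it roughly corresponds to a batch script like this simplified sketch (the paths, SBATCH headers, and script names `tune_master.py`/`tune_worker.py` are placeholders, not my actual files):

```shell
#!/bin/bash
#SBATCH --nodes=4          # placeholder resource request
#SBATCH --gres=gpu:8

# set flags, navigate to folder and activate python environment
export NCCL_DEBUG=INFO
cd "$HOME/project"            # placeholder path
source venv/bin/activate      # placeholder environment

# obtain the list of allocated nodes
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
master=${nodes[0]}

# start the first node in the list as a Master for the tuning library
srun --nodes=1 --ntasks=1 -w "$master" python tune_master.py &

# start the remaining nodes as workers
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" python tune_worker.py --master "$master" &
done

# tear down and clean up: wait for all background steps to finish
wait
```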
I am not sure how to deal with this, so any advice is welcome.