I’m using PyTorch Lightning to tune hyperparameters, but my training function has issues when run via a SLURM script. The code works fine when I have an interactive shell running, where I execute the commands (see Slurm Workflow below) on each node manually. But when it’s run with a bash script, I consistently get the following errors:
```
Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
Exception raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:84 (most recent call first):
```
For this error, I tried restarting my tuning run.
Further down in the log is this socket error:
```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:15841 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:15841 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
```
Each Trainer has 8 GPUs to work with, which I verify by printing `torch.cuda.device_count()` before calling `trainer.fit(model, dm)`. I had to use `ddp_spawn` because `trainer.fit` is called from another process.
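For context on error 101: "invalid device ordinal" is typically raised when a process asks for a device index that `CUDA_VISIBLE_DEVICES` does not expose, which can happen when SLURM sets a narrower mask for a job step than the node's physical GPU count. A minimal stdlib sketch of the kind of check I mean (`visible_device_count` is a hypothetical helper, not a torch API):

```python
import os

def visible_device_count(env=None):
    """Hypothetical helper: count devices exposed by CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all physical GPUs visible),
    otherwise the number of comma-separated entries in the mask.
    """
    env = os.environ if env is None else env
    mask = env.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return None
    return len([x for x in mask.split(",") if x.strip()])

# With all 8 GPUs exposed, this matches what torch.cuda.device_count() reports.
print(visible_device_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7"}))  # 8
```

If the mask a spawned worker inherits is narrower than the index it tries to use, the ordinal error would follow.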
My trainer setup:
```python
trainer = Trainer(
    max_epochs=int(budget),
    accelerator="auto",
    num_nodes=1,
    devices="auto",
    strategy="ddp_spawn",
    max_time="00:1:00:00",  # give each run a time limit
    num_sanity_val_steps=1,
)
```
SLURM Env flags:
```shell
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export TOKENIZERS_PARALLELISM=False
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
```
Slurm Workflow:
- set flags, navigate to the project folder, and activate the Python environment
- obtain the list of allocated nodes
- start the first node in the list as a Master for the Hyperparameter Tuning Lib
- start the remaining nodes as workers.
- tear down and clean up
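To make the workflow concrete, it roughly corresponds to a batch script like this simplified sketch (the paths, SBATCH headers, and script names `tune_master.py`/`tune_worker.py` are placeholders, not my actual files):

```shell
#!/bin/bash
#SBATCH --nodes=4          # placeholder resource request
#SBATCH --gres=gpu:8

# set flags, navigate to folder and activate python environment
export NCCL_DEBUG=INFO
cd "$HOME/project"            # placeholder path
source venv/bin/activate      # placeholder environment

# obtain the list of allocated nodes
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
master=${nodes[0]}

# start the first node in the list as a Master for the tuning library
srun --nodes=1 --ntasks=1 -w "$master" python tune_master.py &

# start the remaining nodes as workers
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" python tune_worker.py --master "$master" &
done

# tear down and clean up: wait for all background steps to finish
wait
```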
I am not sure how to deal with this, so any advice is welcome.