I’m using PyTorch Lightning to tune hyperparameters, but my training function has issues when run via a SLURM script. The code works fine when I have an interactive shell running, where I execute the commands (see Slurm Workflow below) on each node manually. But when it’s run with a bash script, I consistently get the following errors:
```
Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
Exception raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:84 (most recent call first):
```
For this error, I tried restarting my tuning run.
Further down in the log is this socket error:
```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:15841 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:15841 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
```
Each Trainer has 8 GPUs to work with, which I verify by printing `torch.cuda.device_count()` before calling `trainer.fit(model, dm)`. I had to use `ddp_spawn` because `trainer.fit` is called from another process.
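For context on error 101: "invalid device ordinal" is typically raised when a process asks for a device index that `CUDA_VISIBLE_DEVICES` does not expose, which can happen when SLURM sets a narrower mask for a job step than the node's physical GPU count. A minimal stdlib sketch of the kind of check I mean (`visible_device_count` is a hypothetical helper, not a torch API):

```python
import os

def visible_device_count(env=None):
    """Hypothetical helper: count devices exposed by CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all physical GPUs visible),
    otherwise the number of comma-separated entries in the mask.
    """
    env = os.environ if env is None else env
    mask = env.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return None
    return len([x for x in mask.split(",") if x.strip()])

# With all 8 GPUs exposed, this matches what torch.cuda.device_count() reports.
print(visible_device_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7"}))  # 8
```

If the mask a spawned worker inherits is narrower than the index it tries to use, the ordinal error would follow.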
My trainer setup:
```python
trainer = Trainer(
    max_epochs=int(budget),
    accelerator="auto",
    num_nodes=1,
    devices="auto",
    strategy="ddp_spawn",
    max_time="00:1:00:00",  # give each run a time limit
    num_sanity_val_steps=1,
)
```
SLURM Env flags:
```shell
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export TOKENIZERS_PARALLELISM=False
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
```
Slurm Workflow:
- set flags, navigate to the project folder, and activate the Python environment
- obtain the list of allocated nodes
- start the first node in the list as a Master for the Hyperparameter Tuning Lib
- start the remaining nodes as workers.
- tear down and clean up
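To make the workflow concrete, it roughly corresponds to a batch script like this simplified sketch (the paths, SBATCH headers, and script names `tune_master.py`/`tune_worker.py` are placeholders, not my actual files):

```shell
#!/bin/bash
#SBATCH --nodes=4          # placeholder resource request
#SBATCH --gres=gpu:8

# set flags, navigate to folder and activate python environment
export NCCL_DEBUG=INFO
cd "$HOME/project"            # placeholder path
source venv/bin/activate      # placeholder environment

# obtain the list of allocated nodes
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
master=${nodes[0]}

# start the first node in the list as a Master for the tuning library
srun --nodes=1 --ntasks=1 -w "$master" python tune_master.py &

# start the remaining nodes as workers
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" python tune_worker.py --master "$master" &
done

# tear down and clean up: wait for all background steps to finish
wait
```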
I am not sure how to deal with this, so any advice is welcome.