Hi Folks!
I’m using PyTorch Lightning to tune hyperparameters, but my training function has issues when run from a SLURM batch script. The code works fine when I have an interactive shell running, where I execute the commands (see the SLURM Workflow below) on each node manually. But when it’s run via a bash script I consistently get the following errors:
Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
Exception raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:84 (most recent call first):
For this error, I tried restarting my tuning run.
Further down in the log is this socket error:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:15841 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:15841 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Each Trainer has 8 GPUs to work with, which I verify by printing torch.cuda.is_available() and torch.cuda.device_count() before calling trainer.fit(model, dm). I had to use ddp_spawn because trainer.fit is called from another process.
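The check right before the fit call looks roughly like this:

import torch

# printed on every node immediately before trainer.fit(model, dm)
print("CUDA available:", torch.cuda.is_available())  # expected: True
print("GPU count:", torch.cuda.device_count())       # expected: 8 per node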
My trainer setup:
trainer = Trainer(
    max_epochs=int(budget),
    accelerator="auto",
    num_nodes=1,
    devices="auto",
    strategy="ddp_spawn",
    max_time="00:01:00:00",  # give each run a time limit (DD:HH:MM:SS)
    num_sanity_val_steps=1,
)
SLURM Env flags:
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export TOKENIZERS_PARALLELISM=False
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
SLURM Workflow (sketched below):
- set flags, navigate to the project folder and activate the Python environment
- obtain the list of allocated nodes
- start the first node in the list as a master for the hyperparameter tuning lib
- start the remaining nodes as workers
- tear down and clean up
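A simplified sketch of the batch script; the script name (tune.py), its CLI flags, the paths, and the #SBATCH resources are placeholders, but the structure matches the steps above:

#!/bin/bash
#SBATCH --nodes=3            # placeholder node count
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00

# export the SLURM env flags listed above, then activate the environment
cd /path/to/project          # placeholder path
source venv/bin/activate     # placeholder environment

# obtain the list of allocated nodes
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}

# start the first node as the master for the tuning lib
srun --nodes=1 --ntasks=1 -w "$head_node" python tune.py --role master &

# start the remaining nodes as workers
for ((i = 1; i < ${#nodes[@]}; i++)); do
    srun --nodes=1 --ntasks=1 -w "${nodes[$i]}" python tune.py --role worker --master-host "$head_node" &
done

wait   # tear down and clean up once all steps finish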
I am not sure how to deal with this, so any advice is welcome.
Thank you