I’m using Slurm for training.
I’m running into a problem where my job gets killed by a signal after about an hour of running, with the error:
RuntimeError: DataLoader worker (pid 380671) is killed by signal: CPU time limit exceeded.
Currently I’m using a single GPU on a single node. I even tried testing with num_workers=0, but still no luck.
Running the same training with native PyTorch works fine, but when using trainer.fit it crashes after an hour. My setup looks roughly like the sketch below.
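For context, here is a minimal sketch of what main.py does. The model and dataset are placeholders (my real code differs), but the DataLoader and Trainer settings match my actual run:

import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset

class LitModel(L.LightningModule):
    # Placeholder module standing in for my real model.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

if __name__ == "__main__":
    # Placeholder data; my real dataset differs.
    dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))
    # Also tested with num_workers=0; same crash either way.
    loader = DataLoader(dataset, batch_size=32, num_workers=2)
    # devices/num_nodes match the #SBATCH directives in the script below.
    trainer = L.Trainer(accelerator="gpu", devices=1, num_nodes=1)
    trainer.fit(LitModel(), loader)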
My Slurm script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --nodes=1             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=1   # This needs to match Trainer(devices=...)
#SBATCH --cpus-per-task=1

echo "Current date and time: $(date)"
hostname
pwd

srun python main.py

echo "Current date and time: $(date)"
echo "END"
Appreciate any help here.