Slurm - CPU time limit exceeded


I’m using Slurm for training.
My job gets killed after about an hour of running, with the error:
RuntimeError: DataLoader worker (pid 380671) is killed by signal: CPU time limit exceeded.
I currently use a single GPU on a single node. I even tested with num_workers=0, but still no luck.
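
As far as I understand, "CPU time limit exceeded" is the message for SIGXCPU, which the kernel sends when a process goes over its RLIMIT_CPU soft limit, and Slurm can pass the submitting shell's limits on to the job. Here is a minimal sketch I could run inside the job to print that limit, assuming a Linux node; the helper name describe_cpu_limit is my own, not part of any library:

```python
import resource


def describe_cpu_limit():
    """Return the soft and hard CPU time limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)

    def fmt(value):
        # RLIM_INFINITY means no limit is set for this process.
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} s"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_cpu_limit()
    print(f"RLIMIT_CPU soft={soft} hard={hard}")
```

If this reports a soft limit of around 3600 s, that would be consistent with the crash after roughly an hour, though I have not confirmed that this is the cause.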

I tried the same training with native PyTorch and it works fine, but when using the Lightning Trainer it crashes after an hour.

My slurm script:


#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --nodes=1            # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=1   # This needs to match Trainer(devices=...)
#SBATCH --cpus-per-task=1

echo "Current date and time: $(date)"

srun python

echo "Current date and time: $(date)"
echo "END"

Appreciate any help here.