Hi,
I’m using SLURM for training.
My job is killed by a signal after about an hour of running, with this error:
RuntimeError: DataLoader worker (pid 380671) is killed by signal: CPU time limit exceeded.
Currently I’m using a single GPU on a single node. I even tried num_workers=0, but still no luck.
The same training loop works fine in native PyTorch, but with trainer.fit() it crashes after an hour.
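For reference, main.py is roughly the following (a simplified sketch with placeholder model and dataset names; my real code has the same structure and the same Trainer settings):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Stand-in for my real dataset."""
    def __init__(self, size=1000, dim=32):
        self.data = torch.randn(size, dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class LitModel(pl.LightningModule):
    """Stand-in for my real LightningModule."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        # Dummy loss just to keep the example runnable
        return self.layer(batch).mean()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # num_workers=0 is one of the settings I already tried
    train_loader = DataLoader(RandomDataset(), batch_size=64, num_workers=0)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,      # matches --ntasks-per-node=1
        num_nodes=1,    # matches --nodes=1
        max_epochs=10,
    )
    trainer.fit(LitModel(), train_loader)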
My SLURM script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --nodes=1 # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=1 # This needs to match Trainer(devices=...)
#SBATCH --cpus-per-task=1
echo "Current date and time: $(date)"
hostname
pwd
srun python main.py
echo "Current date and time: $(date)"
echo "END"
I’d appreciate any help here.