Hi,
I’m using SLURM for training.
My job is killed by a signal after about an hour of running, with this error:
RuntimeError: DataLoader worker (pid 380671) is killed by signal: CPU time limit exceeded.
Currently I’m using a single GPU on a single node. I even tried num_workers=0, but still no luck.
The same training loop works fine in native PyTorch, but with trainer.fit() it crashes after an hour.
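For reference, main.py is roughly the following (a simplified sketch with placeholder model and dataset names; my real code has the same structure and the same Trainer settings):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Stand-in for my real dataset."""
    def __init__(self, size=1000, dim=32):
        self.data = torch.randn(size, dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class LitModel(pl.LightningModule):
    """Stand-in for my real LightningModule."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        # Dummy loss just to keep the example runnable
        return self.layer(batch).mean()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # num_workers=0 is one of the settings I already tried
    train_loader = DataLoader(RandomDataset(), batch_size=64, num_workers=0)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,      # matches --ntasks-per-node=1
        num_nodes=1,    # matches --nodes=1
        max_epochs=10,
    )
    trainer.fit(LitModel(), train_loader)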
My SLURM script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --nodes=1 # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=1 # This needs to match Trainer(devices=...)
#SBATCH --cpus-per-task=1
echo "Current date and time: $(date)"
hostname
pwd
srun python main.py
echo "Current date and time: $(date)"
echo "END"
I’d appreciate any help here.