Hi all, my script runs on SLURM with 10 nodes, and each node has one GPU. My SLURM script looks like this:
```
#SBATCH -J xxx
#SBATCH -p xxx
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task xxx
#SBATCH --exclusive
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
srun python main.py fit
```
If I submit it directly with `sbatch run_script.sh`, everything is fine. But if I first allocate 10 SLURM nodes, ssh to the first one, and run `srun python main.py fit`, I get a runtime error:
RuntimeError: You set `--ntasks=10` in your SLURM bash script, but this variable is not supported. HINT: Use `--ntasks-per-node=10` instead.
However, I never set `ntasks` myself. Before I switched to pytorch-lightning, vanilla PyTorch worked fine both ways (i.e., submitting the job with sbatch, or first allocating the resources, ssh-ing to a SLURM node, and running srun). So what could be the possible issue in the second case? Thanks.
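
To help narrow this down, here is a minimal diagnostic sketch that prints the SLURM environment variables seen by each task, so the sbatch run and the interactive run can be compared. It assumes (I have not confirmed this against the Lightning source) that the error is raised when `SLURM_NTASKS` is greater than 1 while `SLURM_NTASKS_PER_NODE` is unset. The file name `check_slurm_env.py` and the helper names are hypothetical.

```python
import os

# SLURM variables that may differ between an sbatch job and an interactive allocation.
SLURM_VARS = (
    "SLURM_JOB_ID",
    "SLURM_NNODES",
    "SLURM_NTASKS",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_PROCID",
)


def slurm_snapshot(env=None):
    """Return the selected SLURM variables; unset ones map to None."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in SLURM_VARS}


def looks_like_ntasks_only(env=None):
    """Assumed trigger for the error: more than one task, but no tasks-per-node value."""
    env = os.environ if env is None else env
    ntasks = int(env.get("SLURM_NTASKS", "1"))
    return ntasks > 1 and "SLURM_NTASKS_PER_NODE" not in env


if __name__ == "__main__":
    for name, value in slurm_snapshot().items():
        print(f"{name}={value}")
    print("ntasks set without ntasks-per-node:", looks_like_ntasks_only())
```

Running `srun python check_slurm_env.py` once inside the sbatch script and once from the interactive shell should show whether `SLURM_NTASKS_PER_NODE` is missing in the second case, for example because the allocation was requested without `--ntasks-per-node`.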