SLURM Runtime Error due to "ntasks" variable

Hi all, my script runs on SLURM with 10 nodes, and each node has one GPU. My SLURM script looks like this:

```
#SBATCH -J xxx
#SBATCH -p xxx
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task xxx
#SBATCH --exclusive
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

srun python main.py fit
```

If I submit it directly with `sbatch run_script.sh`, everything is fine. But if I first allocate 10 SLURM nodes, ssh into the first one, and run `srun python main.py fit`, I get a runtime error:

RuntimeError: You set --ntasks=10 in your SLURM bash script, but this variable is not supported. HINT: Use --ntasks-per-node=10 instead.

However, I did not set the `ntasks` variable. Before I switched to pytorch-lightning, vanilla PyTorch worked fine both ways (i.e., submitting the job with sbatch, or first allocating the resources, ssh-ing into the SLURM node, and using srun). So what could be the issue in my second case? Thanks.
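One way to narrow this down is to compare the SLURM variables that the Python process sees in each launch mode. The following is a minimal sketch, assuming only that SLURM exports its settings as `SLURM_*` environment variables; the helper name `slurm_vars` is hypothetical and not part of any library.

```python
import os


def slurm_vars(environ=None):
    """Return the SLURM_* entries of an environment mapping, sorted by name.

    `environ` defaults to the current process environment. The function name
    is hypothetical; it is only a small debugging aid.
    """
    environ = os.environ if environ is None else environ
    return {key: environ[key] for key in sorted(environ) if key.startswith("SLURM_")}


if __name__ == "__main__":
    # Run this once under sbatch and once in the interactive session,
    # then diff the two outputs.
    for name, value in slurm_vars().items():
        print(f"{name}={value}")
```

Running it in both setups and diffing the output should show whether a variable such as `SLURM_NTASKS` is present in one case but not the other.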

This error was added so users don’t misconfigure their sbatch job.
To use a node interactively like you described, you can set the job name to “bash”.

--job-name bash
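To illustrate why both the error and the job-name workaround depend on the environment, here is a rough sketch of the kind of check described in this reply. It is not Lightning's actual code: the function names are hypothetical, and it assumes the check reads the `SLURM_NTASKS`, `SLURM_NTASKS_PER_NODE`, and `SLURM_JOB_NAME` variables exported by SLURM, and that "bash" and "interactive" are the job names treated as interactive.

```python
import os

# Assumption for this sketch: job names that mark an interactive session.
INTERACTIVE_JOB_NAMES = ("bash", "interactive")


def is_interactive(environ):
    """Hypothetical helper: True if the job name marks an interactive session."""
    return environ.get("SLURM_JOB_NAME") in INTERACTIVE_JOB_NAMES


def validate_ntasks(environ):
    """Hypothetical check: reject --ntasks without --ntasks-per-node.

    Interactive sessions are skipped entirely, which is why renaming the
    job would silence the error in this sketch.
    """
    if is_interactive(environ):
        return
    ntasks = int(environ.get("SLURM_NTASKS", "1"))
    if ntasks > 1 and "SLURM_NTASKS_PER_NODE" not in environ:
        raise RuntimeError(
            f"You set --ntasks={ntasks} in your SLURM bash script, but this "
            f"variable is not supported. HINT: Use --ntasks-per-node={ntasks} instead."
        )


if __name__ == "__main__":
    validate_ntasks(os.environ)
    print("SLURM task configuration looks consistent")
```

Under these assumptions, an allocation that exports `SLURM_NTASKS=10` without `SLURM_NTASKS_PER_NODE` would trigger the error even though `--ntasks` was never passed explicitly, while a job named "bash" would skip the check.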

Thanks for your reply. I changed the job name to bash and ran

srun --job-name=bash python main.py fit

However, it hangs at the DDP initialization stage:

PossibleUserWarning: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/10

Once you are logged in to the machine, you shouldn't launch your script with srun. Just run it as a regular Python script (python main.py fit).

I added some docs here to make it clearer: "Document SLURM interactive mode" by awaelchli, Pull Request #16955 in Lightning-AI/lightning on GitHub.

Hope this helps