Despite scaling up batch size and nodes using PyTorch Lightning and DDP, there's no speedup in training

I've been training a model with PyTorch Lightning Fabric and Distributed Data Parallel (DDP) across multiple nodes, but I haven't observed any speedup in training time. Is a lack of scaling expected at this relatively small scale? I've experimented with different global batch sizes (64 on one node, 256 on four nodes), yet the per-epoch training time stayed about the same. Since with four nodes each rank should only process a quarter of the dataset per epoch, I expected the epoch time to drop roughly 4x. Could this be a bandwidth bottleneck or something else?
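For context, my Fabric setup looks roughly like the sketch below (heavily simplified: the real training code is in src/run_cavmae_pretrain_piano_roll.py, and the model, dataset, and device counts here are just placeholders):

# Simplified sketch of the Fabric/DDP setup; model, dataset, and
# devices/num_nodes values are placeholders, not my actual configuration.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=4, num_nodes=4, strategy="ddp")
fabric.launch()

model = torch.nn.Linear(128, 527)                        # placeholder for CAV-MAE
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
model, optimizer = fabric.setup(model, optimizer)

dataset = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 527))
# setup_dataloaders() attaches a DistributedSampler by default, so each rank
# should only see 1/world_size of the dataset per epoch.
loader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=16, shuffle=True))

for epoch in range(2):
    start = time.time()
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        fabric.backward(loss)
        optimizer.step()
    fabric.print(f"epoch {epoch}: {time.time() - start:.1f}s "
                 f"({len(loader)} steps on rank {fabric.global_rank})")

My expectation was that len(loader) (and hence the epoch time) shrinks as the number of ranks grows, since the global batch size grows with the node count.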
I'm also unsure whether my mpiexec command for multi-node training is correct. Here's what I'm running:

mpiexec --verbose --envall --env CUDA_CACHE_DISABLE=1 \
    -n 8 --ppn 4 \
    --hostfile=/var/spool/pbs/aux/1204475.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov \
    python -W ignore src/run_cavmae_pretrain_piano_roll.py \
    --model cav-mae --dataset audioset \
    --pretrain_path /home/ben2002chou/code/cav-mae-pl/cav-mae/IN-initial.pth \
    --data-train /home/ben2002chou/code/cav-mae/data/cocochorals/audioset_2m_cocochorals_train.json \
    --data-val /home/ben2002chou/code/cav-mae/data/cocochorals/audioset_eval_cocochorals_valid.json \
    --exp-dir ./exp_midi/testmae02-audioset-cav-mae-balNone-lr5e-5-epoch25-bs16-normTrue-c0.01-p1.0-tpFalse-mr-unstructured-0.75 \
    --label-csv /home/ben2002chou/code/cav-mae/data/cocochorals/class_labels_indices_combined.csv \
    --n_class 527 --lr 5e-5 --n-epochs 25 --batch-size 16 --save_model True \
    --mixup 0.0 --bal None \
    --lrscheduler_start 10 --lrscheduler_decay 0.5 --lrscheduler_step 5 \
    --dataset_mean -5.081 --dataset_std 4.4849 --target_length 1024 \
    --noise True --warmup True --lr_adapt False --norm_pix_loss True \
    --mae_loss_weight 1.0 --contrast_loss_weight 0.01 --tr_pos False \
    --masking_ratio 0.75 --mask_mode unstructured
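One thing I haven't done yet is verify that all ranks actually join the job and that the interconnect isn't the limiter. Something like the sanity check below (hypothetical, not part of my current script) is what I had in mind, run right after fabric.launch():

# Hypothetical sanity check: confirm every rank is up and time a large
# all_reduce to get a rough feel for interconnect bandwidth.
import time
import torch
import torch.distributed as dist

def ddp_sanity_check(fabric):
    fabric.print(f"world_size={fabric.world_size}")
    print(f"rank {fabric.global_rank} (local rank {fabric.local_rank})", flush=True)

    # 256 MiB float32 tensor; all_reduce cost grows with world size.
    payload = torch.randn(64 * 1024 * 1024, device=fabric.device)
    fabric.barrier()
    start = time.time()
    for _ in range(10):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / 10
    fabric.print(f"all_reduce of 256 MiB took {elapsed * 1e3:.1f} ms per call")

The idea is that if every rank reports in and the all_reduce timing looks reasonable, the bottleneck is probably elsewhere (e.g., data loading) rather than node-to-node bandwidth. Does that reasoning hold, and is the mpiexec invocation above the right way to launch Fabric across nodes?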