DDP Error in a Hyperparameter Optimisation run

Hi,

I am using the HpBandSter library to wrap my PyTorch Lightning experiment for hyperparameter optimisation.

The library spawns workers that run the objective function via:

import threading

t1 = threading.Thread(target=training_func, name='target')
t1.start()

My training function works fine when run on its own. But when it is used as the objective function for the library, an error is thrown during DDP initialisation.
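
For context, here is a minimal sketch of the training function (BoringModel and the exact Trainer arguments are placeholders standing in for my real model and configuration):

import pytorch_lightning as pl
from pytorch_lightning.demos.boring_classes import BoringModel

def training_func():
    # BoringModel is a placeholder for my real LightningModule.
    model = BoringModel()
    trainer = pl.Trainer(
        accelerator='gpu',
        devices=2,
        strategy='ddp',  # DDP initialisation is where the failure occurs
        max_epochs=1,
    )
    trainer.fit(model)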

I get the following errors:

Exception in training: 
Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).

and

Exception in training: 
DDP expects same model across all ranks, but Rank 0 has 201 params, while rank 1 has inconsistent 393 params.
Exception raised from verify_params_across_processes at ../torch/csrc/distributed/c10d/reducer.cpp:2132 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdad859e4d7 in /home/dsengupt/tinybert_nlp/lib/python3.10/site-packages/torch/lib/libc10.so)

Any advice on how to proceed would be welcome. Thank you!