Hi,
I am using the HpBandSter library to wrap my PyTorch Lightning experiment for hyperparameter optimisation.
The library spawns workers that run the objective function via:
import threading

t1 = threading.Thread(target=training_func, name='target')  # inside the library's worker code
t1.start()
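For context, here is a minimal sketch of how the objective function is wrapped on my side (simplified; the class name, run_id, and nameserver values are placeholders):

from hpbandster.core.worker import Worker

class LightningWorker(Worker):
    def compute(self, config, budget, **kwargs):
        # training_func builds and fits the Lightning model for the sampled config
        val_loss = training_func(config, max_epochs=int(budget))
        return {'loss': float(val_loss), 'info': {}}

worker = LightningWorker(run_id='bohb_run', nameserver='127.0.0.1')
worker.run(background=True)  # this is where the library starts the worker thread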
On its own, the training function with the Lightning code runs fine, but when it is used as the library's objective function, an error is thrown during DDP initialisation.
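The training function itself looks roughly like this (a simplified sketch; the module, datamodule, and logged metric names are placeholders, and the exact Trainer arguments depend on the Lightning version):

import pytorch_lightning as pl

def training_func(config, max_epochs):
    model = LitModel(**config)        # placeholder LightningModule
    datamodule = LitDataModule()      # placeholder LightningDataModule
    trainer = pl.Trainer(
        max_epochs=max_epochs,
        accelerator='gpu',
        devices=2,
        strategy='ddp',               # the failure happens during DDP initialisation
    )
    trainer.fit(model, datamodule=datamodule)
    # 'val_loss' stands in for whatever metric the model logs during validation
    return trainer.callback_metrics['val_loss'].item()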
I get the following errors:
Exception in training:
Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).
and
Exception in training:
DDP expects same model across all ranks, but Rank 0 has 201 params, while rank 1 has inconsistent 393 params.
Exception raised from verify_params_across_processes at ../torch/csrc/distributed/c10d/reducer.cpp:2132 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdad859e4d7 in /home/dsengupt/tinybert_nlp/lib/python3.10/site-packages/torch/lib/libc10.so)
Any advice on how to proceed would be welcome. Thank you!