The problem here seems to be that Hydra is creating two run dirs, one for each DDP process. It seems to me the solution is to manage the run dir yourself. With the following in your config, you can set the current directory as the run dir and disable the creation of new subdirectories:
hydra:
  output_subdir: null   # Disable saving of config files. We'll do that ourselves.
  run:
    dir: .              # Set working dir to current directory
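Since `output_subdir: null` turns off Hydra's own config dump, a minimal sketch of saving the config yourself (the entry point, config path, and output filename are illustrative, not taken from the repo; assumes Hydra 1.0+ and OmegaConf) could be:

```python
import os

import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Save the resolved config only on global rank 0 so the DDP processes
    # don't race on the same file.
    if int(os.environ.get("LOCAL_RANK", 0)) == 0 and int(os.environ.get("NODE_RANK", 0)) == 0:
        OmegaConf.save(cfg, "config.yaml")  # hypothetical output path
    # ... build datamodule / model / trainer here ...


if __name__ == "__main__":
    main()
```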
Thank you @shreeyak for your suggestion and pointers.
So far I’ve decided to set hydra.run.dir manually, so that the checkpoint and logging folders are shared among the DDP processes.
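For example (illustrative command, not the exact one from the repo), launching with an explicit override such as `python train.py hydra.run.dir=outputs/my_run hydra.output_subdir=null` makes every process that DDP spawns resolve the exact same directory, instead of each one getting a fresh `${now:...}`-timestamped folder.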
I am not too familiar with the rank-0-only callbacks, but I am willing to study them more and come back with a better solution, provided I can find one.
Have a nice day
EDIT: Note that I updated the repository linked above into an actual Minimal Working Example, so it no longer uses real data but dummy data instead.
It reproduces the issue whether I use the latest versions (requirements_latest.txt) or my own environment (requirements.txt).
Just to give a high-level overview: because DDP launches separate processes for each GPU, certain tasks, such as reading/writing the same file, should not be executed on all processes, to avoid errors. So we only perform them on the rank 0 process.
In PL, there is a utility, rank_zero_only, that can also be used as a decorator. Any method with this decorator will only execute on the rank 0 process. The same check can also be done manually via environment variables:
import os

# Global rank 0 is the 0th process on the 0th node
if int(os.environ.get('LOCAL_RANK', 0)) == 0 and int(os.environ.get('NODE_RANK', 0)) == 0:
    ...  # rank-0-only work goes here
If one is using a callback for some task, the callback will generally use the @rank_zero_only decorator and perform the task during the setup or pre-training hooks.
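As a rough sketch (the callback name and constructor are illustrative, and the hook signature follows the PL 1.x-style API), such a callback could look like:

```python
import os

from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only


class PrepareRunDir(Callback):
    """Creates the shared logging/checkpoint directory, but only on global rank 0."""

    def __init__(self, run_dir: str):
        self.run_dir = run_dir

    @rank_zero_only
    def setup(self, trainer, pl_module, stage=None):
        # Because of rank_zero_only, the other DDP processes skip this body entirely.
        os.makedirs(self.run_dir, exist_ok=True)
```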
Anyway, it looks like your error might be an actual bug? If so, feel free to refer to my repo for a temporary workaround (set the Hydra run dir to the current directory, manually create a new logging dir for this run, and pass that directory to the ModelCheckpoint callback).
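For completeness, a condensed sketch of that workaround (PL 1.x-style Trainer arguments; `cfg.run_name` and the folder layout are illustrative, not copied from the repo):

```python
import os

import hydra
from omegaconf import DictConfig
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # hydra.run.dir is '.', so build one run directory ourselves. Deriving the
    # name from the config (identical across the spawned processes) avoids each
    # DDP process computing a different timestamped path.
    run_dir = os.path.join("logs", cfg.run_name)  # cfg.run_name is a hypothetical field
    os.makedirs(run_dir, exist_ok=True)

    checkpoint_cb = ModelCheckpoint(dirpath=os.path.join(run_dir, "ckpts"))
    trainer = Trainer(gpus=2, accelerator="ddp", callbacks=[checkpoint_cb])
    # trainer.fit(model, datamodule) would go here


if __name__ == "__main__":
    main()
```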