I’m looking for a good way to sync my output dir name (which contains a timestamp, etc.) between DDP processes. For now, I’m doing something like this:
```python
import os
from datetime import datetime

import dateutil.tz

# LOCAL_RANK comes in as a string from the launcher, so cast it to int
local_rank = int(os.environ.get('LOCAL_RANK', 0))
if local_rank == 0:
    now = datetime.now(dateutil.tz.tzlocal())
    timestamp = now.strftime('%Y_%m_%d_%H_%M_%S')
    run_output_dir = os.path.join(
        cfg.output_dir,
        '%s_%s_%s_%s' % (cfg.dataset, cfg.cfg_name, timestamp, cfg.seed))
    os.environ['RUN_OUTPUT_DIR'] = run_output_dir
else:
    run_output_dir = os.environ['RUN_OUTPUT_DIR']
```
Is this OK, or does someone have a better solution?
I’ve tried to use `torch.distributed.send` and `torch.distributed.recv`, but these only work for tensors.
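One workaround I’ve considered for the tensor-only restriction is encoding the string as a fixed-size byte tensor and broadcasting that. A rough, untested sketch (`broadcast_string` is just a helper name I made up; it assumes the process group is already initialized, and with the NCCL backend the buffer would need to live on the current CUDA device):

```python
import torch
import torch.distributed as dist

def broadcast_string(s: str = '', src: int = 0, max_len: int = 256) -> str:
    """Broadcast a string from rank `src` to all ranks via a byte tensor."""
    buf = torch.zeros(max_len, dtype=torch.uint8)  # move to CUDA for NCCL
    if dist.get_rank() == src:
        data = s.encode('utf-8')
        assert len(data) <= max_len, 'string too long for the buffer'
        buf[:len(data)] = torch.tensor(list(data), dtype=torch.uint8)
    dist.broadcast(buf, src=src)
    # Trim the zero padding and decode back to a string
    return bytes(buf.tolist()).rstrip(b'\x00').decode('utf-8')

# Every rank ends up with rank 0's directory name
run_output_dir = broadcast_string(
    run_output_dir if dist.get_rank() == 0 else '', src=0)
```

But the fixed buffer size feels clunky for what should be a simple handoff.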
I’m also using the `WandbLogger`, so I have considered having all processes save output to `wandb_logger.experiment.dir`, but that doesn’t work because the logger returns a dummy experiment in all but the main process (link).
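One thing that might sidestep the tensor restriction entirely is `torch.distributed.broadcast_object_list`, which broadcasts arbitrary picklable Python objects. Roughly what I have in mind (a sketch, assuming the process group is already initialized and `run_output_dir` is built on rank 0 as in my snippet above):

```python
import torch.distributed as dist

# Rank 0 supplies the real name; the other ranks pass a placeholder
# that gets overwritten in place by the broadcast.
objects = [run_output_dir if dist.get_rank() == 0 else None]
dist.broadcast_object_list(objects, src=0)
run_output_dir = objects[0]
```

Would that be the recommended approach, or is there something more idiomatic?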