Thank you so much for sending this notebook.
I think I realized my mistake: I was calling wandb_logger.experiment.config.some_param = 'x' before the DDP processes start.
I wanted the logger to log the hyperparameters, so I was adding them to the wandb config before training starts. Earlier I only used 1 GPU, so this was never a problem. Now, with multiple TPU cores, this doesn't work: the wandb process has already been started, and it breaks as soon as we spawn/fork the training processes.
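As far as I can tell, the "can only test a child process" assertion in the traceback can be reproduced without wandb at all. The sketch below uses only the standard library and is not wandb's actual code: `helper` is a hypothetical stand-in for the background process the logger starts in the parent, and `child_sees_assertion` is a name I made up. It assumes a POSIX system (it relies on `fork`) and that Python assertions are enabled.

```python
import multiprocessing as mp
import os
import time


def child_sees_assertion() -> bool:
    """Return True if a forked child gets AssertionError from the parent's Process handle."""
    ctx = mp.get_context("fork")  # POSIX only

    # Hypothetical stand-in for the background process a logger starts in the parent.
    helper = ctx.Process(target=time.sleep, args=(2,))
    helper.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()  # stand-in for the training processes being forked later
    if pid == 0:
        # Child: it inherited the `helper` handle but is not the process that created it.
        os.close(read_fd)
        try:
            helper.is_alive()
            result = b"0"
        except AssertionError:
            # Process.is_alive() asserts that the caller is the handle's parent process.
            result = b"1"
        os.write(write_fd, result)
        os._exit(0)

    # Parent: read the child's verdict, then clean up both processes.
    os.close(write_fd)
    outcome = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    helper.terminate()
    helper.join()
    return outcome == b"1"


if __name__ == "__main__":
    print(child_sees_assertion())
```

If this matches what happens inside wandb, then anything that creates the run in the parent (such as touching wandb_logger.experiment) before the training processes are forked would leave each child holding a process handle it is not allowed to query.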
With the DDP setup, I also tried moving the hparam logging to the on_fit_start hook. This crashed with a similar error.
The full error is below:
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Tracking run with wandb version 0.10.8
Syncing run EXP_NAME to Weights & Biases (Documentation).
Project page: https://wandb.ai/valaydave/PROJECT_NAME
Run page: https://wandb.ai/valaydave/PROJECT_NAME/runs/2hp8cgze
Run data is saved locally in wandb/run-20201026_203332-2hp8cgze
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
training on 8 TPU cores
---------------------------------------------------------------------------
ProcessRaisedException Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>()
----> 1 test_x(tmpdir)
5 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
164 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
165 msg += original_trace
--> 166 raise ProcessRaisedException(msg, error_index, failed_process.pid)
167
168
ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/lib/python3.6/logging/__init__.py", line 996, in emit
stream.write(msg)
File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
cb(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py", line 119, in tpu_train_in_process
self.__setup_tpu_training(model, trainer)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py", line 221, in __setup_tpu_training
log.info(f'INIT TPU local core: {trainer.tpu_local_core_rank},'
File "/usr/lib/python3.6/logging/__init__.py", line 1308, in info
self._log(INFO, msg, args, **kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
self.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
self.callHandlers(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
hdlr.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
self.emit(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1000, in emit
self.handleError(record)
File "/usr/lib/python3.6/logging/__init__.py", line 917, in handleError
sys.stderr.write('--- Logging error ---\n')
File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
cb(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 335, in _mp_start_fn
file=sys.stderr)
File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
cb(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Colab to replicate the bug: Google Colab
In light of this problem, I was wondering what the best practice is for logging hparams with wandb when using DDP or multiple TPU cores.