Hello,
I am experiencing very weird behaviour of the Trainer in single-GPU mode (accelerator=None) with pytorch-lightning 1.1.6.
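For reference, the Trainer is set up essentially like this (the argument values other than accelerator=None are illustrative placeholders, not my exact config):

```python
import pytorch_lightning as pl

# Simplified Trainer setup; gpus/max_epochs values are placeholders.
# The relevant part is a single GPU with accelerator=None.
trainer = pl.Trainer(
    gpus=1,
    accelerator=None,
    max_epochs=100,
)
# trainer.fit(module, datamodule)
```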
Right before training starts, I get the following warning:
```
.../env/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: `pos_label` automatically set 1.
  warnings.warn(*args, **kwargs)
```
And subsequently the training crashes with:
```
Traceback (most recent call last):
  File "envs/pl1.x/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "envs/pl1.x/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "train.py", line 112, in <module>
    main(cv_partition=cv_partition)
  File "train.py", line 94, in main
    trainer.fit(module, datamodule)
  File "train.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 611, in run_training_epoch
    self.run_on_epoch_end_hook(epoch_output)
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 846, in run_on_epoch_end_hook
    self.trainer.logger_connector.on_train_epoch_end()
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 371, in on_train_epoch_end
    self.cached_results.has_batch_loop_finished = True
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 439, in has_batch_loop_finished
    self.auto_reduce_results_on_epoch_end()
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 429, in auto_reduce_results_on_epoch_end
    hook_result.auto_reduce_results_on_epoch_end()
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 223, in auto_reduce_results_on_epoch_end
    opt_outputs = time_reduced_outputs[0].__class__.reduce_on_epoch_end(time_reduced_outputs)
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/core/step_result.py", line 519, in reduce_on_epoch_end
    recursive_stack(result)
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/core/step_result.py", line 660, in recursive_stack
    result[k] = collate_tensors(v)
  File "envs/pl1.x/lib/python3.7/site-packages/pytorch_lightning/core/step_result.py", line 682, in collate_tensors
    return torch.stack(items)
RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0
```
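If I read the traceback correctly, the epoch-end reduction calls torch.stack on the tensors cached from my logged values, and one of them apparently ends up on the CPU while the rest live on cuda:0. The failing call itself is easy to reproduce in isolation (just my attempt to illustrate the error, assuming a CUDA device is available):

```python
import torch

# Stacking a CUDA tensor together with a CPU tensor yields the same error message.
items = [torch.tensor(0.5, device="cuda:0"), torch.tensor(0.7)]
torch.stack(items)
# RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0
```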
I am logging the loss in training_step and validation_step as suggested in the PL docs (a simplified sketch of my module is at the end of this post). My on_train_epoch_end hook is not even reached; the run crashes with the above error before it. Any hints on what is going on here?
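For completeness, here is a simplified sketch of how I log (the model and loss function are placeholders, but the self.log calls mirror what I actually do):

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)  # placeholder model

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # Log the training loss per step and aggregated per epoch.
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # Log the validation loss aggregated per epoch.
        self.log("val_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```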