I am trying to move a project from native PyTorch to Lightning. The problem is that when I use “16-mixed” precision, all outputs from the model's forward method (and therefore the loss) become NaN, right from the start.
This does not happen when I run fp16 in native PyTorch with autocast (AMP). Any ideas?
Happy to provide more logs or the code if needed. I am thinking I am missing something trivial.
The model is a modified ResNet, and it’s pretty ehm… convoluted.
When I use autocast with fp16 in native PyTorch, it works fine.
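For reference, the native setup follows the usual autocast plus GradScaler pattern. The sketch below is a generic illustration of that pattern, not my actual training code: the function name `amp_train_step`, the toy linear model, and the CPU/bfloat16 fallback (added only so the snippet runs without a GPU) are all assumptions.

```python
import torch
import torch.nn as nn


def amp_train_step(model, batch, optimizer, scaler, device_type):
    """One training step with native PyTorch AMP (hypothetical helper)."""
    x, y = batch
    # fp16 on GPU; bfloat16 on CPU only so the sketch runs without CUDA.
    amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type, dtype=amp_dtype):
        logits = model(x)
        loss = nn.functional.cross_entropy(logits, y)
    # The scaler multiplies the loss so small fp16 gradients do not underflow,
    # then unscales before the optimizer step and skips steps with inf/NaN grads.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()


device_type = "cuda" if torch.cuda.is_available() else "cpu"
toy_model = nn.Linear(4, 3).to(device_type)  # stand-in for the real network
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
# Scaling is only enabled on CUDA; when disabled, the scaler calls are pass-throughs.
toy_scaler = torch.cuda.amp.GradScaler(enabled=(device_type == "cuda"))
toy_batch = (
    torch.randn(8, 4, device=device_type),
    torch.randint(0, 3, (8,), device=device_type),
)
toy_loss = amp_train_step(toy_model, toy_batch, toy_optimizer, toy_scaler, device_type)
```

Newer PyTorch releases also expose the scaler as `torch.amp.GradScaler`; the older `torch.cuda.amp` path is used here only for compatibility.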
I am now trying to port it to Lightning.
Using fp32 works fine, or at least it seems to when I try to overfit a batch or do a fast_dev_run.
If I use the precision=‘16-mixed’ flag, the forward method produces NaN right away, from the first iteration.
Setting detect_anomaly=True produces, as expected: Function ‘LogSoftmaxBackward0’ returned nan values in its 0th output.
The optimizer is SGD with an initial learning rate of 0.1. I tried smaller values and it still fails.
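One thing I could try to narrow this down is to attach forward hooks and record which submodules emit non-finite outputs during a single forward pass. The following is only a sketch of that idea: `find_nonfinite_modules` is a hypothetical helper name, and it only checks modules that return a single tensor.

```python
import torch
import torch.nn as nn


def find_nonfinite_modules(model, *inputs):
    """Run one forward pass and return the names of submodules whose
    output contains NaN or Inf, in the order their forward calls finish."""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # Only plain tensor outputs are checked in this sketch.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return bad
```

To reproduce the mixed-precision case, the call would have to be wrapped in the same `torch.autocast` context used for training; the first name in the returned list is then the earliest layer that finished with a non-finite output.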
Any input whatsoever on what may be the cause, or just how to approach troubleshooting this, is more than welcome.
Thank you for your time.