Mixed precision not working only in Lightning: forward produces NaN



I am trying to move a project from native PyTorch to Lightning. The problem is that when I use “16-mixed” precision, all outputs from the model's forward method (and therefore the loss) become NaN, right from the start.
This does not happen if I run fp16 in native PyTorch with autocast (AMP). Any ideas?
Happy to provide more logs or the code if needed. I suspect I am missing something trivial.

Full version:

The model is a modified ResNet, and it’s pretty, ehm… convoluted.
When I use autocast with fp16 in native PyTorch, it works fine.
I am now trying to port it to Lightning.
Using fp32 works fine, or at least it seems to when I try to overfit a batch or run with fast_dev_run.

If I set precision=‘16-mixed’, the forward method produces NaN right away, from the first iteration.
detect_anomaly=True reports, as expected: Function ‘LogSoftmaxBackward0’ returned nan values in its 0th output.
The optimizer is SGD and the initial learning rate is 0.1. I tried smaller values, and it still fails.

Any input whatsoever on what may be the cause, or just how to approach troubleshooting this, is more than welcome.

Thank you for your time.
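
One possible way to approach the troubleshooting asked about above is to find which layer first emits a non-finite value, rather than only seeing the NaN at the loss. The sketch below is an assumption about how one might do that with forward hooks; `find_first_nonfinite` and `ToHalf` are hypothetical names for illustration, not part of PyTorch or Lightning, and the toy model stands in for the real ResNet.

```python
import torch
import torch.nn as nn


def find_first_nonfinite(model: nn.Module, *inputs):
    """Run one forward pass and return the names of submodules whose
    output contains NaN or Inf, in the order they were called."""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # Check in float32 so the test itself cannot overflow.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output.float()).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module (empty name)
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()  # always clean up the hooks
    return bad


# Toy stand-in (hypothetical): casting 70000.0 to float16 overflows to inf,
# because the largest finite float16 value is 65504.
class ToHalf(nn.Module):
    def forward(self, x):
        return x.half()


toy = nn.Sequential(nn.Identity(), ToHalf())
offenders = find_first_nonfinite(toy, torch.tensor([[1.0, 70000.0]]))
print(offenders)
```

For the real model, the call would presumably need to run under the same autocast settings that the failing training run uses, so that the hooks see the same dtypes. The first name in the returned list is the place to start looking.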

I have the same issue with precision 16.
The model weights are cast to 16-bit but the data remain in float32, and this produces NaN values.
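
A quick way to check whether the weights and the batch really end up in different dtypes is to print both before the forward pass. This is only a sketch; `dtype_report` is a hypothetical helper, and the small `nn.Linear` stands in for the actual model.

```python
import torch
import torch.nn as nn


def dtype_report(model: nn.Module, batch: torch.Tensor) -> dict:
    """Collect the distinct parameter dtypes and the input dtype."""
    return {
        "params": sorted({str(p.dtype) for p in model.parameters()}),
        "input": str(batch.dtype),
    }


# Stand-in model whose weights were explicitly cast to float16,
# fed with a default (float32) batch.
model = nn.Linear(2, 2).half()
batch = torch.zeros(1, 2)
report = dtype_report(model, batch)
print(report)
```

Calling the same helper at the top of `training_step` would show what the model and the data look like under the chosen precision setting.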
