Mixed precision not working only in Lightning: forward produces NaN

Argotera · September 17, 2023, 6:13pm

Hi!

TL;DR

I am trying to move a project from native PyTorch to Lightning. The problem is that when I use “16-mixed” precision, all outputs of the model's forward method (and therefore the loss) become NaN, right from the start.
This does not happen when I run fp16 in the native PyTorch implementation with autocast (AMP). Any ideas?
Happy to provide more logs or the code if needed. I suspect I am missing something trivial.
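
For reference, the native PyTorch setup that works for me looks roughly like the sketch below. The tiny model, random data, and variable names are hypothetical stand-ins, not the real project; it only shows the autocast plus gradient-scaling pattern. On a machine without a GPU the AMP parts are disabled so the loop still runs.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real model and data (not the actual project code).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # fp16 autocast is only exercised on a GPU here

model = nn.Sequential(nn.Linear(8, 4), nn.LogSoftmax(dim=1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.NLLLoss()

# GradScaler scales the loss so small fp16 gradients do not underflow;
# with enabled=False it becomes a pass-through.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(16, 8, device=device)
targets = torch.randint(0, 4, (16,), device=device)

amp_dtype = torch.float16 if use_amp else torch.bfloat16
for step in range(3):
    optimizer.zero_grad()
    # Only the forward pass and the loss are computed under autocast.
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skips the step if gradients contain inf/NaN
    scaler.update()

final_loss = loss.item()
print(f"final loss: {final_loss}")
```

As far as I understand, precision="16-mixed" in Lightning is supposed to wrap the same autocast and gradient-scaling machinery, which is why the difference in behaviour surprises me.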

Full version:

The model is a modified ResNet, and it’s pretty, ehm… convoluted.
When I use autocast with fp16 in native PyTorch, it works fine.
I am now trying to port it to Lightning.
Using fp32 works fine, or at least it seems to when I overfit a single batch or do a fast_dev_run.

If I set precision=‘16-mixed’, the forward method produces NaN from the very first iteration.
detect_anomaly=True reports, as expected: Function ‘LogSoftmaxBackward0’ returned nan values in its 0th output.
The optimizer is SGD with an initial learning rate of 0.1. I tried smaller values and it still fails.
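
Since the NaN already shows up in the forward pass, one way to narrow it down might be to hook every submodule and record which one first emits a non-finite output. The sketch below is only an illustration: `find_nonfinite_modules` and `Fp16Overflow` are hypothetical names, and the toy model fakes an fp16 overflow (values above 65504 become inf) so that it runs on CPU.

```python
import torch
from torch import nn


def find_nonfinite_modules(model: nn.Module, *inputs):
    """Run one forward pass and return the names of submodules whose
    output contains NaN or Inf, in the order they finished executing."""
    bad, handles = [], []

    def make_hook(name):
        def hook(module, args, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()  # always clean up the hooks
    return bad


class Fp16Overflow(nn.Module):
    """Hypothetical layer that mimics an fp16 overflow: 1e5 exceeds the
    largest finite float16 value (65504), so the cast yields inf."""

    def forward(self, x):
        return (x * 1e5).to(torch.float16).float()


toy_model = nn.Sequential(nn.Identity(), Fp16Overflow(), nn.LogSoftmax(dim=1))
bad_modules = find_nonfinite_modules(toy_model, torch.ones(2, 3))
print(bad_modules)
```

In the toy model the first offender is the overflow layer, and LogSoftmax only turns the resulting inf into NaN. To test the real model under mixed precision, the forward pass would have to run inside the same autocast context that training uses.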

Any input whatsoever on what may be the cause, or just how to approach troubleshooting this, is more than welcome.

Thank you for your time.

nilianne72 · December 6, 2023, 9:26pm

I have the same issue with 16-bit precision.
The model weights are cast to 16-bit but the data remain in float32, and this produces NaN values.
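
A quick way to check whether weights and data really end up in different dtypes is to print both. The helper below is a hypothetical sketch (`dtype_report` is not a Lightning or PyTorch function) and uses a toy linear layer in place of the real model; in a Lightning project it could be called at the start of `training_step`.

```python
import torch
from torch import nn


def dtype_report(model: nn.Module, batch: torch.Tensor) -> dict:
    """Collect the distinct parameter dtypes and the dtype of the input batch."""
    return {
        "param_dtypes": sorted({str(p.dtype) for p in model.parameters()}),
        "input_dtype": str(batch.dtype),
    }


model = nn.Linear(8, 4)    # hypothetical stand-in for the real model
batch = torch.randn(2, 8)  # float32 by default

before = dtype_report(model, batch)
after = dtype_report(model.half(), batch)  # .half() casts the weights to float16

print(before)
print(after)
```

If the report shows float16 parameters next to a float32 batch, the weights were cast explicitly somewhere. If the parameters are still float32, the casting is happening per operation inside autocast, and the NaN probably comes from somewhere else.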
