Error with mixed precision 16bit

sudarshan85 · December 31, 2020, 6:11pm

Hello,

I’m training a BERT model for sequence classification (from HF). I am using 16-bit precision and have run into the following error:

AssertionError: Attempted step but _scale is None.  This may indicate your script did not use scaler.scale(loss or outputs) earlier in the iteration.

However, 32-bit training runs without any problems. I am using PyTorch Lightning version 1.1.2. I should note that I ran similar code on an earlier version (I don’t remember which one, but it was > 1.0.0) and didn’t run into any problems with 16-bit training. Here are the training arguments:

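from argparse import Namespace
from pathlib import Path

from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import CSVLogger
from pl_bolts.callbacks import PrintTableMetricsCallback

early_stop_callback = EarlyStopping(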
  monitor='val_loss',
  min_delta=0.0,
  patience=5,
  verbose=False,
  mode='min'
)

logger = CSVLogger(
  save_dir=f'{model_dir}',
  name=None,
)

checkpoint_callback = ModelCheckpoint(
  filepath=Path(f'{logger.log_dir}/checkpoints')/'{epoch}-{val_loss:0.3f}-{val_accuracy:0.3f}',
  save_top_k=3,
  monitor='val_loss',
  verbose=True,
  mode='min',
  prefix=''
)
callbacks = [
  PrintTableMetricsCallback(),
]

trainer_args = Namespace(
  progress_bar_refresh_rate=1,
  max_epochs=2,
  gpus=1,
  accumulate_grad_batches=1,
  precision=16,
  overfit_batches=0.1,
  checkpoint_callback=checkpoint_callback,
  logger=logger,
  callbacks=callbacks,
  fast_dev_run=True,
  reload_dataloaders_every_epoch=True,
)

I can post the code for the model if required. I’d like to train in 16-bit rather than 32-bit precision so that I can increase my batch size.

Thanks.

jirka · January 6, 2021, 7:36pm

Could you please open an issue with a full example so we can try to reproduce it on our end?

sudarshan85 · January 7, 2021, 2:46pm

This issue is tied to the other issue, which came down to my silly mistake! It got solved.

FYI, the silly mistake: I copied over the Lightning module that I had created for a similar project. When I copied the training_step, I accidentally left out return loss, so the step returned None and training was effectively running with no loss.
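
For anyone who lands here with the same error, here is a minimal sketch of the mistake in plain Python (no Lightning or torch needed to run it). A function without a return statement returns None, so presumably no loss ever reaches scaler.scale(...), which is consistent with the assertion message above. The names compute_loss and check_step_output are hypothetical helpers made up for this illustration; they are not part of Lightning.

```python
# Minimal sketch, assuming a toy "loss" so it runs without torch or Lightning.
# compute_loss and check_step_output are hypothetical helpers, not library APIs.

def compute_loss(batch):
    # Stand-in for the model's forward pass plus loss computation.
    return sum(batch) / len(batch)


def training_step_broken(batch, batch_idx):
    loss = compute_loss(batch)
    # Bug: `return loss` is missing, so Python implicitly returns None.


def training_step_fixed(batch, batch_idx):
    loss = compute_loss(batch)
    return loss


def check_step_output(step_fn, batch):
    """Call a training step once and fail loudly if it returns None."""
    out = step_fn(batch, 0)
    if out is None:
        raise ValueError(
            f"{step_fn.__name__} returned None; did you forget `return loss`?"
        )
    return out
```

Running a check like this on a single batch before a long training run would flag the missing return right away, rather than surfacing later as a confusing 16-bit scaler error.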
