Confusing # of optimizer steps when using gradient accumulation with DeepSpeed | 0 | 362 | May 25, 2023
Training when data is stored in batches | 2 | 122 | May 21, 2023
Trainer prints every step in validation | 2 | 998 | May 17, 2023
Weird result in convolutional network | 2 | 309 | May 14, 2023
Retraining a model with new data | 1 | 193 | May 9, 2023
How to use SWA with a cyclic scheduler | 0 | 276 | May 7, 2023
Resume training / load module from DeepSpeed checkpoint | 14 | 2180 | May 6, 2023
Resuming training gives different model result / weights | 0 | 506 | May 4, 2023
Wonder if _update_learning_rates is properly implemented | 0 | 111 | April 19, 2023
Why is the Trainer instance saved inside the DataModule during checkpoint save? | 2 | 230 | April 11, 2023
Trainer.validate/test with ckpt_path does not resume global_step | 3 | 145 | April 7, 2023
Is gradient clipping done before or after gradients accumulation? | 2 | 432 | April 5, 2023
Multiple dataloaders and epoch calculation | 0 | 114 | April 1, 2023
How does `LightningOptimizer.zero_grad()` work? | 2 | 161 | March 31, 2023
Number of steps drifts for `val_check_interval` when gradient accumulation turned on | 0 | 200 | March 26, 2023
Global_step increased at new epoch regardless of gradient accumulation | 2 | 384 | March 26, 2023
Incorrect batch size being inferred using trainer.fit(), correct batch size in dataloader? What could be going wrong? [PyLightning] | 1 | 354 | March 26, 2023
Model Works on CPU but Error out while running on GPU | 1 | 704 | March 25, 2023
How to continue training for more epochs? | 1 | 870 | March 25, 2023
Changing batch size during training | 3 | 1161 | March 20, 2023
Modifying the Trainer when calling Trainer.fit() multiple times | 2 | 1129 | February 18, 2023
Error while training simclr model | 0 | 158 | February 12, 2023
Question about auto_lr_find() | 1 | 2003 | January 31, 2023
How do I prevent initial validation run in Trainer 1.9.0? | 1 | 183 | January 24, 2023
Save_last and monitor in ModelCheckpoint | 0 | 123 | January 23, 2023
Why `precision=16` for me is almost useless for speeding up? | 1 | 903 | January 16, 2023
Resume_from_checkpoint not work | 4 | 3679 | December 7, 2022
Dealing with large dataset | 1 | 2448 | December 3, 2022
Auto_lr_find dependence on initial learning rate | 1 | 399 | November 22, 2022
Gradient Accumulation with Dual (optimizer, scheduler) Training | 0 | 361 | November 10, 2022