Training/predicting takes forever before predict_step is even called

Ever since my model and data got a bit bigger, training/prediction seems to hang forever and never gets to the next training/predict step.

Take the live-prediction case: I am only interested in the last 200 rows to make one final prediction, so it should not need to load much data.

My model has 770k params (LSTM) and I am running on MPS (M1).

In case it's relevant, I build the dataloader with TimeSeriesDataSet from pytorch-forecasting:

def set_prod_data(self, prod_df):
    # Build a prediction-only dataset reusing the training dataset's parameters
    self.prod_dataset = TimeSeriesDataSet.from_dataset(self.training_dataset, prod_df, predict_mode=True)
    self.prod_dataloader = self.prod_dataset.to_dataloader(train=False, batch_size=self.p['batch_size'], num_workers=self.p['num_workers'])

    return self.prod_dataloader, self.prod_dataset

and in main.py:

loaders = dataloader.Dataloaders(dataset, dataset_predict, p)
prod_dataloader, prod_dataset = loaders.set_prod_data(dataset_prod)
       
model = lstm.LSTMClassifier(p, loaders)
training = training.Training(p, device, reset=False)
trainer = training.get_trainer()
eval = eval.Eval(p, trainer, model, loaders)
            
print('Starting evaluation')  # this is printed
trainer.predict(model=model, dataloaders=prod_dataloader)
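
To rule out the data side, a quick check (just a sketch, using the names above) is to time how long pulling one batch from the production dataloader takes:

    import time

    start = time.time()
    batch = next(iter(prod_dataloader))  # fetch a single batch directly
    print(f'First batch fetched in {time.time() - start:.2f}s')

If that comes back quickly, the dataloader itself is not where the time goes.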

I have a breakpoint set at the first line of def predict_step(self, batch, batch_idx):, but execution never reaches it.

I am at a loss as to what is happening. Is the LSTM really too big to fit in memory? My batch size is only 200, and the production data is tiny.

How can I figure out where the infinite loop is? The process just seems to disappear into .predict() and never reaches predict_step().

All my dataloaders (including the training dataloader) live in the loaders object. It is passed into the LSTM's __init__() because I use loaders.decode to do some checks during training. Still, the data loading itself finishes, so I am unsure what is going on.
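
One generic way to see where a hang like this sits (a sketch using Python's standard faulthandler module, nothing Lightning-specific) is to have the process dump every thread's stack after a timeout:

    import faulthandler

    # Dump the traceback of all threads after 60 seconds, and again every
    # 60 seconds, so a hang shows exactly which call it is stuck in.
    faulthandler.dump_traceback_later(60, repeat=True)

    trainer.predict(model=model, dataloaders=prod_dataloader)
    faulthandler.cancel_dump_traceback_later()  # stop the timer once predict returns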

More information:

I found out that the process 'hanging' is in fact _log_hyperparams(self) in trainer.py, and more specifically logger.save() in utilities.py.

I use TensorBoard as a logger.

    checkpoint_callback = ModelCheckpoint(
        dirpath=checkpoint_path,
        filename="best-checkpoint_{epoch}-{val_loss:.3f}",
        save_top_k=1,
        verbose=True,
        monitor="val_loss",
        mode="min"
    )

    tb_name = 'tb_' + p['model_name']  #+ '_hidden' + str(p['hidden_size']) + '_numlay' + str(p['num_layers'])
    tensor_logger = TensorBoardLogger(tb_path, name=tb_name)
    
    for key, value in p.items():
        if isinstance(value, (int, float)):
            tensor_logger.experiment.add_scalar(key, value)


    self.trainer = Trainer(
        max_epochs=p['max_epochs'],
        logger=tensor_logger,  # could also be a list, e.g. [tensor_logger, csv_logger]
        accelerator='cpu',
        devices=1,
        log_every_n_steps=1,
        callbacks=[TQDMProgressBar(refresh_rate=1), checkpoint_callback],  # "callbacks" takes a list of callbacks
    )
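
One way to confirm the hang really is in the logger's hyperparameter save (a sketch, reusing the model and dataloader from above) is to predict with a bare trainer that has no logger at all:

    # With logger=False there is no logger to save hyperparameters to, so if
    # this prediction runs through, the hyperparameter save is the culprit.
    debug_trainer = Trainer(
        accelerator='cpu',
        devices=1,
        logger=False,
        enable_checkpointing=False,
    )
    debug_trainer.predict(model=model, dataloaders=prod_dataloader)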

and my model:

import pytorch_lightning as pl
from torch import nn


class LSTMClassifier(pl.LightningModule):
    '''
    Standard PyTorch Lightning module:
    https://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html
    '''
    def __init__(self, p, loaders):
        super().__init__()
        self.n_features = p['n_features']
        self.hidden_size = p['hidden_size']
        self.seq_len = p['seq_len']
        self.model_name = p['model_name']
        self.batch_size = p['batch_size']
        self.num_layers = p['num_layers']
        self.log_path = p['log_path']
        self.dropout = p['dropout']
        self.criterion = p['criterion']
        self.learning_rate = p['learning_rate']
        self.n_labels = p['n_labels']
        self.loaders = loaders

        # Add BatchNorm1d before the LSTM
        self.batch_norm_input = nn.BatchNorm1d(self.seq_len)

        # and after the LSTM
        self.batch_norm_lstm = nn.BatchNorm1d(self.seq_len)

        self.lstm = nn.LSTM(input_size=self.n_features,  # attention: https://github.com/chrisvdweth/ml-toolkit/blob/master/pytorch/models/text/classifier/rnn.py
                            hidden_size=self.hidden_size,
                            num_layers=self.num_layers,
                            dropout=self.dropout,
                            bidirectional=False,
                            batch_first=True)

        self.sig = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)  # dim 1 was taken by default, but is it really??

        linear1_size = self.hidden_size // 2

        self.linear1 = nn.Sequential(
            nn.Linear(self.hidden_size, linear1_size),
            nn.ReLU(),
            nn.Dropout(self.dropout)
        )
        self.linear2 = nn.Sequential(
            nn.Linear(linear1_size, self.n_labels)
        )

        self.save_hyperparameters()  # note: this saves every __init__ argument, including loaders
        self.validation_step_y_hats = []
        self.validation_step_ys = []
        self.test_step_y_hats = []
        self.test_step_ys = []
        self.test_step_xs = []
        self.pred_step_y_hats = []
        self.pred_step_ys = []
        self.pred_step_xs = []
        self.datarow = []  # idx filename x[-1] y y_hat
        self.databatch_x = []

        print('LSTM model set.')

    def forward(self, x):
        batch_size = x["encoder_cont"].size(0)
        network_input = x["encoder_cont"]  # TODO: .squeeze(-1)

        network_input = self.batch_norm_input(network_input)
        lstm_out, _ = self.lstm(network_input)
        lstm_out = self.batch_norm_lstm(lstm_out)

        out = self.linear1(lstm_out[:, -1])
        out = self.linear2(out)
        y_pred = self.softmax(out)

        return y_pred

Solved it. It turns out my hparams.yaml was huge because I was accidentally passing my dataloaders to my LSTM model: save_hyperparameters() picked up the loaders argument, so logger.save() was trying to write all of that into hparams.yaml.
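
A minimal sketch of the fix, assuming the loaders should stay available on the module for the loaders.decode checks: keep the attribute, but exclude it from the saved hyperparameters so hparams.yaml stays small.

    import pytorch_lightning as pl

    class LSTMClassifier(pl.LightningModule):
        def __init__(self, p, loaders):
            super().__init__()
            self.loaders = loaders  # still usable for loaders.decode checks
            # ... rest of the setup as above ...
            self.save_hyperparameters(ignore=['loaders'])  # only p ends up in hparams.yaml

Alternatively, don't pass the loaders into __init__ at all and attach them to the model afterwards.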