Module is not converging

Hey everyone!
I am quite new to deep learning and PyTorch Lightning, and I am having some issues with my loss values while trying to pre-train BERT for recommendation from scratch.

I followed this tutorial https://towardsdatascience.com/build-your-own-movie-recommender-system-using-bert4rec-92e4e34938c5 and used its GitHub code as my starting point for the BERT4Rec implementation.

Here is a snippet with the relevant parts of my module implementation:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.nn import Linear


class Recommender(pl.LightningModule):
    def __init__(self, vocabulary_size, features=128, mask=1, dropout=0.4, lr=5e-5, iterations=[]):
        super().__init__()
        ...
        # (the omitted lines store the arguments, e.g. self.vocabulary_size,
        # self.dropout, self.mask and self.lr, which are used below)
        self.item_embeddings = torch.nn.Embedding(self.vocabulary_size, embedding_dim=features)

        self.input_pos_embedding = torch.nn.Embedding(512, embedding_dim=features)

        encoder_layer = nn.TransformerEncoderLayer(d_model=features, nhead=4, dropout=self.dropout)

        self.encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=6)
        self.linear_out = Linear(features, self.vocabulary_size)

    def encode_src(self, src_items):
        # item embeddings plus learned positional embeddings
        src_items = self.item_embeddings(src_items)
        batch_size, in_sequence_len = src_items.size(0), src_items.size(1)
        pos_encoder = (
            torch.arange(0, in_sequence_len, device=src_items.device)
            .unsqueeze(0)
            .repeat(batch_size, 1)
        )
        pos_encoder = self.input_pos_embedding(pos_encoder)
        src_items += pos_encoder
        # TransformerEncoder expects (seq_len, batch, features)
        src = src_items.permute(1, 0, 2)
        src = self.encoder(src)
        return src.permute(1, 0, 2)

    def forward(self, src_items):
        src = self.encode_src(src_items)
        out = self.linear_out(src)
        return out

    def training_step(self, batch, batch_idx):
        src_items, y_true = batch

        y_pred = self(src_items)

        y_pred = y_pred.view(-1, y_pred.size(2))
        y_true = y_true.view(-1)

        # only the masked positions contribute to the loss
        src_items = src_items.view(-1)
        mask = src_items == self.mask

        loss = masked_ce(y_pred=y_pred, y_true=y_true, mask=mask)
        accuracy = masked_accuracy(y_pred=y_pred, y_true=y_true, mask=mask)

        self.log("train_loss", loss)
        self.log("train_accuracy", accuracy)
        return loss

    ...

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, patience=10, factor=0.1
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": scheduler,
            "monitor": "valid_loss",
        }
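
For reference, masked_ce and masked_accuracy are helper functions from the tutorial's repo. I haven't copied them here verbatim (the exact code is in the linked repo), but paraphrasing, they compute the cross-entropy and accuracy only over the masked positions, roughly like this:

import torch


def masked_ce(y_pred, y_true, mask):
    # per-position cross-entropy, then keep only the masked positions
    loss = torch.nn.functional.cross_entropy(y_pred, y_true, reduction="none")
    loss = loss * mask
    return loss.sum() / (mask.sum() + 1e-8)


def masked_accuracy(y_pred, y_true, mask):
    # accuracy computed only over the masked positions
    _, predicted = torch.max(y_pred, dim=1)
    y_true = torch.masked_select(y_true, mask)
    predicted = torch.masked_select(predicted, mask)
    return (y_true == predicted).double().mean()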

For some reason, my loss stays pretty much the same throughout training. Here is the average loss for each epoch over the first 32 epochs:
[7.691485668923165, 7.690969317763656, 7.6902515966971, 7.689588720018083, 7.686376930595757, 7.685173241345136, 7.688746468560235, 7.683287980439546, 7.685947586227585, 7.683389254160471, 7.674922955048096, 7.678214648345092, 7.6736966854817155, 7.679115080618644, 7.678637226780614, 7.677104617740299, 7.6784126775281445, 7.674682577570398, 7.672071377674977, 7.668677749099197, 7.674774644849776, 7.668729655138843, 7.676391048832341, 7.660469470439373, 7.667116234371731, 7.662718962382029, 7.663188390664987, 7.663334126229043, 7.667270759681801, 7.665728591941856, 7.665296751696307, 7.662635789857851, 7.659676546091074]
As you can see, the numbers hover around 7.7 and aren't really going anywhere. What could be the reason for this?
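
One thing I noticed about the value itself: the cross-entropy of a model that always predicts a uniform distribution over V items is ln(V), so a loss stuck around 7.7 would be consistent with the model never doing better than a uniform guess, if the vocabulary is a couple of thousand items. This is just a back-of-the-envelope check with an illustrative vocabulary size, not my actual one:

import math

# ln(V) is the cross-entropy of a uniform prediction over V classes;
# 2200 is only an illustrative vocabulary size
print(math.log(2200))  # ~7.70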

Some responses to similar issues suggest playing around with the hyperparameters, so I tried a few things:

  • Since the model could be stuck in a local minimum, it could be that the learning rate needs to be changed in order to escape it (is that right?). I tried changing my learning rate from 1e-4 to 5e-5, but it didn't help much.
  • To check whether the training works as expected, I tried to overfit my model on a small number of samples (10), roughly set up as sketched after this list, and the average loss for the first 20 epochs looks as follows:
    [0.0, 0.0, 0.0, 1.2672607898712158, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.9458177089691162, 0.0, 0.0, 0.0, 1.6752853393554688, 0.0, 0.0]
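
For context, the overfitting experiment was set up roughly like this; full_dataset and vocab_size are placeholders for my actual dataset and vocabulary size, not the exact script I ran:

import pytorch_lightning as pl
from torch.utils.data import DataLoader, Subset

# take only 10 sequences from the full training set and train on just those
small_train = Subset(full_dataset, range(10))
small_loader = DataLoader(small_train, batch_size=10, shuffle=True)

model = Recommender(vocabulary_size=vocab_size, lr=5e-5)
trainer = pl.Trainer(max_epochs=20, log_every_n_steps=1)
# reuse the same 10 sequences for validation so the "valid_loss" monitor
# used by ReduceLROnPlateau still has something to track
trainer.fit(model, train_dataloaders=small_loader, val_dataloaders=small_loader)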

Any suggestions would be much appreciated!