Difference between BertForSequenceClassification and Bert + nn.Linear

I’ve been working on this recently and compared the performance of BertForSequenceClassification against a plain BertModel with my own classification head.

Indeed, dropout and weight initialization seem to be the only major differences.
In Hugging Face’s BertForSequenceClassification, unless a classifier-specific dropout rate (classifier_dropout) is set, the classifier falls back to BERT’s own hidden_dropout_prob (0.1 by default), so I apply that. Weight initialization also differs from PyTorch’s defaults, so I take the _init_weights function implemented in BertPreTrainedModel and run it over my classifier as well.
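
For context, this is roughly what the relevant bits of the transformers source do. This is my paraphrase of a recent version, not the verbatim code, and older releases may not expose classifier_dropout, so check the source of the version you’re on:

    # Inside BertForSequenceClassification.__init__ (paraphrased):
    classifier_dropout = (
        config.classifier_dropout
        if config.classifier_dropout is not None
        else config.hidden_dropout_prob
    )
    self.dropout = nn.Dropout(classifier_dropout)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    # Inside BertPreTrainedModel._init_weights (paraphrased), for nn.Linear modules:
    module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
    if module.bias is not None:
        module.bias.data.zero_()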

Here’s how I do it, inside my module’s __init__:

        # Load model and add classification head
        self.model = AutoModel.from_pretrained(huggingface_model)
        self.classifier = nn.Linear(self.model.config.hidden_size, num_labels)

        # Init classifier weights according to initialization rules of model
        self.model._init_weights(self.classifier)

        # Apply dropout rate of model
        dropout_prob = self.model.config.hidden_dropout_prob
        log.info(f"Dropout probability of classifier set to {dropout_prob}.")
        self.dropout = nn.Dropout(dropout_prob)
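
For completeness, here is a minimal sketch of the forward pass I pair with this. This is my own wiring, assuming you want to classify the pooled [CLS] representation, which is what BertForSequenceClassification does:

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # Classify the pooled [CLS] representation, as BertForSequenceClassification does
        pooled_output = outputs.pooler_output
        logits = self.classifier(self.dropout(pooled_output))
        return logits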

Those two things helped me get performance parity between the two models!
These are my quick conclusions from testing on the MNLI task, but I may have missed something. More comments are welcome, for the sake of anyone who finds this page through a search later.

More code on using Hugging Face models in PyTorch Lightning can be found in my GitHub template repository!