HuggingFace: BERT Fine-tuning

In this notebook, we will fine-tune a pre-trained model to predict the sentiment of tweets, following the HuggingFace Fine-tune with TensorFlow workflow.

We will compare this locally fine-tuned model to the baseline model from 1_baseline.ipynb.

In order to use a GPU for training, we used the Kaggle environment.

Load project modules and data

We will use basic Python packages, plus the HuggingFace transformers package, to predict text sentiment.
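As a minimal sketch of this step (the file path and column names are hypothetical placeholders, not the project's actual ones):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# "data/tweets.csv" and the "text"/"sentiment" columns are assumed
# placeholders for the project's actual labelled tweet dataset.
df = pd.read_csv("data/tweets.csv")

# Hold out a test set for the evaluation reported below.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
y_train = df_train["sentiment"].to_numpy()  # 0 = NEGATIVE, 1 = POSITIVE
```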

Text preprocessing

The text is transformed into tensors with AutoTokenizer.
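A minimal sketch of this step, assuming a df_train["text"] column that is not taken from the project code:

```python
from transformers import AutoTokenizer

# Load the tokenizer matching the checkpoint we will fine-tune
# ("bert-base-uncased" here; "vinai/bertweet-base" for the BERTweet run).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the raw tweets into padded/truncated TensorFlow tensors;
# max_length=128 is an assumption, not the project's setting.
encodings = tokenizer(
    list(df_train["text"]),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf",
)
```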

Model fine-tuning

We are going to fit a TFAutoModelForSequenceClassification model in order to adapt it to our dataset.
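Below is a minimal fine-tuning sketch following the HuggingFace TensorFlow workflow; the hyper-parameters (learning rate, batch size, epochs) and the y_train labels are assumptions, not the values used for the runs reported below:

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Load the pre-trained checkpoint with a fresh binary classification head.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A small learning rate is typical when fine-tuning a transformer;
# the logits are raw scores, hence from_logits=True.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(
    dict(encodings),   # tokenized tweets from the previous step
    y_train,           # hypothetical array of 0/1 sentiment labels
    validation_split=0.1,
    batch_size=32,
    epochs=2,
)
```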

Results and evaluation

We fine-tuned two different models and will compare the results: the standard BERT model, and the BERTweet model, which is better adapted to English tweets.

Vanilla BERT model: bert-base-uncased

The model has been trained for ~6.5h on 1M tweets on Kaggle with the GPU accelerator: oc-p7_bert_fine-tuning - Version 8 - 8-BERT-1M.

The model has more than 109M parameters, so the 1M tweets are probably not enough to train the model correctly.

The performance on the test dataset is slightly better than the baseline model's, but not as good as that of the other models:

But this model is heavily biased towards the POSITIVE class: it predicted 9.1 times (baseline = 35%, -89%) more POSITIVE (181954) messages than NEGATIVE (18046).
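For clarity, here is a minimal sketch of how this bias check can be computed; y_pred_proba is an assumed name for the model's predicted POSITIVE probabilities on the test set:

```python
import numpy as np

# y_pred_proba: hypothetical array of predicted POSITIVE probabilities
# for the 200,000 test tweets.
y_pred = (y_pred_proba >= 0.5).astype(int)

n_pos = int(np.sum(y_pred == 1))
n_neg = int(np.sum(y_pred == 0))

# "X times more" above is the relative excess (n_pos - n_neg) / n_neg:
# (181954 - 18046) / 18046 ~= 9.1 for this run, ~0.35 for the baseline.
print(f"POSITIVE: {n_pos}, NEGATIVE: {n_neg}, "
      f"excess: {(n_pos - n_neg) / n_neg:.2f}")
```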

Model adapted to English tweets: vinai/bertweet-base

The model has been trained for ~11h on 1M tweets on Kaggle with the GPU accelerator: oc-p7_bert_fine-tuning - Version 10 - 10-BERTweet-1M.

The model has more than 134M parameters, so the 1M tweets are probably not enough to train the model correctly.

This run eventually failed because it reached Kaggle's maximum execution time, but the results are still available in the logs:

```log
Confusion Matrix :
[[79351 20649]
 [10016 89984]]

ROC AUC score : 0.915
Average Precision score : 0.901

Text : "@Retrievergirl Clapton's is certainly one of the worlds greatest guitarists , and for me closely followed by Brian May"
True sentiment : 1
Predicted sentiment : 0.542
```
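A minimal sketch of how these metrics can be reproduced with scikit-learn (y_test and y_pred_proba are assumed names, not the project's actual variables):

```python
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)

# y_test: hypothetical array of true 0/1 labels on the test set;
# y_pred_proba: predicted POSITIVE probabilities from the fine-tuned model.
y_pred = (y_pred_proba >= 0.5).astype(int)

print("Confusion Matrix :")
print(confusion_matrix(y_test, y_pred))
print(f"ROC AUC score : {roc_auc_score(y_test, y_pred_proba):.3f}")
print(f"Average Precision score : "
      f"{average_precision_score(y_test, y_pred_proba):.3f}")
```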

The performance on the test dataset is much better than the baseline model's:

But this model is still noticeably biased towards the POSITIVE class: it predicted 23% (baseline = 35%, -35%) more POSITIVE (110633) messages than NEGATIVE (89367).