Air Paradis : Detect bad buzz with deep learning

Context

"Air Paradis" is an airline company who's marketing department wants to be able to detect quickly "bad buzz" on social networks, to be able to anticipate and address issues as fast as possible. They need an AI API that can detect "bad buzz" and predict the reason for it.

The goal here is to evaluate different approaches to detect "bad buzz" :

  1. Baseline Model : Logistic Regression
  2. Word embedding : Gensim Doc2Vec
  3. Azure Cognitive Services : Text Analytics API
  4. HuggingFace Transformer Pipeline : Sentiment Analysis
  5. HuggingFace : BERT Fine-tuning
  6. AzureML Studio : Automated ML
  7. AzureML Studio : Designer
  8. Custom Models : Neural Networks with Keras
  9. AzureML Studio : Notebooks

After exploring our dataset, we will compare the different approaches.

Project modules

The helper functions and project-specific code will be placed in ../src/.

We will use the Python programming language, and present the code and results here in this JupyterLab Notebook file.

We will use the usual libraries for data exploration, modeling and visualisation :

We will also use libraries specific to the goals of this project :

Exploratory data analysis (EDA)

We are going to load the data and analyse the distribution of each variable.

Load data

Let's download the data from the Kaggle Sentiment140 dataset with 1.6 million tweets.
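A minimal download sketch, assuming the kaggle package is installed and an API token is configured ; the kazanova/sentiment140 slug and the ../data/ destination are assumptions :

    from kaggle.api.kaggle_api_extended import KaggleApi

    # Authenticate with the Kaggle API (requires ~/.kaggle/kaggle.json)
    api = KaggleApi()
    api.authenticate()

    # Download and unzip the Sentiment140 dataset (assumed slug) into ../data/
    api.dataset_download_files("kazanova/sentiment140", path="../data", unzip=True)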

Now we can load the data.
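A loading sketch, assuming the unzipped CSV sits in ../data/ ; the file name, the latin-1 encoding and the absence of a header row are those of the standard Sentiment140 distribution :

    import pandas as pd

    # The CSV has no header row; the six columns are documented by Sentiment140
    columns = ["target", "ids", "date", "flag", "user", "text"]
    df = pd.read_csv(
        "../data/training.1600000.processed.noemoticon.csv",
        encoding="latin-1",
        names=columns,
    )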

Explore data

Let's display a few examples, find out how many data points are available, what the variables are and what their distribution is.
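For example, along these lines :

    # Shape, variables and a few random examples
    print(df.shape)
    print(df.dtypes)
    df.sample(5, random_state=42)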

There are 1600000 rows, each composed of 6 columns :

We are only interested in the target and text variables. The other columns are not useful for our analysis.
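One way to keep just those two columns ; in Sentiment140 the target is encoded as 0 (negative) or 4 (positive), which we map to readable labels :

    # Keep only the target and text variables
    df = df[["target", "text"]]

    # Map the numeric encoding (0 = negative, 4 = positive) to readable labels
    df["target"] = df["target"].map({0: "NEGATIVE", 4: "POSITIVE"})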

There are exactly as many (800000) POSITIVE tweets as NEGATIVE tweets. There are no NEUTRAL tweets. The problem is well balanced and there will be no bias towards one class during the training of our models.

There is no big difference in character count between POSITIVE and NEGATIVE tweets, but NEGATIVE tweets are slightly longer than POSITIVE tweets. In both classes, there are two modes : ~45 characters and 138 characters (the maximum allowed at the time).

The word-count distributions are also similar, although NEGATIVE tweets are significantly longer than POSITIVE tweets. In both classes, there are two modes : ~7 words and ~20 words.
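A sketch of how these length distributions can be computed and plotted (a simple whitespace split is used here as a rough word count) :

    import matplotlib.pyplot as plt

    # Character and word counts per tweet
    df["n_chars"] = df["text"].str.len()
    df["n_words"] = df["text"].str.split().str.len()

    # One histogram per class for each measure
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for label, group in df.groupby("target"):
        axes[0].hist(group["n_chars"], bins=50, alpha=0.5, label=label)
        axes[1].hist(group["n_words"], bins=40, alpha=0.5, label=label)
    axes[0].set_xlabel("characters per tweet")
    axes[1].set_xlabel("words per tweet")
    axes[0].legend()
    plt.show()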

Text analysis

We will look in more detail at what the text variable contains.

First, we will transform the dataset into a Bag of Words representation with TF-IDF (Term Frequency - Inverse Document Frequency) weights. To achieve this, we are going to use the SpaCy tokenizer.
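A minimal sketch of this vectorization, assuming a blank spaCy English pipeline is enough for tokenization (the actual notebook may filter tokens differently) :

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    # A blank pipeline provides only the spaCy tokenizer, which is all we need here
    nlp = spacy.blank("en")

    def spacy_tokenize(text):
        # Lowercase tokens, ignoring punctuation and whitespace
        return [t.text.lower() for t in nlp(text) if not t.is_punct and not t.is_space]

    # Bag of Words with TF-IDF weights, using the custom tokenizer
    vectorizer = TfidfVectorizer(tokenizer=spacy_tokenize, min_df=5)
    X = vectorizer.fit_transform(df["text"])

The min_df threshold is an assumption ; it simply keeps the vocabulary to terms that appear in at least a few tweets.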

Our corpus is now transformed into a BoW representation. We can analyse the word frequencies.
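For instance, by summing the TF-IDF weights of each term over the whole corpus :

    import numpy as np

    # Total TF-IDF weight of each term over the corpus
    weights = np.asarray(X.sum(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names_out())

    # Top 20 terms by total weight
    print(terms[np.argsort(weights)[::-1][:20]])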

We can see that the most important words are actually meaningful and relevant with respect to the sentiment associated with each message.

Models comparison

In this section, we are going to compare the metrics of the models we have tested in the other Notebooks.

Raw metrics

Each model is built and tested in the corresponding Notebooks (cf. list above). We are only looking at the classification metrics here.
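A purely illustrative way of gathering them, assuming each notebook exported its metrics as a JSON file under ../metrics/ (the actual storage layout may differ) :

    from pathlib import Path
    import pandas as pd

    # Hypothetical layout: one JSON file of classification metrics per model notebook
    records = []
    for path in Path("../metrics").glob("*.json"):
        record = pd.read_json(path, typ="series")
        record["model"] = path.stem
        records.append(record)

    metrics = pd.DataFrame(records).set_index("model")
    metrics.sort_values("accuracy", ascending=False)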

Observations

Best model

The best model is (by far) the one obtained in 6.2 - AzureML Automated ML : 10h on GPU :

Pros
Cons

Off-the-shelf models (cloud or pre-trained)

These models (3.1 - Azure Cognitive Service API and 4 - HuggingFace Sentiment Analysis) have produced average results.

Adding a classification model on top of the Azure Cognitive Service predictions did not improve the results much (3.2 - Logistic Regression on Azure Cognitive Service).

Pros
Cons

Fine-tuned BERT model

We've seen that a fine-tuned BERT model can be very efficient as a pre-processing layer (cf. 6.2 - AzureML Automated ML : 10h on GPU).

But directly fine-tuning BERT for sentiment analysis (5.1 - HuggingFace : BERT Fine-tuning) proved to be a real challenge : we were only able to obtain average results after ~6.5h of training on a GPU.

Using a model better suited to tweets (5.2 - HuggingFace : BERTweet Fine-tuning) greatly improved the results, at the cost of a longer training time (11h).

Given that the BERT model has more than 109M parameters (134M for BERTweet), our dataset is probably too small to fine-tune the model efficiently.
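For reference, a minimal fine-tuning sketch with the HuggingFace Trainer ; the checkpoint name, the hyper-parameters and the 90/10 split are illustrative, not the exact settings used in notebook 5 :

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # bert-base-uncased for 5.1; vinai/bertweet-base for the BERTweet variant (5.2)
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Build train / eval splits from the tweets (labels must be integers)
    data = df.assign(label=(df["target"] == "POSITIVE").astype(int))[["text", "label"]]
    splits = Dataset.from_pandas(data).train_test_split(test_size=0.1, seed=42)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    train_ds = splits["train"].map(tokenize, batched=True)
    eval_ds = splits["test"].map(tokenize, batched=True)

    args = TrainingArguments(output_dir="bert-finetuned",
                             num_train_epochs=2,
                             per_device_train_batch_size=32)
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()

Passing the tokenizer to the Trainer lets it pad each batch dynamically instead of padding every tweet to the maximum length.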

Pros
Cons

Custom Neural Networks

Starting from a basic model (8.1 - FFNN on word counts), then adding an Embedding layer (8.4 - FFNN with custom Embedding), recurrent layers (8.6 - RNN, 8.7 - LSTM, 8.8 - Bidirectional-LSTM) and finally a second stacked Bidirectional-LSTM layer (8.9 - Stacked Bidirectional-LSTM) showed that more complexity does not mean better results.

Adding the Embedding layer actually reduced the classification performance. Adding the recurrent layer improved the model with the Embedding layer, and adding the LSTM layer improved the results further. The Bidirectional-LSTM layer only slightly improved the results, and stacking a second Bidirectional-LSTM did not significantly change them.
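A sketch of the Bidirectional-LSTM variant in Keras (vocabulary size, sequence length and layer sizes are assumptions, not the exact values used in notebook 8) :

    import tensorflow as tf
    from tensorflow.keras import layers

    vocab_size, seq_len, embed_dim = 20000, 50, 128

    model = tf.keras.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, embed_dim),
        # For the stacked variant, insert a first Bidirectional(LSTM(..., return_sequences=True)) here
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()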

Pros
Cons