Baseline Model: Logistic Regression

In this notebook, we will train and evaluate a Logistic Regression classifier.

This model will be the baseline against which we will compare all other (more advanced) models.

Load project modules

The helper functions and project-specific code are placed in ../src/.
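As an illustration, the project sources can be made importable from the notebook as follows (the `helpers` module name is hypothetical; use the actual module names in ../src/):

```python
import sys
from pathlib import Path

# Make the project-specific code in ../src/ importable from this notebook.
sys.path.append(str(Path("..") / "src"))

# from helpers import ...   # `helpers` is a hypothetical module name
```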

We use the Python programming language; the code and results are presented in this JupyterLab notebook.

We will use the usual libraries for data exploration, modeling and visualisation:

We will also use libraries specific to the goals of this project:
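As a sketch, a typical import cell for this notebook could look like the one below; the exact list of libraries is an assumption, since the original import cell is not shown here:

```python
# General-purpose libraries (data handling, modeling, visualisation) -- assumed set
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# NLP-specific library used for tokenization
import spacy
nlp = spacy.load("en_core_web_sm")  # assumed English pipeline
```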

Text pre-processing

Before we can train our model, we need to pre-process the text data. We will tokenize the text using spaCy and vectorize the documents with the Tf-Idf Vectorizer (Term Frequency - Inverse Document Frequency) model.

During the tokenization process, we apply the following pre-processing steps:
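A minimal sketch of how the tokenizer and vectorizer can be wired together is shown below; the exact pre-processing steps (lowercasing, dropping stop words and punctuation, lemmatizing) are assumptions, not the notebook's exact list:

```python
def spacy_tokenizer(text):
    """Tokenize with spaCy: lemmatize, lowercase, drop stop words and punctuation."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]

# Tf-Idf weighted Bag-of-Words representation of the corpus.
# `corpus` is assumed to be the list of raw messages.
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
X_bow = vectorizer.fit_transform(corpus)
```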

Our corpus is now transformed into a BoW representation.

Classification model

We are going to train and evaluate a classification model to predict the sentiment of a new message.

Dimension reduction & Topic modeling

First, we need to reduce the dimensionality of the BoW representation: the vocabulary contains 240,589 words. We use the Latent Semantic Analysis (LSA) method, which creates the topics that are most relevant to the corpus.

The elbow method should help us choose the number of topics, but there is no clear elbow here, so we choose 50 topics.
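A minimal sketch of this step with scikit-learn's TruncatedSVD, including the explained-variance curve used for the elbow method (the exact exploration procedure is an assumption; 50 is the number of topics chosen above):

```python
# Fit LSA with a generous number of components to inspect the explained variance.
lsa_probe = TruncatedSVD(n_components=200, random_state=42).fit(X_bow)
plt.plot(np.cumsum(lsa_probe.explained_variance_ratio_))
plt.xlabel("Number of topics")
plt.ylabel("Cumulative explained variance")
plt.show()

# Keep 50 topics, as chosen above.
lsa = TruncatedSVD(n_components=50, random_state=42)
X_topics = lsa.fit_transform(X_bow)  # shape: (n_documents, 50)
```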

The dataset is now reduced to 50 topics. We can observe the composition (most relevant words) of each topic.
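One possible way to inspect the composition of each topic is to list the words with the largest weights in each LSA component, as sketched here under the same assumptions:

```python
terms = vectorizer.get_feature_names_out()

def top_words(component, n=10):
    """Return the n words with the largest absolute weight in an LSA component."""
    best = np.argsort(np.abs(component))[::-1][:n]
    return [terms[i] for i in best]

for i, comp in enumerate(lsa.components_):
    print(f"Topic {i:2d}: {', '.join(top_words(comp))}")
```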

We can identify the following topics:

Train and test the model

We can now train and test a classification model. We are going to use the Logistic Regression model.
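A minimal training sketch, assuming a `labels` array holding the POSITIVE/NEGATIVE targets (the split parameters are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# `labels` is assumed: 1 for POSITIVE, 0 for NEGATIVE
X_train, X_test, y_train, y_test = train_test_split(
    X_topics, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```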

Once the model is trained, we can observe which topics are the most relevant to the sentiment of the messages.

The most NEGATIVE topics are:

The most POSITIVE topics are:
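Since each feature is now a topic, the sign and magnitude of the Logistic Regression coefficients indicate which topics push predictions towards NEGATIVE or POSITIVE; a possible way to rank them (illustrative sketch, reusing the objects defined above):

```python
coefs = clf.coef_[0]  # one coefficient per topic

order = np.argsort(coefs)
print("Most NEGATIVE topics:", order[:5])         # most negative coefficients
print("Most POSITIVE topics:", order[-5:][::-1])  # most positive coefficients
```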

Now we can measure the performance of our model, using the confusion matrix, the Precision-Recall curve (Average Precision metric) and the ROC curve (ROC AUC metric).
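A sketch of the corresponding scikit-learn calls, reusing the objects from the training sketch above:

```python
from sklearn.metrics import (confusion_matrix, average_precision_score,
                             roc_auc_score, ConfusionMatrixDisplay,
                             PrecisionRecallDisplay, RocCurveDisplay)

y_scores = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print("Average Precision:", average_precision_score(y_test, y_scores))
print("ROC AUC:", roc_auc_score(y_test, y_scores))

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()
```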

The performance on the training and test datasets is identical, so we know our model is well trained (no over- or under-fitting).

The performance is reasonably good for a baseline model:

Our model is biased towards the POSITIVE class: it predicted 35% more POSITIVE messages (918,049) than NEGATIVE ones (681,951).

Let's observe some classification errors.
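One way to pull out a few misclassified messages (sketch; `texts_test` is an assumed array of the raw test messages aligned with `y_test`):

```python
y_test_arr = np.asarray(y_test)

# False positives: predicted POSITIVE (1) but actually NEGATIVE (0)
fp_idx = np.where((y_pred == 1) & (y_test_arr == 0))[0]
# False negatives: predicted NEGATIVE (0) but actually POSITIVE (1)
fn_idx = np.where((y_pred == 0) & (y_test_arr == 1))[0]

for i in fp_idx[:3]:
    print("FP:", texts_test[i])
for i in fn_idx[:3]:
    print("FN:", texts_test[i])
```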

In this false-positive example, the model predicts the wrong sentiment despite the presence of words like "bummer" and "should"...

In this false-negative example, the model also predicts the wrong sentiment; here it is fooled by the presence of words like "sick", "cheap", "hurts", ...