Word embedding: Gensim Doc2Vec

In this notebook, we will use the Gensim Doc2Vec model to compute word embeddings on our tweets dataset, before training a classification model on the lower-dimensional vector space.

We will compare this embedding-based model to the baseline model from 1_baseline.ipynb.

Load project modules and data

We will use basic Python packages, along with the Gensim package for the Doc2Vec model.
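A minimal sketch of the imports this pipeline relies on; the data-loading path and column names are assumptions, not the project's actual code:

```python
# Assumed imports for this notebook: pandas for the dataset, spaCy for
# tokenization, Gensim for Doc2Vec, scikit-learn for the classifier.
import pandas as pd
import spacy

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical path to the tweets dataset
tweets = pd.read_csv("data/tweets.csv")
```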

Text pre-processing

Before we can train our model, we need to pre-process the text data. We will tokenize the text using spaCy and vectorize the documents (lists of tokens) with the Doc2Vec word-embedding model.

Text tokenization

During the tokenization process, we apply a few pre-processing steps to clean up the raw tweets; a sketch of what such a tokenizer could look like is shown below.
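As an illustration only, a minimal spaCy tokenizer applying typical steps (lowercasing, removal of stop words, punctuation and URLs, lemmatization) could look like this; these specific steps and column names are assumptions, not necessarily the exact pipeline used here:

```python
import spacy

# Minimal spaCy tokenizer sketch (assumed steps: lowercase, drop stop words,
# punctuation and URLs, keep lemmas). Requires the en_core_web_sm model.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize(text: str) -> list[str]:
    doc = nlp(text.lower())
    return [
        token.lemma_
        for token in doc
        if not token.is_stop and not token.is_punct and not token.like_url
    ]

# "text" column name is an assumption
tweets["tokens"] = tweets["text"].apply(tokenize)
```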

Tokens vectorization

Instead of a simple Count or TF-IDF vectorizer, we will use the Doc2Vec model to vectorize the text. This model uses word embeddings to represent each document as a vector in a lower-dimensional space. We train the embedding model on the whole corpus, and then use it to vectorize the text.
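A hedged sketch of this step, training Gensim's Doc2Vec on the tokenized corpus and inferring a vector per document; the hyperparameters and variable names are illustrative assumptions, not the notebook's actual settings:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Wrap each tokenized tweet in a TaggedDocument with a unique tag.
tagged_docs = [
    TaggedDocument(words=tokens, tags=[i])
    for i, tokens in enumerate(tweets["tokens"])
]

# Illustrative hyperparameters, not the notebook's actual values.
doc2vec = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, epochs=20)
doc2vec.build_vocab(tagged_docs)
doc2vec.train(tagged_docs, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

# Infer a fixed-size vector for every document (also works on unseen text).
X = [doc2vec.infer_vector(tokens) for tokens in tweets["tokens"]]
```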

Classification model

We will train a simple Logistic Regression classifier, just like we did in 1_baseline.ipynb.
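A minimal sketch of that step, assuming X holds the inferred document vectors from above and the label column is named "target" (both assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed label column name; the split mirrors a typical train/test protocol.
y = tweets["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Compare performance on train and test sets.
print(classification_report(y_train, clf.predict(X_train)))
print(classification_report(y_test, clf.predict(X_test)))
```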

The performance on the train and test datasets is identical, so we know our model is well trained (no over- or under-fitting).

The performance on this dataset is slightly better than our baseline model's:

Our model is also biased towards the POSITIVE class, but much less than the baseline model: it predicted 19% more POSITIVE messages (174064) than NEGATIVE (145936), versus 35% for the baseline, a roughly 45% relative reduction in bias.
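The 19% figure comes directly from the two prediction counts; as a quick check:

```python
# Ratio of predicted POSITIVE to predicted NEGATIVE messages (counts from above).
positive, negative = 174064, 145936
print(f"{positive / negative - 1:.0%}")  # ≈ 19% more POSITIVE than NEGATIVE
```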

Let's observe some classification errors.
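One way to surface these errors is to compare test-set predictions with the true labels; this is a sketch that assumes the variables from the classification step above and string labels "POSITIVE" / "NEGATIVE" (the actual label encoding may differ):

```python
import pandas as pd

# Collect test-set predictions next to the true labels (variable names assumed).
errors = pd.DataFrame({"true": y_test.to_numpy(), "pred": clf.predict(X_test)})

# Label values are an assumption; adapt to the dataset's actual encoding.
false_positives = errors[(errors["true"] == "NEGATIVE") & (errors["pred"] == "POSITIVE")]
false_negatives = errors[(errors["true"] == "POSITIVE") & (errors["pred"] == "NEGATIVE")]
print(len(false_positives), len(false_negatives))
```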

On this false-positive example, the model fails to predict the sentiment of the message, but the sentiment is not obvious even for a human...

On this false-negative example, the model again fails to predict the sentiment of the message. In this case, it is fooled by the presence of negative-sounding words like "sick", "cheap", "hurts", ...