In this notebook, we will use the Gensim Doc2Vec model to compute word embeddings for our tweets dataset, before training a classification model on the resulting lower-dimensional vector space.
We will compare this embedding-based model to the baseline model from 1_baseline.ipynb.
import pickle
# Import custom helper libraries
import os
import sys
src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
    sys.path.append(src_path)
import data.helpers as data_helpers
import visualization.helpers as viz_helpers
# Maths modules
import pandas as pd
# Viz modules
import plotly.express as px
# Render for export
import plotly.io as pio
pio.renderers.default = "notebook"
# Download and unzip CSV files
!cd .. && make dataset && cd notebooks
>>> Downloading and extracting data files... Data files already downloaded. >>> OK.
# Load data from CSV
df = pd.read_csv(
    os.path.join("..", "data", "raw", "training.1600000.processed.noemoticon.csv"),
    names=["target", "id", "date", "flag", "user", "text"],
)
# Reduce memory usage
df = data_helpers.reduce_dataframe_memory_usage(df)
# Drop useless columns
df.drop(columns=["id", "date", "flag", "user"], inplace=True)
# Replace target values with labels
df.target = df.target.map(
    {
        0: "NEGATIVE",
        2: "NEUTRAL",
        4: "POSITIVE",
    }
)
df.describe()
During the tokenization process, we apply the following pre-processing steps: keep only alphabetic tokens, remove stop words, lemmatize each token with SpaCy, and lowercase the result.
# Tokenizers, Stemmers and Lemmatizers
import spacy
# Download SpaCy model
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
# Define tokenizer: SpaCy lemmatizer keeping alphabetic, non-stop-word tokens
tokenizer = lambda text: [
    token.lemma_.lower() for token in nlp(text) if token.is_alpha and not token.is_stop
]
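As a quick sanity check, here is the tokenizer applied to a made-up tweet (the exact lemmas and stop-word list depend on the SpaCy model version):
# Illustrative example; output varies with the SpaCy model version
print(tokenizer("I loved the new movie but the ending was disappointing!"))
# e.g. ['love', 'new', 'movie', 'ending', 'disappointing']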
# Processed data path
processed_data_path = os.path.join("..", "data", "processed")
tokenized_dataset_file_path = os.path.join(processed_data_path, "spacy_dataset.pkl")
if os.path.exists(tokenized_dataset_file_path):
    # Load tokenized dataset
    with open(tokenized_dataset_file_path, "rb") as f:
        X = pickle.load(f)
else:
    # Tokenize dataset
    X = df.text.apply(tokenizer)
    # Save tokenized dataset as pickle
    with open(tokenized_dataset_file_path, "wb") as f:
        pickle.dump(X, f)
Collecting en-core-web-sm==3.2.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Instead of a simple count or TF-IDF vectorizer, we will use the Doc2Vec model to vectorize the text. This model uses word embeddings to represent each document as a vector in a lower-dimensional space. We train the embedding model on the whole corpus, then use it to vectorize each tweet.
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
# Processed data path
processed_data_path = os.path.join("..", "data", "processed")
vectorized_dataset_file_path = os.path.join(
    processed_data_path, "doc2vec_spacy_dataset.pkl"
)
if os.path.exists(vectorized_dataset_file_path):
    # Load vectorized dataset
    with open(vectorized_dataset_file_path, "rb") as f:
        X = pickle.load(f)
else:
    # Tag documents for training
    X = [TaggedDocument(doc, [i]) for i, doc in enumerate(X)]
    # Train doc2vec model (default vector_size=100)
    doc2vec = Doc2Vec()
    doc2vec.build_vocab(X)
    doc2vec.train(X, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)
    # Vectorize text
    X = [doc2vec.infer_vector(doc.words) for doc in X]
    # Save vectorized dataset as pickle
    with open(vectorized_dataset_file_path, "wb") as f:
        pickle.dump(X, f)
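As a minimal sketch (assuming the corpus was vectorized in this session, so the trained doc2vec model is still in memory rather than only the pickled vectors), we can infer a vector for an arbitrary token list and check its dimensionality:
# Illustrative only: requires the `doc2vec` model trained above
vector = doc2vec.infer_vector(["love", "great", "day"])
print(vector.shape)  # (100,) with Doc2Vec's default vector_size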
We will use a simple logistic regression model as our classifier, just as we did in 1_baseline.ipynb.
from sklearn.model_selection import train_test_split
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    df.target,
    test_size=0.2,
    stratify=df.target,
    random_state=42,
)
from sklearn.linear_model import LogisticRegressionCV
# Define model
model = LogisticRegressionCV(random_state=42)
# Train model
model.fit(X_train, y_train)
LogisticRegressionCV(random_state=42)
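LogisticRegressionCV selects the regularization strength C by cross-validation. As a quick check against the fitted model above, we can inspect the selected value:
# Regularization strength chosen by cross-validation (one value per class)
print(model.C_)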
# Compute the coefficients
dimensions = [f"Dimension {i + 1}" for i in range(len(X[0]))]
coefs = pd.Series(
    model.coef_[0],
    index=dimensions,
)
# Top 20 dimensions by importance (positive and negative)
top_20_coefs = pd.concat([coefs.nlargest(10), coefs.nsmallest(10)]).sort_values()
# Plot top 20 dimensions by importance (positive and negative)
fig = px.bar(
    top_20_coefs,
    x=top_20_coefs.index,
    y=top_20_coefs.values,
    labels={"x": "Dimension", "y": "Importance", "color": "Importance"},
    title="Top 20 important dimensions",
    color=top_20_coefs.values,
)
fig.show()
viz_helpers.plot_classifier_results(
    model,
    X_train,
    y_train,
    title="Train set results",
)
viz_helpers.plot_classifier_results(
    model,
    X_test,
    y_test,
    title="Test set results",
)
The performances on the train and test sets are nearly identical, which suggests the model is well trained (no over- or under-fitting).
The performance is slightly better than our baseline model's.
Our model is also biased towards the POSITIVE class, but much less than the baseline model: it predicted 19% more POSITIVE messages (174,064) than NEGATIVE ones (145,936), versus 35% for the baseline, a roughly 45% relative reduction.
Let's observe some classification errors.
# Compute predictions
y_pred = model.predict(X)
df["prediction"] = y_pred
import shap
shap.initjs()
# Explain the linear model with SHAP, using the train set as background data
explainer = shap.Explainer(model, pd.DataFrame(X_train), feature_names=dimensions)
shap_values = explainer(X)
Linear explainer: 1600001it [00:22, 35623.76it/s]
# False positive example
fp_index = df[(df.target == "NEGATIVE") & (df.prediction == "POSITIVE")].index[0]
fp_text = df.text.values[fp_index]
print(fp_text)
shap.plots.force(shap_values[fp_index])
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
On this false-positive example, the model fails to predict the sentiment of the message; but the sentiment is not obvious even to a human reader...
# False negative example
fn_index = df[(df.target == "POSITIVE") & (df.prediction == "NEGATIVE")].index[0]
fn_text = df.text.values[fn_index]
print(fn_text)
shap.plots.force(shap_values[fn_index])
Being sick can be really cheap when it hurts too much to eat real food Plus, your friends make you soup
On this false-negative example, the model again fails to predict the sentiment of the message. In this case, it is fooled by the presence of negatively-connoted words like "sick", "cheap", and "hurts".