"Avis Restau" is a start-up who's goal is to connect restaurants and customers. Customers will be able to post photos and reviews of the restaurants they have visited.
The goal here is to identify the topics of bad customer reviews and to label photos (indoor or outdoor, food or drink, etc.).
The helper functions and project-specific code are placed in ../src/.
We will use the Python programming language and present the code and results in this JupyterLab notebook.
We will use the usual libraries for data exploration, modeling and visualisation :
We will also use libraries specific to the goals of this project :
# Import custom helper libraries
import os
import sys
src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
sys.path.append(src_path)
import features.helpers as feat_helpers
import data.helpers as data_helpers
import visualization.helpers as viz_helpers
# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
YELP_CLIENT_ID = os.getenv("YELP_CLIENT_ID")
YELP_API_KEY = os.getenv("YELP_API_KEY")
# Set up logging
import logging
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)
# System modules
import random
# ML modules
import pandas as pd
import numpy as np
# Viz modules
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
# Sample data for development
TEXT_SAMPLE_SIZE = 10 * 1000 # <= 0 for all
PHOTO_SAMPLE_SIZE = 10 * 1000 # <= 0 for all
import plotly.io as pio
pio.renderers.default = "notebook"
# Download SpaCy model
!python -m spacy download en_core_web_sm
2021-12-13 10:08:18.422080: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-13 10:08:18.422164: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Collecting en-core-web-sm==3.2.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
We will use Yelp's GraphQL API to get the data. We will load the Reviews (~3 reviews per restaurant) and Photos (1 photo per restaurant) of 1000 restaurants from 5 locations (200 restaurants per location).
We download the dataset from the Yelp GraphQL API.
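The actual download is handled by the `make dataset` target below (src/data/make-dataset). For context only, here is a minimal sketch of what a Yelp GraphQL request can look like; the selected fields (alias, rating, price, reviews, photos) are assumed from Yelp's public GraphQL schema and the paging values are illustrative, not the project script.
import os
import requests
# Illustrative Yelp GraphQL query (not the project's download code)
query = """
query ($location: String!, $limit: Int!, $offset: Int!) {
  search(term: "restaurants", location: $location, limit: $limit, offset: $offset) {
    business {
      alias
      rating
      price
      review_count
      reviews { text rating }
      photos
    }
  }
}
"""
response = requests.post(
    "https://api.yelp.com/v3/graphql",
    json={"query": query, "variables": {"location": "Paris", "limit": 50, "offset": 0}},
    headers={"Authorization": f"Bearer {os.getenv('YELP_API_KEY')}"},
)
businesses = response.json()["data"]["search"]["business"]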
# Download and unzip CSV files
!cd .. && make dataset && cd notebooks
>>> Downloading and saving data files...
python -m src.data.make-dataset -t data/raw/api/
INFO:root:Data already downloaded
>>> OK.
DATA_PATH = "../data/raw/api/"
businesses_csv_path = os.path.join(DATA_PATH, "businesses.csv")
reviews_csv_path = os.path.join(DATA_PATH, "reviews.csv")
photos_csv_path = os.path.join(DATA_PATH, "photos.csv")
if (
os.path.exists(businesses_csv_path)
and os.path.exists(reviews_csv_path)
and os.path.exists(photos_csv_path)
):
logging.info(f"Data found, loading from {DATA_PATH}")
businesses_df = pd.read_csv(businesses_csv_path)
reviews_df = pd.read_csv(reviews_csv_path)
photos_df = pd.read_csv(photos_csv_path)
else:
logging.error("Data not found, please run `make dataset`")
# Fix dtypes
businesses_df["business_alias"] = businesses_df["business_alias"].astype(str)
businesses_df["business_review_count"] = businesses_df["business_review_count"].astype(
int
)
businesses_df["business_rating"] = businesses_df["business_rating"].astype(float)
businesses_df["business_price"] = businesses_df["business_price"].astype(int)
businesses_df["business_city"] = businesses_df["business_city"].astype(str)
businesses_df["business_state"] = businesses_df["business_state"].astype(str)
businesses_df["business_postal_code"] = businesses_df["business_postal_code"].astype(
str
)
businesses_df["business_country"] = businesses_df["business_country"].astype(str)
businesses_df["business_latitude"] = businesses_df["business_latitude"].astype(float)
businesses_df["business_longitude"] = businesses_df["business_longitude"].astype(float)
businesses_df["business_categories"] = businesses_df["business_categories"].astype(str)
businesses_df["business_parent_categories"] = businesses_df[
"business_parent_categories"
].astype(str)
reviews_df["business_alias"] = reviews_df["business_alias"].astype(str)
reviews_df["review_text"] = reviews_df["review_text"].astype(str)
reviews_df["review_rating"] = reviews_df["review_rating"].astype(float)
photos_df["business_alias"] = photos_df["business_alias"].astype(str)
photos_df["photo_url"] = photos_df["photo_url"].astype(str)
photos_df["file_name"] = photos_df["file_name"].astype(str)
# Reduce memory usage
businesses_df = data_helpers.reduce_dataframe_memory_usage(businesses_df)
reviews_df = data_helpers.reduce_dataframe_memory_usage(reviews_df)
photos_df = data_helpers.reduce_dataframe_memory_usage(photos_df)
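reduce_dataframe_memory_usage is a project helper in ../src/data/helpers.py whose implementation is not shown here. As a minimal sketch (an assumption, not the actual helper), such a function typically downcasts numeric columns and converts low-cardinality string columns to categoricals:
import pandas as pd
def downcast_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative memory reducer: downcast numerics, categorize low-cardinality strings."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif pd.api.types.is_object_dtype(df[col]) and df[col].nunique() < 0.5 * len(df):
            df[col] = df[col].astype("category")
    return df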
We will just display here a few statistics about each DataFrame.
The dataset is composed of 1000 businesses from 5 locations (Paris, New York, Tokyo, Rio de Janeiro and Sydney). Each business has a unique ID, a name, a category, an average rating, a price category, a city, a state, a country, a postal code, a latitude and a longitude.
businesses_df.head()
business_alias | business_review_count | business_rating | business_price | business_city | business_state | business_postal_code | business_country | business_latitude | business_longitude | business_categories | business_parent_categories | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | le-comptoir-de-la-gastronomie-paris | 1107 | 4.5 | 2 | Paris | 75 | 75001 | FR | 48.864517 | 2.345402 | ["french"] | ["restaurants"] |
1 | l-as-du-fallafel-paris | 1810 | 4.5 | 1 | Paris | 75 | 75004 | FR | 48.857498 | 2.359080 | ["kosher", "falafel", "sandwiches"] | ["restaurants", "mediterranean"] |
2 | angelina-paris | 1347 | 4.0 | 3 | Paris | 75 | 75001 | FR | 48.865093 | 2.328464 | ["tea", "cakeshop", "breakfast_brunch"] | ["restaurants", "food"] |
3 | l-avant-comptoir-paris-3 | 612 | 4.5 | 2 | Paris | 75 | 75006 | FR | 48.852020 | 2.338800 | ["wine_bars", "tapas"] | ["bars", "restaurants"] |
4 | la-coïncidence-paris-4 | 493 | 4.5 | 2 | Paris | 75 | 75116 | FR | 48.868107 | 2.284365 | ["french"] | ["restaurants"] |
businesses_df.describe(include="all")
business_alias | business_review_count | business_rating | business_price | business_city | business_state | business_postal_code | business_country | business_latitude | business_longitude | business_categories | business_parent_categories | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000 | 1000 | 1000 | 1000 | 1000.000000 | 1000.000000 | 1000 | 1000 |
unique | 1000 | NaN | NaN | NaN | 40 | 6 | 301 | 5 | NaN | NaN | 530 | 50 |
top | le-comptoir-de-la-gastronomie-paris | NaN | NaN | NaN | Paris | 13 | 2000 | AU | NaN | NaN | ["french"] | ["restaurants"] |
freq | 1 | NaN | NaN | NaN | 200 | 200 | 139 | 200 | NaN | NaN | 56 | 551 |
mean | NaN | 529.395000 | 4.270500 | 2.144000 | NaN | NaN | NaN | NaN | 13.688472 | 35.219700 | NaN | NaN |
std | NaN | 1134.869631 | 0.353135 | 0.897814 | NaN | NaN | NaN | NaN | 34.823799 | 93.358452 | NaN | NaN |
min | NaN | 6.000000 | 3.000000 | 0.000000 | NaN | NaN | NaN | NaN | -33.897026 | -74.016022 | NaN | NaN |
25% | NaN | 30.000000 | 4.000000 | 2.000000 | NaN | NaN | NaN | NaN | -22.983292 | -43.218462 | NaN | NaN |
50% | NaN | 71.000000 | 4.500000 | 2.000000 | NaN | NaN | NaN | NaN | 35.673141 | 2.340317 | NaN | NaN |
75% | NaN | 265.750000 | 4.500000 | 3.000000 | NaN | NaN | NaN | NaN | 40.751259 | 139.770420 | NaN | NaN |
max | NaN | 13047.000000 | 5.000000 | 4.000000 | NaN | NaN | NaN | NaN | 48.890209 | 151.298248 | NaN | NaN |
fig = px.histogram(
businesses_df, x="business_rating", marginal="box", color="business_country"
)
fig.show()
fig = px.histogram(
businesses_df, x="business_price", marginal="box", color="business_country"
)
fig.show()
We can see that French and Japanese restaurants have higher review ratings, but Japanese restaurants are less expensive than French ones.
There are no empty values or outliers.
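The one_hot_encode_list_variables helper used below lives in ../src/features/helpers.py. A minimal sketch of how list-valued columns such as business_categories could be expanded into binary indicator columns (illustrative only; the real helper may differ):
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
def one_hot_encode_list_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Illustrative: expand a column of stringified lists into 0/1 indicator columns."""
    lists = df[column].map(ast.literal_eval)  # '["french"]' -> ["french"]
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(lists),
        columns=[f"{column}_{cat}" for cat in mlb.classes_],
        index=df.index,
    )
    return df.join(encoded)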
# Encode categories
df = feat_helpers.one_hot_encode_list_variables(
businesses_df, ["business_categories", "business_parent_categories"]
)
# Plot the correlation of categories with rating
corr_rating = (
df[
["business_rating", "business_price"]
+ [col for col in df.columns if col.startswith("business_categories")]
]
.corr()[["business_rating"]]
.drop("business_rating")
.sort_values(
by=["business_rating"],
ascending=False,
)
)
fig = px.bar(corr_rating.head(10).append(corr_rating.tail(10)), color="value")
fig.update_layout(
title="Top 20 Correlations with Business Rating",
xaxis_title="Category",
yaxis_title="Correlation",
)
fig.show()
# Plot the correlation of categories with price
corr_price = (
df[
["business_price", "business_rating"]
+ [col for col in df.columns if col.startswith("business_categories")]
]
.corr()[["business_price"]]
.drop("business_price")
.sort_values(
by=["business_price"],
ascending=False,
)
)
fig = px.bar(corr_price.head(10).append(corr_price.tail(10)), color="value")
fig.update_layout(
title="Top 20 Correlations with Business Price",
xaxis_title="Category",
yaxis_title="Correlation",
)
fig.show()
Overall, pricey restaurants tend to have lower review ratings: people are pickier when they spend more money. The correlations confirm that French restaurants are overpriced (high positive correlation with price), while Japanese restaurants are cheap (high negative correlation with price). Ramen restaurants seem to be the best value-for-money deals (in the top 10 ratings and bottom 10 prices).
The dataset is composed of 2928 reviews from 1000 businesses (~3 reviews per business). Each review has a business alias, a text and a rating.
reviews_df.head()
business_alias | review_text | review_rating | |
---|---|---|---|
0 | le-comptoir-de-la-gastronomie-paris | This review is from our 2019 trip. Shame on m... | 5.0 |
1 | le-comptoir-de-la-gastronomie-paris | This place def lives up the hype. Best French... | 5.0 |
2 | le-comptoir-de-la-gastronomie-paris | While planning a friends trip to Paris, I came... | 5.0 |
3 | l-as-du-fallafel-paris | This is the best falafel sandwich I have ever ... | 5.0 |
4 | l-as-du-fallafel-paris | IMO this is a must try in Paris. Located in ... | 5.0 |
reviews_df.describe(include="all")
business_alias | review_text | review_rating | |
---|---|---|---|
count | 2928 | 2928 | 2928.000000 |
unique | 990 | 2928 | NaN |
top | le-comptoir-de-la-gastronomie-paris | This review is from our 2019 trip. Shame on m... | NaN |
freq | 3 | 1 | NaN |
mean | NaN | NaN | 4.396516 |
std | NaN | NaN | 0.904879 |
min | NaN | NaN | 1.000000 |
25% | NaN | NaN | 4.000000 |
50% | NaN | NaN | 5.000000 |
75% | NaN | NaN | 5.000000 |
max | NaN | NaN | 5.000000 |
reviews_text_df = pd.DataFrame()
reviews_text_df["len"] = reviews_df["review_text"].str.len()
reviews_text_df["wc"] = reviews_df["review_text"].str.split().str.len()
fig = px.histogram(reviews_text_df, x="len", marginal="box", title="Review Text Length")
fig.show()
fig = px.histogram(
reviews_text_df, x="wc", marginal="box", title="Review Text Word Count"
)
fig.show()
fig = px.histogram(
reviews_df, x="review_rating", marginal="box", title="Review Ratings"
)
fig.show()
Reviews are truncated to 160 characters (~30 words per review). This can be a problem when the part of the text that would explain the review rating has been cut off.
We can see that review ratings are highly skewed towards high grades: more than half of the reviews are 5 stars.
There are no empty values or outliers.
ratings_diff = reviews_df.join(
businesses_df.set_index("business_alias")[["business_rating"]],
on="business_alias",
)
reviews_df["rating_diff"] = (
ratings_diff["review_rating"] - ratings_diff["business_rating"]
)
reviews_df["rating_sentiment"] = (reviews_df["rating_diff"] > 0).astype(int)
reviews_df[["rating_diff", "rating_sentiment"]].describe()
rating_diff | rating_sentiment | |
---|---|---|
count | 2928.000000 | 2928.000000 |
mean | 0.125683 | 0.572404 |
std | 0.875052 | 0.494814 |
min | -3.500000 | 0.000000 |
25% | -0.500000 | 0.000000 |
50% | 0.500000 | 1.000000 |
75% | 0.500000 | 1.000000 |
max | 2.000000 | 1.000000 |
fig = px.histogram(
reviews_df,
x="rating_diff",
marginal="box",
title="Rating Difference : Review ratings - Business average ratings",
color=reviews_df["rating_sentiment"],
)
fig.show()
We can see that in our Reviews dataset, ratings are a bit higher than the average ratings from the Businesses dataset.
For the rest of this analysis, we will consider the difference between a review's rating and the average rating of the business. A review is considered positive if its rating is higher than the business's average rating, and negative if it is lower than or equal to it.
The dataset is composed of 999 photos from 1000 businesses (~1 photo per business). Each photo has a business alias and a photo URL.
photos_df.head()
business_alias | photo_url | file_name | |
---|---|---|---|
0 | le-comptoir-de-la-gastronomie-paris | https://s3-media2.fl.yelpcdn.com/bphoto/Je6THJ... | le-comptoir-de-la-gastronomie-paris_0200c0c54c... |
1 | l-as-du-fallafel-paris | https://s3-media2.fl.yelpcdn.com/bphoto/wdIhzK... | l-as-du-fallafel-paris_ab3344d5839c2238e825b28... |
2 | angelina-paris | https://s3-media3.fl.yelpcdn.com/bphoto/DPM5TB... | angelina-paris_0aced3805db4ea7246d49c771fc48d8... |
3 | l-avant-comptoir-paris-3 | https://s3-media3.fl.yelpcdn.com/bphoto/mVwgxg... | l-avant-comptoir-paris-3_73670e8469e59d89b41bf... |
4 | la-coïncidence-paris-4 | https://s3-media1.fl.yelpcdn.com/bphoto/QdrAgE... | la-coïncidence-paris-4_3b775bee0b2de9e4fa369fb... |
photos_df.describe(include="all")
business_alias | photo_url | file_name | |
---|---|---|---|
count | 999 | 999 | 999 |
unique | 999 | 999 | 999 |
top | le-comptoir-de-la-gastronomie-paris | https://s3-media2.fl.yelpcdn.com/bphoto/Je6THJ... | le-comptoir-de-la-gastronomie-paris_0200c0c54c... |
freq | 1 | 1 | 1 |
We will study the photos more deeply in the dedicated section.
We will also use the Academic dataset provided by Yelp (https://www.yelp.com/dataset), composed of 8,635,403 reviews, 160,585 businesses and 200,000 pictures from 8 metropolitan areas.
We are only going to use the reviews and photos data. Since the dataset is huge, we are going to sample a small subset of the data.
# Sample data for development
TEXT_SAMPLE_SIZE = 10 * 1000 # <= 0 for all
# Load academic dataset
if os.path.exists("../data/processed/academic/reviews.pkl.gz"):
# Load academic data from pickle file
logger.info(">>> Loading reviews from pickle file...")
reviews_academic_df = pd.read_pickle("../data/processed/academic/reviews.pkl.gz")
logger.info(f">>> OK : {len(reviews_academic_df)} reviews loaded from pickle file.")
else:
# Load academic data from raw JSON file
logger.info(">>> Loading reviews from JSON file...")
reviews_academic_df = pd.DataFrame()
with pd.read_json(
"../data/raw/academic/yelp_academic_dataset_review.json",
dtype={
"review_id": str,
"user_id": str,
"business_id": str,
"stars": int,
"useful": int,
"funny": int,
"cool": int,
"text": str,
"date": "datetime64[ns]",
},
chunksize=500 * 1000,
lines=True,
) as json_reader:
# Load data in chunks
for chunk in json_reader:
reviews_academic_df = reviews_academic_df.append(chunk)
logger.info(f"Loaded {len(reviews_academic_df)} reviews")
logger.info(f">>> OK : {len(reviews_academic_df)} reviews loaded from JSON file.")
# Reduce memory usage
reviews_academic_df = data_helpers.reduce_dataframe_memory_usage(
reviews_academic_df
)
# Save as pickle
logger.info(">>> Saving reviews data as pickle file...")
os.makedirs("../data/processed/academic/", exist_ok=True)
reviews_academic_df.to_pickle("../data/processed/academic/reviews.pkl.gz")
logger.info(
">>> OK : Reviews data saved to ../data/processed/academic/reviews.pkl.gz ."
)
if TEXT_SAMPLE_SIZE > 0:
# Sample data
logger.info(">>> Sampling reviews data...")
reviews_academic_df = reviews_academic_df.sample(TEXT_SAMPLE_SIZE, random_state=42)
logger.info(f">>> OK : Data sampled to {len(reviews_academic_df)} reviews.")
We will just display a few statistics about this DataFrame.
reviews_academic_df.describe(include="all", datetime_is_numeric=True)
review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
---|---|---|---|---|---|---|---|---|---|
count | 10000 | 10000 | 10000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000 | 10000 |
unique | 10000 | 9726 | 8479 | NaN | NaN | NaN | NaN | 10000 | 10000 |
top | 8W--5RJuDQbsTjiKJaho7A | RtGqdDBvvBCjcu5dUqwfzA | WkN8Z2Q8gbhjjkCt8cDVxg | NaN | NaN | NaN | NaN | Most fave place to get live seafood. Been a cu... | 2014-09-13 09:43:57 |
freq | 1 | 9 | 11 | NaN | NaN | NaN | NaN | 1 | 1 |
mean | NaN | NaN | NaN | 3.728400 | 1.260800 | 0.407000 | 0.499600 | NaN | NaN |
std | NaN | NaN | NaN | 1.465554 | 3.214843 | 1.874594 | 2.279324 | NaN | NaN |
min | NaN | NaN | NaN | 1.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
25% | NaN | NaN | NaN | 3.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
50% | NaN | NaN | NaN | 4.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
75% | NaN | NaN | NaN | 5.000000 | 1.000000 | 0.000000 | 0.000000 | NaN | NaN |
max | NaN | NaN | NaN | 5.000000 | 123.000000 | 100.000000 | 124.000000 | NaN | NaN |
# Number of reviews per user
reviews_per_user = reviews_academic_df.groupby("user_id").count()["review_id"]
fig = px.histogram(
reviews_per_user,
title=f"Number of reviews per user (N={len(reviews_academic_df)})",
log_y=True,
histnorm="probability density",
)
fig.update_layout(
xaxis_title_text="Number of reviews",
yaxis_title_text="Ratio of users (log)",
)
fig.show()
Most users (~98%) write only one review.
# Number of reviews per business
reviews_per_business = reviews_academic_df.groupby("business_id").count()["review_id"]
fig = px.histogram(
reviews_per_business,
title=f"Number of reviews per business (N={len(reviews_academic_df)})",
log_y=True,
histnorm="probability density",
)
fig.update_layout(
xaxis_title_text="Number of reviews",
yaxis_title_text="Ratio of businesses (log)",
)
fig.show()
Most businesses (~87%) have only one review.
# Number of reviews per stars
fig = px.histogram(
reviews_academic_df.sample(1000, random_state=42),
x="stars",
title=f"Reviews by Stars (N={len(reviews_academic_df)}, n=1000)",
marginal="box",
histnorm="probability",
)
fig.show()
The stars distribution is quite similar to the API dataset: half of the reviews are 5 stars. But there are more 2- and 3-star reviews and many more 1-star reviews. This suggests that at least one of the two datasets is not representative of reality.
# Number of reviews per reaction
fig = px.histogram(
reviews_academic_df[["useful", "funny", "cool"]],
title=f"Useful / Funny / Cool score per review (N={len(reviews_academic_df)})",
log_y=True,
histnorm="probability density",
barmode="stack",
)
fig.update_layout(
xaxis_title_text="Useful / Funny / Cool score",
yaxis_title_text="Ratio of reviews (log)",
)
fig.show()
We can see that the Academic dataset is somewhat different from the API dataset.
In this section, we are going to identify the topics of bad reviews. To achieve this, we will represent the reviews as a bag of words (BoW) and use topic modelling algorithms to identify the topics of the reviews. We will compare different text processing techniques and the results of several topic modelling algorithms.
We are going to use a target variable for regression (rating difference) and a target variable for classification (review sentiment).
from sklearn.model_selection import train_test_split
## API dataset
# Sentiment analysis : regression of rating difference
X = reviews_df["review_text"]
y = reviews_df["rating_diff"]
# Free memory
del reviews_df
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Sentiment analysis : binary classification
y_bi = (y > 0).astype(int)
y_train_bi = (y_train > 0).astype(int)
y_test_bi = (y_test > 0).astype(int)
## Academic dataset
X_academic = reviews_academic_df["text"]
y_academic = reviews_academic_df["stars"]
# Free memory
del reviews_academic_df
We are going to compare different combinations of text tokenizers and vectorizers. The goal here is to eliminate the noise of the reviews and to reduce the dimension of the dataset, while trying to keep the meaning of the reviews.
Vectorizer: transforms a list of tokens into a vector of features, with more or less processing of the tokens:
- CountVectorizer: just counts the token occurrences in each document of the corpus
- TfidfVectorizer: counts the token occurrences in each document of the corpus and then weighs them by the inverse document frequency (IDF) of each token (if a token is present in every document, it is not specific to any document and its IDF will be low)
- strip_accents: replace special characters
- lowercase: transform all the tokens to lowercase
- stop_words: remove tokens that belong to a list of stopwords (e.g. "the", "a", "of", ...)
- {min,max}_df: filter out tokens that appear in fewer than min_df documents or in more than max_df documents, to prevent features from being too specific or too general
- ngrams: create n-grams from the tokens, i.e. combinations of n consecutive tokens
Tokenizer: transforms a string of text into a list of tokens, with more or less processing of the text:
- None: no text processing, just split the string into words (default)
- PorterStemmer: stemming is the process of removing prefixes or suffixes of words (e.g. "ing", "ed", "ly", ...) to collapse variations of the same token (e.g. "running", "run", "runs", ...)
- WordNetLemmatizer: lemmatization is the process of changing a word into its base form (e.g. "running" -> "run", "runs" -> "run", ...)
- SpaCy: SpaCy is a natural language processing (NLP) library that provides tokenization, lemmatization and stopword removal based on statistical language models
# Vectorizers
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Tokenizers, Stemmers and Lemmatizers
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy
nltk.download("stopwords")
nltk.download("wordnet")
stopwords = set(stopwords.words("english"))
nlp = spacy.load("en_core_web_sm")
# PoS (Part of Speech) tagger
def pos_tagger(nltk_tag: str) -> str:
"""
Translate NLTK POS tags to WordNet POS tags.
Args:
nltk_tag (str): NLTK POS tag.
Returns:
str: WordNet POS tag.
"""
if nltk_tag.startswith("J"):
return wordnet.ADJ
elif nltk_tag.startswith("V"):
return wordnet.VERB
elif nltk_tag.startswith("N"):
return wordnet.NOUN
elif nltk_tag.startswith("R"):
return wordnet.ADV
else:
return wordnet.NOUN
# Tokenizers : simple to complex
tokenizers = {
"None": None, # basic word tokenizer
"stopwords": lambda text: [ # remove stopwords
token.lower()
for token in word_tokenize(text)
if token.isalpha() and token.lower() not in stopwords
],
"PorterStemmer": lambda text: [ # Porter Stemmer
PorterStemmer().stem(token).lower()
for token in word_tokenize(text)
if token.isalpha() and token.lower() not in stopwords
],
"WordNetLemmatizer": lambda text: [ # WordNet Lemmatizer
WordNetLemmatizer().lemmatize(token, pos_tagger(pos)).lower()
for token, pos in pos_tag(word_tokenize(text))
if token.isalpha() and token.lower() not in stopwords
],
"SpaCy": lambda text: [ # SpaCy Lemmatizer
token.lemma_.lower()
for token in nlp(text)
if token.is_alpha and not token.is_stop
],
}
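# Illustrative sanity check (the sample sentence below is made up, not taken from the dataset):
# compare the plain NLTK tokenizer with the SpaCy lemmatizing tokenizer defined above.
sample_review = "The waiters were really friendly and the desserts were amazing!"
print(word_tokenize(sample_review))        # raw word tokens
print(tokenizers["SpaCy"](sample_review))  # lowercased lemmas, stopwords removed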
# Basic vectorizers : simple tokenization, no stemming or lemmatization
vectorizers = {
"CountVectorizer": CountVectorizer(),
"TfidfVectorizer": TfidfVectorizer(),
"CountVectorizer + strip_accents + lowercase": CountVectorizer(
strip_accents="unicode",
lowercase=True,
),
"TfidfVectorizer + strip_accents + lowercase": TfidfVectorizer(
strip_accents="unicode",
lowercase=True,
),
"CountVectorizer + strip_accents + lowercase + stop_words": CountVectorizer(
strip_accents="unicode",
lowercase=True,
stop_words=stopwords,
),
"TfidfVectorizer + strip_accents + lowercase + stop_words": TfidfVectorizer(
strip_accents="unicode",
lowercase=True,
stop_words=stopwords,
),
"CountVectorizer + strip_accents + lowercase + stop_words + {min,max}_df": CountVectorizer(
strip_accents="unicode",
lowercase=True,
stop_words=stopwords,
max_df=0.9,
min_df=0.01,
),
"TfidfVectorizer + strip_accents + lowercase + stop_words + {min,max}_df": TfidfVectorizer(
strip_accents="unicode",
lowercase=True,
stop_words=stopwords,
max_df=0.9,
min_df=0.01,
),
"CountVectorizer + strip_accents + lowercase + stop_words + {min,max}_df + ngrams": CountVectorizer(
strip_accents="unicode",
lowercase=True,
stop_words=stopwords,
max_df=0.9,
min_df=0.01,
ngram_range=(1, 3),
),
"TfidfVectorizer + strip_accents + lowercase + stop_words + {min,max}_df + ngrams": TfidfVectorizer(
strip_accents="unicode",
lowercase=True,
stop_words=stopwords,
max_df=0.9,
min_df=0.01,
ngram_range=(1, 3),
),
}
# Enrich vectorizers with more complex tokenizers : stemming and lemmatization
for tokenizer_name, tokenizer in tokenizers.items():
vectorizers[
f"CountVectorizer + strip_accents + lowercase + {{min,max}}_df + ngrams + {tokenizer_name}"
] = CountVectorizer(
strip_accents="unicode",
lowercase=True,
max_df=0.9,
min_df=0.01,
ngram_range=(1, 3),
tokenizer=tokenizer,
)
vectorizers[
f"TfidfVectorizer + strip_accents + lowercase + {{min,max}}_df + ngrams + {tokenizer_name}"
] = TfidfVectorizer(
strip_accents="unicode",
lowercase=True,
max_df=0.9,
min_df=0.01,
ngram_range=(1, 3),
tokenizer=tokenizer,
)
2021-12-13 10:09:15.214132: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-13 10:09:15.214214: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[nltk_data] Downloading package stopwords to /home/clement/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/clement/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
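To make the difference between raw counts and TF-IDF weights concrete, here is a tiny illustrative check on a made-up two-document corpus (not part of the project data):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
demo_corpus = ["good food good service", "bad service slow food"]
count_demo = CountVectorizer().fit(demo_corpus)
tfidf_demo = TfidfVectorizer().fit(demo_corpus)
print(count_demo.get_feature_names_out())
print(count_demo.transform(demo_corpus).toarray())           # raw counts
print(tfidf_demo.transform(demo_corpus).toarray().round(2))  # IDF-weighted, L2-normalized
Here "food" and "service" appear in both documents, so TF-IDF gives them lower weights than the document-specific words "good", "bad" and "slow".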
We are going to evaluate the combination of different tokenizers, vectorizers and regression models in order to find the model that best predicts the sentiment of the reviews (as the difference between the Business' average rating and the review rating).
# Regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import (
LinearRegression,
BayesianRidge,
PassiveAggressiveRegressor,
SGDRegressor,
Ridge,
RidgeCV,
Lars,
LarsCV,
Lasso,
LassoCV,
ElasticNet,
ElasticNetCV,
LassoLars,
LassoLarsCV,
OrthogonalMatchingPursuit,
OrthogonalMatchingPursuitCV,
BayesianRidge,
ARDRegression,
HuberRegressor,
TheilSenRegressor,
PassiveAggressiveRegressor,
SGDRegressor,
)
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from lightgbm import LGBMRegressor
if not os.path.exists("../results/regressors_grid_search_results.csv"):
# Define a processing pipeline : vectorizer + regressor
pipe_reg = Pipeline(
[
("vec", CountVectorizer()),
("reg", DummyRegressor()),
]
)
# Define a grid search : vectorizer + regressor + hyperparameters
grid_reg = GridSearchCV(
pipe_reg,
param_grid=dict(
vec=[
CountVectorizer(strip_accents="unicode", lowercase=True),
TfidfVectorizer(strip_accents="unicode", lowercase=True),
],
reg=[
DummyRegressor(),
ElasticNetCV(cv=2),
# TransformedTargetRegressor(
# regressor=ElasticNetCV(),
# transformer=QuantileTransformer(),
# ),
LinearRegression(),
# RidgeCV(cv=2),
# LarsCV(),
# LassoCV(cv=2),
# LassoLars(),
# LassoLarsCV(),
# OrthogonalMatchingPursuit(),
# OrthogonalMatchingPursuitCV(),
# BayesianRidge(),
# ARDRegression(),
# HuberRegressor(),
# TheilSenRegressor(),
# PassiveAggressiveRegressor(),
# SGDRegressor(),
# KernelRidge(),
SVR(),
MLPRegressor(),
# KNeighborsRegressor(),
# DecisionTreeRegressor(),
RandomForestRegressor(),
# GradientBoostingRegressor(),
LGBMRegressor(),
],
vec__max_df=[1.0, 0.99],
vec__min_df=[1, 0.01],
vec__ngram_range=[(1, 1), (1, 2)],
vec__tokenizer=list(tokenizers.values()),
),
cv=2,
# verbose=9,
).fit(X, y)
print(grid_reg.best_estimator_)
print(grid_reg.best_params_)
print(grid_reg.best_score_)
with open("../results/regressors_grid_search_results.csv", "w") as f:
pd.DataFrame(grid_reg.cv_results_).sort_values(
by="rank_test_score",
ascending=True,
).to_csv("../results/regressors_grid_search_results.csv", index=False)
else:
results_reg_df = pd.read_csv("../results/regressors_grid_search_results.csv")
print(
results_reg_df[
[
"param_reg",
"param_vec",
"mean_fit_time",
"mean_test_score",
"rank_test_score",
]
].sort_values(by="rank_test_score", ascending=True)
)
                 param_reg                                          param_vec  \
0       ElasticNetCV(cv=2)  TfidfVectorizer(ngram_range=(1, 2), strip_acce...
1       ElasticNetCV(cv=2)  TfidfVectorizer(ngram_range=(1, 2), strip_acce...
2       ElasticNetCV(cv=2)  TfidfVectorizer(ngram_range=(1, 2), strip_acce...
3       ElasticNetCV(cv=2)  TfidfVectorizer(ngram_range=(1, 2), strip_acce...
4       ElasticNetCV(cv=2)  TfidfVectorizer(ngram_range=(1, 2), strip_acce...
..                     ...                                                ...
555        LGBMRegressor()           CountVectorizer(strip_accents='unicode')
556        LGBMRegressor()           CountVectorizer(strip_accents='unicode')
557        LGBMRegressor()           CountVectorizer(strip_accents='unicode')
558        LGBMRegressor()           CountVectorizer(strip_accents='unicode')
559        LGBMRegressor()           CountVectorizer(strip_accents='unicode')

     mean_fit_time  mean_test_score  rank_test_score
0        91.366752         0.094062                1
1        91.466346         0.094062                1
2       113.303345         0.091955                3
3       113.717107         0.091955                3
4        88.506179         0.084741                5
..             ...              ...              ...
555       0.796290              NaN              556
556       0.397112              NaN              557
557       0.033853              NaN              558
558       1.917799              NaN              559
559       8.848708              NaN              560

[560 rows x 5 columns]
It appears that the best regression model is ElasticNetCV, which outperforms all the other models regardless of the tokenizer and vectorizer.
A surprising result is that the more complex vectorizers and tokenizers don't really improve the results. This might be because the dataset is quite small and the reviews are quite similar.
We are going to evaluate the combination of different tokenizers, vectorizers and classification models in order to find the model that best predicts the sentiment of the reviews (as whether the review rating is higher ("good" review) or lower ("bad" review) than the Business' average rating).
# Classification
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import (
LogisticRegressionCV,
RidgeClassifierCV,
SGDClassifier,
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
if not os.path.exists("../results/classifiers_grid_search_results.csv"):
# Define a processing pipeline : vectorizer + classifier
pipe_cls = Pipeline(
[
("vec", CountVectorizer()),
("cls", DummyClassifier()),
]
)
# Define a grid search : vectorizer + classifier + hyperparameters
grid_cls = GridSearchCV(
pipe_cls,
param_grid=dict(
vec=[
CountVectorizer(strip_accents="unicode", lowercase=True),
TfidfVectorizer(strip_accents="unicode", lowercase=True),
],
cls=[
DummyClassifier(),
RidgeClassifierCV(cv=2),
LogisticRegressionCV(cv=2),
# SGDClassifier(),
SVC(),
KNeighborsClassifier(),
MLPClassifier(),
# DecisionTreeClassifier(),
RandomForestClassifier(),
# GradientBoostingClassifier(),
LGBMClassifier(),
],
vec__max_df=[1.0, 0.99],
vec__min_df=[1, 0.01],
vec__ngram_range=[(1, 1), (1, 2)],
vec__tokenizer=list(tokenizers.values()),
),
cv=2,
# verbose=9,
).fit(X, y_bi)
print(grid_cls.best_estimator_)
print(grid_cls.best_params_)
print(grid_cls.best_score_)
with open("../results/classifiers_grid_search_results.csv", "w") as f:
pd.DataFrame(grid_cls.cv_results_).sort_values(
by="rank_test_score",
ascending=True,
).to_csv("../results/classifiers_grid_search_results.csv", index=False)
else:
results_cls_df = pd.read_csv("../results/classifiers_grid_search_results.csv")
print(
results_cls_df[
[
"param_cls",
"param_vec",
"mean_fit_time",
"mean_test_score",
"rank_test_score",
]
].sort_values(by="rank_test_score", ascending=True)
)
                      param_cls                                  param_vec  \
0                         SVC()  CountVectorizer(strip_accents='unicode')
1                         SVC()  CountVectorizer(strip_accents='unicode')
2    LogisticRegressionCV(cv=2)  CountVectorizer(strip_accents='unicode')
3    LogisticRegressionCV(cv=2)  CountVectorizer(strip_accents='unicode')
4                         SVC()  CountVectorizer(strip_accents='unicode')
..                          ...                                        ...
635            LGBMClassifier()  CountVectorizer(strip_accents='unicode')
636            LGBMClassifier()  CountVectorizer(strip_accents='unicode')
637            LGBMClassifier()  CountVectorizer(strip_accents='unicode')
638            LGBMClassifier()  CountVectorizer(strip_accents='unicode')
639            LGBMClassifier()  CountVectorizer(strip_accents='unicode')

     mean_fit_time  mean_test_score  rank_test_score
0         0.481287         0.607582                1
1         0.466467         0.607582                1
2         0.486332         0.602117                3
3         0.495638         0.602117                3
4         0.418584         0.601434                5
..             ...              ...              ...
635       0.030264              NaN              636
636       9.108561              NaN              637
637       2.095358              NaN              638
638       0.803402              NaN              639
639       0.029353              NaN              640

[640 rows x 5 columns]
It appears that the best classification model is SVC. LogisticRegressionCV, RandomForestClassifier and RidgeClassifierCV have similar results depending on the tokenizer and vectorizer.
Again, the more complex text processors don't necessarily improve the results.
Now that we have found the best Machine Learning models for regression and classification, we are going to compare the results of the different vectorizers and tokenizers. We will use the same dataset and the same regression and classification models.
The goal is to measure the influence of the different vectorizers and tokenizers on the results of the regression and classification models. We also want to evaluate (empirically) how the vectorizers and tokenizers improve the interpretability of the models.
from sklearn.linear_model import ElasticNetCV # Regression
from sklearn.linear_model import RidgeClassifierCV # Classification
from sklearn.metrics import (
# Regression metrics
median_absolute_error,
r2_score,
# Classification metrics
accuracy_score,
precision_score,
recall_score,
f1_score,
roc_auc_score,
plot_confusion_matrix,
plot_roc_curve,
plot_precision_recall_curve,
)
if not os.path.exists(
"../results/regression_vectorisers_results.csv"
) or not os.path.exists("../results/classification_vectorisers_results.csv"):
results_reg = []
results_cls = []
for vectorizer_name, vectorizer in vectorizers.items():
# Vectorization of the corpus
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Tokens distribution
words_count = pd.Series(
X_train_vec.sum(axis=0).tolist()[0],
index=vectorizer.get_feature_names_out(),
)
# Top 20 tokens by occurrences
top_20_count = words_count.sort_values(ascending=False).head(20)
fig = px.bar(
top_20_count,
x=top_20_count.index,
y=top_20_count.values,
labels={"x": "Word", "y": "Count"},
title=f"{vectorizer_name} : Top 20 frequent words in reviews (vocabulary = {len(words_count)} words)",
color=top_20_count.values,
)
fig.show()
# Regression model
print("Regression")
# Train the regression model
reg = ElasticNetCV(random_state=42, n_jobs=-1).fit(X_train_vec, y_train)
# Compute the coefficients
coefs_reg = pd.Series(reg.coef_, index=vectorizer.get_feature_names_out())
# Top 20 tokens by importance (positive and negative)
top_20_coefs_reg = (
coefs_reg.nlargest(10).append(coefs_reg.nsmallest(10)).sort_values()
)
fig = px.bar(
top_20_coefs_reg,
x=top_20_coefs_reg.index,
y=top_20_coefs_reg.values,
labels={"x": "Word", "y": "Count"},
title=f"{vectorizer_name} : Top 20 important words in reviews (vocabulary = {len(words_count)} words)",
color=top_20_coefs_reg.values,
)
fig.show()
# Test the regression model
y_pred_reg = reg.predict(X_test_vec)
fig = px.box(
x=y_test,
y=y_pred_reg,
labels={"x": "Actual", "y": "Predicted"},
title=f"{vectorizer_name} : Actual vs Predicted / R² = {round(r2_score(y_test, y_pred_reg), 3)} / MAE = {round(median_absolute_error(y_test, y_pred_reg), 3)}",
color=y_test,
)
fig.show()
# Keep the results
results_reg.append(
{
"vectorizer": vectorizer_name,
"vocabulary_size": len(words_count),
"r2_score": r2_score(y_test, y_pred_reg),
"median_absolute_error": median_absolute_error(y_test, y_pred_reg),
}
)
print()
print(f"{vectorizer_name}")
print(
f"vocabulary = {len(words_count)} words / R² = {round(r2_score(y_test, y_pred_reg), 3)} / MAE = {round(median_absolute_error(y_test, y_pred_reg), 3)}"
)
print()
# Classification model
print("Classification")
# Train the classification model
cls = RidgeClassifierCV().fit(X_train_vec, y_train_bi)
# Compute the coefficients
coefs_cls = pd.Series(
cls.coef_[0],
index=vectorizer.get_feature_names_out(),
)
# Top 20 tokens by importance (positive and negative)
top_20_coefs_cls = (
coefs_cls.nlargest(10).append(coefs_cls.nsmallest(10)).sort_values()
)
fig = px.bar(
top_20_coefs_cls,
x=top_20_coefs_cls.index,
y=top_20_coefs_cls.values,
labels={"x": "Word", "y": "Count"},
title=f"{vectorizer_name} : Top 20 important words in reviews (vocabulary = {len(words_count)} words)",
color=top_20_coefs_cls.values,
)
fig.show()
# Test the classification model
y_pred_cls = cls.predict(X_test_vec)
plot_confusion_matrix(
estimator=cls,
X=X_test_vec,
y_true=y_test_bi,
)
plt.show()
plot_roc_curve(
estimator=cls,
X=X_test_vec,
y=y_test_bi,
)
plt.show()
plot_precision_recall_curve(
estimator=cls,
X=X_test_vec,
y=y_test_bi,
)
plt.show()
# Keep the results
results_cls.append(
{
"vectorizer": vectorizer_name,
"vocabulary_size": len(words_count),
"accuracy_score": accuracy_score(y_test_bi, y_pred_cls),
"precision_score": precision_score(y_test_bi, y_pred_cls),
"recall_score": recall_score(y_test_bi, y_pred_cls),
"f1_score": f1_score(y_test_bi, y_pred_cls),
"roc_auc_score": roc_auc_score(y_test_bi, y_pred_cls),
}
)
print()
print(f"{vectorizer_name}")
print(
f"vocabulary = {len(words_count)} words / accuracy_score = {round(accuracy_score(y_test_bi, y_pred_cls), 3)} / precision_score = {round(precision_score(y_test_bi, y_pred_cls), 3)} / recall_score = {round(recall_score(y_test_bi, y_pred_cls), 3)} / f1_score = {round(f1_score(y_test_bi, y_pred_cls), 3)} / roc_auc_score = {round(roc_auc_score(y_test_bi, y_pred_cls), 3)}"
)
print()
print(
pd.DataFrame(results_reg).sort_values(
by=["r2_score", "median_absolute_error"],
ascending=[False, True],
)
)
with open("../results/regression_vectorisers_results.csv", "w") as f:
f.write(
pd.DataFrame(results_reg)
.sort_values(
by=["r2_score", "median_absolute_error"],
ascending=[False, True],
)
.to_csv(index=False)
)
print(
pd.DataFrame(results_cls).sort_values(
by=["roc_auc_score", "f1_score"],
ascending=[False, False],
)
)
with open("../results/classification_vectorisers_results.csv", "w") as f:
f.write(
pd.DataFrame(results_cls)
.sort_values(
by=["roc_auc_score", "f1_score"],
ascending=[False, False],
)
.to_csv(index=False)
)
else:
results_reg_df = pd.read_csv("../results/regression_vectorisers_results.csv")
print(
results_reg_df.sort_values(
by=["r2_score", "median_absolute_error"],
ascending=[False, True],
)
)
results_cls_df = pd.read_csv("../results/classification_vectorisers_results.csv")
print(
results_cls_df.sort_values(
by=["roc_auc_score", "f1_score"],
ascending=[False, False],
)
)
                                           vectorizer  vocabulary_size  \
0                                     TfidfVectorizer             5541
1         TfidfVectorizer + strip_accents + lowercase             5523
2                                     CountVectorizer             5541
3         CountVectorizer + strip_accents + lowercase             5523
4   TfidfVectorizer + strip_accents + lowercase + ...             5389
5   CountVectorizer + strip_accents + lowercase + ...             5389
6   CountVectorizer + strip_accents + lowercase + ...              546
7   TfidfVectorizer + strip_accents + lowercase + ...              546
8   CountVectorizer + strip_accents + lowercase + ...              294
9   TfidfVectorizer + strip_accents + lowercase + ...              294
10  CountVectorizer + strip_accents + lowercase + ...              248
11  CountVectorizer + strip_accents + lowercase + ...              286
12  TfidfVectorizer + strip_accents + lowercase + ...              286
13  TfidfVectorizer + strip_accents + lowercase + ...              248
14  CountVectorizer + strip_accents + lowercase + ...              283
15  CountVectorizer + strip_accents + lowercase + ...              291
16  CountVectorizer + strip_accents + lowercase + ...              283
17  TfidfVectorizer + strip_accents + lowercase + ...              283
18  TfidfVectorizer + strip_accents + lowercase + ...              291
19  TfidfVectorizer + strip_accents + lowercase + ...              283

    r2_score  median_absolute_error
0   0.101818               0.480907
1   0.101805               0.481011
2   0.101543               0.490524
3   0.101543               0.490524
4   0.093988               0.491920
5   0.091645               0.478977
6   0.084090               0.477133
7   0.082455               0.480170
8   0.075870               0.483166
9   0.074727               0.477151
10  0.072494               0.490615
11  0.072171               0.488288
12  0.069704               0.482606
13  0.069545               0.493815
14  0.069113               0.490336
15  0.069113               0.490336
16  0.068611               0.486080
17  0.067449               0.487323
18  0.066269               0.496893
19  0.066121               0.495416

                                           vectorizer  vocabulary_size  \
0                                     CountVectorizer             5541
1         CountVectorizer + strip_accents + lowercase             5523
2   CountVectorizer + strip_accents + lowercase + ...             5389
3   TfidfVectorizer + strip_accents + lowercase + ...              294
4   CountVectorizer + strip_accents + lowercase + ...              294
5   CountVectorizer + strip_accents + lowercase + ...              286
6   TfidfVectorizer + strip_accents + lowercase + ...              286
7   TfidfVectorizer + strip_accents + lowercase + ...              546
8   CountVectorizer + strip_accents + lowercase + ...              248
9   TfidfVectorizer + strip_accents + lowercase + ...              248
10  CountVectorizer + strip_accents + lowercase + ...              283
11       TfidfVectorizer + strip_accents + lowercase              5523
12  TfidfVectorizer + strip_accents + lowercase + ...              283
13  CountVectorizer + strip_accents + lowercase + ...              283
14                                    TfidfVectorizer             5541
15  CountVectorizer + strip_accents + lowercase + ...              546
16  CountVectorizer + strip_accents + lowercase + ...              291
17  TfidfVectorizer + strip_accents + lowercase + ...              291
18  TfidfVectorizer + strip_accents + lowercase + ...              283
19  TfidfVectorizer + strip_accents + lowercase + ...             5389

    accuracy_score  precision_score  recall_score  f1_score  roc_auc_score
0         0.628415          0.661702      0.733491  0.695749       0.608628
1         0.627049          0.660297      0.733491  0.694972       0.607005
2         0.620219          0.652720      0.735849  0.691796       0.598444
3         0.635246          0.642468      0.834906  0.726154       0.597648
4         0.614754          0.646694      0.738208  0.689427       0.591506
5         0.610656          0.643892      0.733491  0.685777       0.587525
6         0.622951          0.636531      0.813679  0.714286       0.587034
7         0.621585          0.635359      0.813679  0.713547       0.585411
8         0.605191          0.634731      0.750000  0.687568       0.577922
9         0.613388          0.628415      0.813679  0.709147       0.575671
10        0.598361          0.635417      0.719340  0.674779       0.575579
11        0.622951          0.621711      0.891509  0.732558       0.572378
12        0.610656          0.625225      0.818396  0.708887       0.571536
13        0.592896          0.631250      0.714623  0.670354       0.569974
14        0.620219          0.619672      0.891509  0.731141       0.569131
15        0.588798          0.631692      0.695755  0.662177       0.568657
16        0.588798          0.628931      0.707547  0.665927       0.566436
17        0.603825          0.620939      0.811321  0.703476       0.564751
18        0.601093          0.618280      0.813679  0.702648       0.561060
19        0.606557          0.606583      0.912736  0.728814       0.548900
As we can see, the best text processing pipelines for both regression and classification are the simplest ones (basic CountVectorizer and TfidfVectorizer). Adding complexity to the text processing doesn't improve the scores, but it greatly helps reduce the vocabulary size and makes it easier to interpret the importance of each word when estimating the sentiment of a review.
With these results, we are able to visualize "negative words" (like "bad", "worst", "disappointing", "rude", ...) and "positive words" (like "amazing", "favorite", "delicious", "wonderful", ...) in the reviews.
We are going to test the best regression and classification models on a random review from the test dataset, and compare the prediction with a pre-trained model.
from sklearn.linear_model import ElasticNetCV, RidgeClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
# Define the vectorizer
vectorizer = vectorizers[
"TfidfVectorizer + strip_accents + lowercase + {min,max}_df + ngrams + SpaCy"
]
# Vectorization of the corpus
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train the regression model
reg = ElasticNetCV(random_state=42, n_jobs=-1).fit(X_train_vec, y_train)
# Test the regression model
y_pred_reg = reg.predict(X_test_vec)
# Train the classification model
cls = RidgeClassifierCV().fit(X_train_vec, y_train_bi)
# Test the classification model
y_pred_cls = cls.predict(X_test_vec)
# Pick a random review for testing
rand_index = random.randrange(len(X_test))  # random index within the test set
review_text = X_test.values[rand_index]
true_reg = y_test.values[rand_index]
pred_reg = y_pred_reg[rand_index]
true_cls = y_test_bi.values[rand_index]
pred_cls = y_pred_cls[rand_index]
# Display the review text
print(f'Review : "{review_text}"')
# Display the predicted rating and sentiment
if abs(true_reg - pred_reg) < 1:
print(
f"✅ Predicted review rating correct : {pred_reg} (pred) vs. {true_reg} (true)"
)
else:
print(
f"❌ Predicted review rating incorrect : {pred_reg} (pred) vs. {true_reg} (true)"
)
if true_cls == pred_cls:
print(
f"✅ Predicted review sentiment correct : {pred_cls} (pred) vs. {true_cls} (true)"
)
else:
print(
f"❌ Predicted review sentiment incorrect : {pred_cls} (pred) vs. {true_cls} (true)"
)
Review : "I don't understand why the low reviews. I found the place to be fabulous! Great customer service and good food. I decided to be adventurous and try the..." ❌ Predicted review rating incorrect : 0.16567202711363738 (pred) vs. 1.5 (true) ✅ Predicted review sentiment correct : 1 (pred) vs. 1 (true)
We can see that our model is not very good, but is already able to make better predictions than a random guess, despite the very small training dataset with a lot of missing information due to truncated text.
Let's see how a pre-trained model performs on the same review.
import transformers
import shap
# print the JS visualization code to the notebook
shap.initjs()
# load a transformers pipeline model
model = transformers.pipeline("sentiment-analysis", return_all_scores=True)
result = model([review_text])
# explain the model on two sample inputs
explainer = shap.Explainer(model)
shap_values = explainer([review_text])
pred = int(
(result[0][0]["label"] == "POSITIVE" and result[0][0]["score"] > 0.5)
or (result[0][0]["label"] == "NEGATIVE" and result[0][0]["score"] < 0.5)
)
# Display the predicted rating and sentiment
if true_cls == pred:
print(f"✅ Predicted review sentiment correct : {pred} (pred) vs. {true_cls} (true)")
else:
print(
f"❌ Predicted review sentiment incorrect : {pred} (pred) vs. {true_cls} (true)"
)
# visualize the first prediction's explanation for the POSITIVE output class
shap.plots.text(shap_values[0, :, "POSITIVE"])
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
✅ Predicted review sentiment correct : 1 (pred) vs. 1 (true)
We can see that the pre-trained model is able to make better predictions than our model, but can also be tricked by the poor data.
We want to be able to identify the topics of the reviews in order to understand why clients were satisfied or not.
We are going to use two different techniques to identify the topics of the reviews :
- LSA: Latent Semantic Analysis (LSA) is an unsupervised dimensionality reduction method that uses a matrix decomposition (truncated SVD) to project the documents into a lower-dimensional latent space.
- LDA: Latent Dirichlet Allocation (LDA) is a probabilistic graphical model that infers the topic distribution of each document.
We are going to use the Latent Semantic Analysis (LSA) technique to identify the topics of the reviews.
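For reference, the decomposition behind LSA can be written with the standard truncated SVD notation (this formula is background, not taken from the notebook code):

$$X \approx U_k \Sigma_k V_k^\top$$

where $X$ is the document-term matrix produced by the vectorizer, $k$ is the number of topics, the rows of $U_k \Sigma_k$ are the documents expressed in the latent topic space, and the rows of $V_k^\top$ give the weight of each term in each topic.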
We are going to compare the results of the different vectorizers and tokenizers.
First, let's compare the topics on the API dataset.
from sklearn.decomposition import TruncatedSVD
# Test different vectorizers
for vectorizer_name in [
"TfidfVectorizer + strip_accents + lowercase",
"TfidfVectorizer + strip_accents + lowercase + stop_words",
"TfidfVectorizer + strip_accents + lowercase + {min,max}_df + ngrams + stopwords",
"TfidfVectorizer + strip_accents + lowercase + {min,max}_df + ngrams + PorterStemmer",
"TfidfVectorizer + strip_accents + lowercase + {min,max}_df + ngrams + WordNetLemmatizer",
"TfidfVectorizer + strip_accents + lowercase + {min,max}_df + ngrams + SpaCy",
]:
# Vectorize the corpus
vectorizer = vectorizers[vectorizer_name]
X_vec = vectorizer.fit_transform(X)
# Project the documents into the latent space (10 topics)
lsa = TruncatedSVD(n_components=10, random_state=42)
X_lsa = lsa.fit_transform(X_vec)
# Plot the top words of each topic
viz_helpers.plot_top_words(
model=lsa,
feature_names=vectorizer.get_feature_names_out(),
n_top_words=10,
n_topics=10,
title=f"{vectorizer_name} : LSA",
)
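plot_top_words is a project helper in ../src/visualization/helpers.py. For reference, a minimal sketch of how the top words per component can be extracted from a fitted scikit-learn decomposition model (TruncatedSVD or LatentDirichletAllocation); this is illustrative, not the actual helper:
import pandas as pd
def top_words_per_topic(model, feature_names, n_top_words=10):
    """Illustrative: return the n_top_words highest-weighted terms for each component."""
    rows = []
    for topic_idx, weights in enumerate(model.components_):
        top = pd.Series(weights, index=feature_names).nlargest(n_top_words)
        rows.append({"topic": topic_idx, "top_words": ", ".join(top.index)})
    return pd.DataFrame(rows)
# Example: top_words_per_topic(lsa, vectorizer.get_feature_names_out())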
As we can see, the more complex the vectorizers and tokenizers, the more meaningful the topics.
Here, we can identify the following topics :
Now let's observe the topics on the Academic dataset.
from sklearn.decomposition import TruncatedSVD
# Vectorize the corpus
vectorizer = vectorizers[
"TfidfVectorizer + strip_accents + lowercase + {min,max}_df + ngrams + SpaCy"
]
X_vec = vectorizer.fit_transform(X_academic)
# Project the documents into the latent space (10 topics)
lsa = TruncatedSVD(n_components=10, random_state=42)
X_lsa = lsa.fit_transform(X_vec)
# Plot the top words of each topic
viz_helpers.plot_top_words(
model=lsa,
feature_names=vectorizer.get_feature_names_out(),
n_top_words=10,
n_topics=10,
title=f"{vectorizer_name} : LSA",
)
As we can see, the topics seem well defined.
Here, we can identify the following topics :
We are now going to use the Latent Dirichlet Allocation (LDA) technique to identify the topics of the reviews.
We are going to use the most complex vectorizer and tokenizer, in order to have really meaningful topics.
This technique has a hyperparameter that defines the number of topics to be identified, so we want to find the best number of topics. For this, we first compute the perplexity and coherence of the model for different numbers of topics.
First, let's run the LDA model on the API dataset.
from gensim.models import LdaModel, CoherenceModel, TfidfModel
from gensim.corpora import Dictionary
# Build the tokenized corpus
docs = X.map(tokenizers["SpaCy"])
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]
# We already know the best number of topics
num_topics = 35
# If we don't know it, we can use the following search to find it
if not num_topics > 0:
# Compute the perplexity and coherence scores of the LDA model for different numbers of topics
results_lda = []
for num_topics in range(2, 80, 5):
# Build the LDA model
lda = LdaModel(
corpus=corpus_tfidf,
id2word=dictionary,
num_topics=num_topics,
per_word_topics=True,
passes=10,
random_state=42,
)
# Compute the scores
results = {
"num_topics": num_topics,
"perplexity": lda.log_perplexity(
corpus
), # Compute Perplexity (lower is better)
"coherence": CoherenceModel(
lda, texts=docs
).get_coherence(), # Compute Coherence Score (higher is better)
}
results_lda.append(results)
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter( # Plot perplexity
x=pd.DataFrame(results_lda)["num_topics"],
y=pd.DataFrame(results_lda)["perplexity"],
name="Perplexity",
mode="lines",
),
secondary_y=False,
)
fig.add_trace(
go.Scatter( # Plot coherence
x=pd.DataFrame(results_lda)["num_topics"],
y=pd.DataFrame(results_lda)["coherence"],
name="Coherence",
mode="lines",
),
secondary_y=True,
)
# Add figure title
fig.update_layout(
title_text="LDA Coherence and Perplexity",
xaxis_title="Number of Topics",
yaxis_title="Perplexity",
yaxis2_title="Coherence",
)
fig.show()
We can see that the optimal number of topics seems to be 35. With more than 60 topics, we see artifacts because we approach the maximum number of topics the corpus can support. Now let's observe the topics.
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
lda = LdaModel(
corpus=corpus_tfidf,
id2word=dictionary,
num_topics=num_topics,
per_word_topics=True,
passes=10,
random_state=42,
)
pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
With this technique, we can identify the following topics :
This technique is quite efficient and helps us to identify the topics of the reviews.
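Since the end goal is to group bad reviews by subject, each review can then be assigned its dominant topic from the fitted model. Below is a minimal sketch, assuming the lda, dictionary and docs objects built above (the topic labels themselves would still need to be named by hand):
# Minimal sketch: find the dominant LDA topic of one tokenized review.
# Assumes `lda`, `dictionary` and `docs` from the cells above.
def dominant_topic(tokens):
    bow = dictionary.doc2bow(tokens)
    topic_probs = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(topic_probs, key=lambda pair: pair[1])  # (topic_id, probability)

topic_id, prob = dominant_topic(docs.iloc[0])
print(f"First review -> topic {topic_id} (p={prob:.2f})")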
Now let's compare with the Academic dataset.
from gensim.models import LdaModel, CoherenceModel, TfidfModel
from gensim.corpora import Dictionary
# Build the tokenized corpus
docs_academic = X_academic.map(tokenizers["SpaCy"])
dictionary_academic = Dictionary(docs_academic)
corpus_academic = [dictionary_academic.doc2bow(doc) for doc in docs_academic]
tfidf_academic = TfidfModel(corpus_academic, normalize=True)
corpus_tfidf_academic = tfidf_academic[corpus_academic]
# We already know the best number of topics
num_topics = 10
# If we don't know it, we can use the following search to find it
if not num_topics > 0:
# Compute the perplexity and coherence scores of the LDA model for different numbers of topics
results_lda = []
for num_topics in range(2, 80, 5):
# Build the LDA model
lda = LdaModel(
corpus=corpus_tfidf_academic,
id2word=dictionary_academic,
num_topics=num_topics,
per_word_topics=True,
passes=10,
random_state=42,
)
# Compute the scores
results = {
"num_topics": num_topics,
"perplexity": lda.log_perplexity(
corpus_academic
), # Compute Perplexity (lower is better)
"coherence": CoherenceModel(
lda, texts=docs_academic
).get_coherence(), # Compute Coherence Score (higher is better)
}
results_lda.append(results)
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter( # Plot perplexity
x=pd.DataFrame(results_lda)["num_topics"],
y=pd.DataFrame(results_lda)["perplexity"],
name="Perplexity",
mode="lines",
),
secondary_y=False,
)
fig.add_trace(
go.Scatter( # Plot coherence
x=pd.DataFrame(results_lda)["num_topics"],
y=pd.DataFrame(results_lda)["coherence"],
name="Coherence",
mode="lines",
),
secondary_y=True,
)
# Add figure title
fig.update_layout(
title_text="LDA Coherence and Perplexity",
xaxis_title="Number of Topics",
yaxis_title="Perplexity",
yaxis2_title="Coherence",
)
fig.show()
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
lda_academic = LdaModel(
corpus=corpus_tfidf_academic,
id2word=dictionary_academic,
num_topics=num_topics,
per_word_topics=True,
passes=10,
random_state=42,
)
pyLDAvis.gensim_models.prepare(lda_academic, corpus_tfidf_academic, dictionary_academic)
Here, we can identify the following topics :
This technique is quite efficient and helps us to identify the topics of the reviews.
The bag-of-words representation that we have used until now doesn't capture the similarity between words. To do so, we can use word embeddings, which represent words as vectors learned from the contexts in which they appear, so that words used in similar contexts (e.g. synonyms) get similar representations.
Let's use the Word2Vec
model to represent words as vectors.
from gensim.models.word2vec import Word2Vec
word2vec = Word2Vec(docs)
word2vec.wv.most_similar(["burger"], topn=10)
[('french', 0.9996853470802307), ('good', 0.9996733069419861), ('come', 0.999671459197998), ('area', 0.9996638894081116), ('order', 0.9996635913848877), ('meat', 0.999660313129425), ('quality', 0.9996575713157654), ('cheese', 0.999656081199646), ('get', 0.999653160572052), ('look', 0.9996511340141296)]
We look for the words most similar to "burger". Our model is not very good at finding similar words, but food-related terms such as "meat" and "cheese" do appear among the nearest neighbours.
from gensim.models.word2vec import Word2Vec
word2vec_academic = Word2Vec(docs_academic)
word2vec_academic.wv.most_similar(["burger"], topn=10)
[('wing', 0.9707653522491455), ('sandwich', 0.9501225352287292), ('steak', 0.9466947317123413), ('taco', 0.9340856671333313), ('salad', 0.9304675459861755), ('sub', 0.9254314303398132), ('burrito', 0.9245769381523132), ('oyster', 0.9232752919197083), ('soup', 0.9229745864868164), ('bite', 0.92198246717453)]
We can see that, with the larger Academic dataset, the most similar words are much more relevant: they are almost all food items.
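Beyond the top-10 neighbours, we can also check the cosine similarity of chosen word pairs directly; the pairs below are arbitrary examples, assumed to be in the model's vocabulary:
# Sketch: pairwise cosine similarities on the academic Word2Vec model.
# The word pairs are arbitrary examples, assumed to be in the vocabulary.
for w1, w2 in [("burger", "sandwich"), ("burger", "salad"), ("burger", "waitress")]:
    if w1 in word2vec_academic.wv and w2 in word2vec_academic.wv:
        print(f"similarity({w1}, {w2}) = {word2vec_academic.wv.similarity(w1, w2):.3f}")
Food-related pairs should score noticeably higher than unrelated ones.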
from sklearn.manifold import TSNE
X_w2v = [word2vec_academic.wv[x] for x in word2vec_academic.wv.key_to_index.keys()]
X_tsne = TSNE(random_state=42, n_jobs=-1).fit_transform(X_w2v)
fig = px.scatter(
x=X_tsne[:, 0],
y=X_tsne[:, 1],
text=word2vec_academic.wv.key_to_index.keys(),
labels={"x": "Component 1", "y": "Component 2"},
title="Word2Vec TSNE",
)
fig.show()
If we zoom in on a region of the plot, we can see that similar words are close to each other.
Let's use the FastText
model to represent words as vectors.
from gensim.models.fasttext import FastText
fasttext = FastText(docs)
fasttext.wv.most_similar(["burger"], topn=10)
[('diner', 0.999984622001648), ('typically', 0.9999839663505554), ('water', 0.9999837279319763), ('butter', 0.999983549118042), ('order', 0.9999834299087524), ('specifically', 0.999983012676239), ('fantastic', 0.9999829530715942), ('chance', 0.9999828934669495), ('lovely', 0.9999828338623047), ('consider', 0.9999827742576599)]
Again, we look for the words most similar to "burger". Trained on the small API dataset, the model is not very good at finding similar words.
from gensim.models.fasttext import FastText
fasttext_academic = FastText(docs_academic)
fasttext_academic.wv.most_similar(["burger"], topn=10)
[('hamburger', 0.9774997234344482), ('sandwich', 0.9691644906997681), ('salad', 0.9616059064865112), ('cheeseburger', 0.9545460343360901), ('tacos', 0.9522039890289307), ('sandwhich', 0.950270414352417), ('pink', 0.9481170177459717), ('steak', 0.9480866193771362), ('meatloaf', 0.9427076578140259), ('grill', 0.9424874186515808)]
Again, the most similar words are much more relevant with the larger Academic dataset.
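One advantage of FastText over Word2Vec is that word vectors are built from character n-grams, so the model can embed misspelled or out-of-vocabulary words (note the misspelling "sandwhich" among the neighbours above). A minimal sketch, assuming the fasttext_academic model from the previous cell; the misspellings are deliberate examples:
# Sketch: FastText handles out-of-vocabulary / misspelled words via character n-grams.
# Assumes `fasttext_academic` from the previous cell; the last two words are deliberate misspellings.
for word in ["burger", "burgr", "cheesburger"]:
    in_vocab = word in fasttext_academic.wv.key_to_index
    neighbours = [w for w, _ in fasttext_academic.wv.most_similar(word, topn=3)]
    print(f"{word} (in vocabulary: {in_vocab}) -> nearest: {neighbours}")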
from sklearn.manifold import TSNE
X_ft = [fasttext_academic.wv[x] for x in fasttext_academic.wv.key_to_index.keys()]
X_tsne = TSNE(random_state=42, n_jobs=-1).fit_transform(X_ft)
fig = px.scatter(
x=X_tsne[:, 0],
y=X_tsne[:, 1],
text=fasttext_academic.wv.key_to_index.keys(),
labels={"x": "Component 1", "y": "Component 2"},
title="FastText TSNE",
)
fig.show()
Again, if we zoom in on a region of the plot, we can see that similar words are close to each other.