Avis Restau : improve the AI product of your start-up

Context

"Avis Restau" is a start-up who's goal is to connect restaurants and customers. Customers will be able to post photos and reviews of the restaurants they have visited.

The goal here is to identify the topics of bad customer reviews and to label photos as indoor or outdoor, food or drink, ...

Load project modules

The helper functions and project-specific code are placed in ../src/.
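A minimal sketch of how the notebook makes ../src/ importable (the commented module names are placeholders, not the actual project modules) :

```python
import sys
from pathlib import Path

# Make the project sources in ../src/ importable from this notebook.
SRC_DIR = (Path("..") / "src").resolve()
sys.path.insert(0, str(SRC_DIR))

# Hypothetical examples of project imports; the real module names live in ../src/.
# from data import load_reviews
# from plots import plot_rating_distribution
```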

We will use the Python programming language, and present the code and results here in this JupyterLab notebook.

We will use the usual libraries for data exploration, modeling and visualisation :
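For reference, a typical import cell for this stack could look like the following (the exact set of libraries used in the project may differ) :

```python
import numpy as np                 # numerical computing
import pandas as pd                # tabular data exploration
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical visualisation
```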

We will also use libraries specific to the goals of this project :
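As a hedged example, the project-specific imports could look like this (assuming NLTK, Gensim and scikit-learn, which match the techniques used below) :

```python
import nltk                        # tokenization, stop words, lemmatization
import gensim                      # Word2Vec, FastText, LDA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
```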

Yelp's API dataset

We will use Yelp's GraphQL API to get the data. We will load the Reviews (~3 reviews per restaurant) and Photos (1 photo per restaurant) of 1000 restaurants from 5 locations (200 restaurants per location).

Download the dataset to CSV

We download the dataset from the Yelp GraphQL API.
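A minimal sketch of the download, assuming a valid Yelp Fusion API key; the GraphQL field names follow Yelp's published schema but should be checked against the current API, and the output file name is illustrative :

```python
import requests
import pandas as pd

API_KEY = "YOUR_YELP_API_KEY"  # assumption: a valid Yelp Fusion API key
ENDPOINT = "https://api.yelp.com/v3/graphql"

# Field names follow Yelp's GraphQL schema; verify against the current docs.
QUERY = """
query Restaurants($location: String!, $limit: Int!, $offset: Int!) {
  search(term: "restaurants", location: $location, limit: $limit, offset: $offset) {
    business {
      alias
      name
      rating
      price
      location { city country postal_code }
      coordinates { latitude longitude }
      reviews { text rating }
      photos
    }
  }
}
"""

def fetch_page(location, limit=50, offset=0):
    """Fetch one page of businesses for a location from the Yelp GraphQL API."""
    response = requests.post(
        ENDPOINT,
        json={"query": QUERY,
              "variables": {"location": location, "limit": limit, "offset": offset}},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    response.raise_for_status()
    return response.json()["data"]["search"]["business"]

# 200 restaurants per location, fetched in pages of 50 (Yelp's per-request maximum).
businesses = []
for location in ["Paris", "New York", "Tokyo", "Rio de Janeiro", "Sydney"]:
    for offset in range(0, 200, 50):
        businesses.extend(fetch_page(location, limit=50, offset=offset))

# Flatten the nested response and save it (file name is illustrative).
pd.json_normalize(businesses).to_csv("businesses.csv", index=False)
```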

Load the dataset from CSV
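A sketch of the loading step, assuming the download was flattened into one CSV file per DataFrame (file names are illustrative) :

```python
import pandas as pd

# File names are illustrative and match the download step above.
df_businesses = pd.read_csv("businesses.csv")
df_reviews = pd.read_csv("reviews.csv")
df_photos = pd.read_csv("photos.csv")
```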

Exploratory Data Analysis

We will just display a few statistics about each DataFrame here.

Businesses

The dataset is composed of 1000 businesses from 5 locations (Paris, New York, Tokyo, Rio de Janeiro and Sydney). Each business has a unique ID, a name, a category, an average rating, a price category, a city, a state, a country, a postal code, a latitude and a longitude.

We can see that French and Japanese restaurants have higher review ratings, but Japanese restaurants are less expensive than French ones.

There are no empty values or outliers.

Overall, pricey restaurants tend to have lower review ratings : people are more demanding when they spend more money. The correlations confirm that French restaurants are overpriced (high positive correlation with price), while Japanese restaurants are cheap (high negative correlation with price). Ramen restaurants seem to be the best value for money (in the top 10 ratings and the bottom 10 prices).

Reviews

The dataset is composed of 2928 reviews from 1000 businesses (~3 reviews per business). Each review has a business alias, a text and a rating.

Reviews are truncated to 160 characters (~30 words per review). This can be a problem when the part of the text that explains the rating has been cut off.

We can see that review ratings are highly skewed towards the top of the scale : more than half of the reviews are 5 stars.

There are no empty values or outliers.

We can see that in our Reviews dataset, ratings are a bit higher than the average ratings from the Businesses dataset.

For the rest of this analysis, we will consider the difference between a review's rating and the average rating of the business. A review is considered positive if its rating is higher than the business' average rating, and negative if its rating is lower than or equal to the average.
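A sketch of how these two target variables can be built with pandas (column names are assumptions about the CSV layout) :

```python
# Attach each business' average rating to its reviews (column names assumed).
df = df_reviews.merge(
    df_businesses[["alias", "rating"]].rename(columns={"rating": "avg_rating"}),
    left_on="business_alias", right_on="alias",
)

# Regression target: how far the review deviates from the business average.
df["rating_diff"] = df["rating"] - df["avg_rating"]

# Classification target: positive if strictly above the average, else negative.
df["is_positive"] = df["rating_diff"] > 0
```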

Photos

The dataset is composed of 999 photos from 1000 businesses (~1 photo per business). Each photo has a business alias and a photo URL.

We will study the photos more deeply in the dedicated section.

Academic dataset

We will also use the Academic dataset provided by Yelp (https://www.yelp.com/dataset), composed of 8,635,403 reviews, 160,585 businesses and 200,000 pictures from 8 metropolitan areas.

We are only going to use the reviews and photos data. Since the dataset is huge, we are going to sample a small subset of the data.

Load the dataset from JSON
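The Academic dataset ships as JSON Lines files; a minimal sampling sketch (the file name matches Yelp's archive, the sampling fraction is illustrative) :

```python
import pandas as pd

# Read the huge review file in chunks and keep a small random sample.
chunks = pd.read_json(
    "yelp_academic_dataset_review.json", lines=True, chunksize=100_000
)
df_reviews_acad = pd.concat(
    chunk.sample(frac=0.01, random_state=42) for chunk in chunks
)
```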

Exploratory Data Analysis

We will just display a few statistics about each DataFrame here.

Most users (~98%) write only one review.

Most businesses (~87%) have only one review.

The stars distribution is quite similar to the API dataset: half of the reviews are 5 stars. But there are more 3- and 2-star reviews and many more 1-star reviews. This shows that at least one of the two datasets is not representative of reality.

We can see that the Academic dataset is a bit different from the API dataset :

Natural Language Processing (NLP) : analysis of the reviews texts

In this section, we are going to try to identify the topics of bad reviews. To achieve this, we are going to represent the reviews as a bag of words (BoW) and use a topic modelling algorithm to identify the topics of the reviews. We are going to compare different text processing techniques and the results of several topic modelling algorithms.

Build the dataset and define the models

We are going to use a target variable for regression (rating difference) and a target variable for classification (review sentiment).

We are going to compare different combinations of text tokenizers and vectorizers (a sketch of one combination follows the definitions below). The goal here is to eliminate the noise in the reviews and to reduce the dimensionality of the dataset, while trying to keep the meaning of the reviews.

Tokenizer : transforms a string of text into a list of tokens, with more or less processing of the text

Vectorizer : transforms a list of tokens into a vector of features, with more or less processing of the tokens
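As a sketch of one combination, assuming NLTK for tokenization and scikit-learn for vectorization :

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words, lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

# The custom tokenizer plugs directly into the vectorizer.
vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(df["text"])  # df: the reviews DataFrame built above
```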

Regression models evaluation

We are going to evaluate combinations of different tokenizers, vectorizers and regression models in order to find the model that best predicts the sentiment of the reviews (measured as the difference between the review rating and the business' average rating).
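One cell of the comparison grid could look like the following sketch (the scoring metric and cross-validation settings are illustrative) :

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# One (vectorizer, model) cell of the grid; the full comparison loops over
# every tokenizer x vectorizer x model combination.
pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None),
    ElasticNetCV(cv=5),
)
scores = cross_val_score(pipeline, df["text"], df["rating_diff"],
                         cv=5, scoring="neg_mean_squared_error")
print(scores.mean())
```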

It appears that the best regression model is the ElasticNetCV model, which outperforms all other models regardless of the tokenizer and vectorizer.

A surprising result is that the more complex vectorizers and tokenizers don't really improve the results. This might be because the dataset is quite small and the reviews are quite similar.

Classification models evaluation

We are going to evaluate combinations of different tokenizers, vectorizers and classification models in order to find the model that best predicts the sentiment of the reviews (whether the review rating is higher ("good" review) or lower than or equal to ("bad" review) the business' average rating).

It appears that the best classification model is the SVC model. The LogisticRegressionCV, RandomForestClassifier and RidgeClassifierCV models have similar results depending on the tokenizer and vectorizer.

Again, the more complex text processors don't necessarily improve the results.

Vectorizer and tokenizer models comparison

Now that we have found the best Machine Learning models for regression and classification, we are going to compare the results of the different vectorizers and tokenizers. We will use the same dataset and the same regression and classification models.

The goal is to measure the influence of the different vectorizers and tokenizers on the results of the regression and classification models. We also want to evaluate (empirically) how the vectorizers and tokenizers improve the interpretability of the models.

As we can see, the best text processing models for both regression and classification are the simplest ones (basic CountVectorizer and TfidfVectorizer). Adding complexity to the text processing models doesn't improve the results, but it greatly helps reduce the vocabulary size and makes it easier to interpret how much each word contributes to the estimated sentiment of a review.

With these results, we are able to visualize "negative words" (like "bad", "worst", "disappointing", "rude", ...) and "positive words" (like "amazing", "favorite", "delicious", "wonderful", ...) in the reviews.

Test our best models

We are going to test the best regression and classification models on a random review from the test dataset, and compare the prediction with a pre-trained model.

We can see that our model is not very good, but it already makes better predictions than a random guess, despite the very small training dataset and the information lost to truncated text.

Let's see how a pre-trained model performs on the same review.
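A sketch of this comparison, assuming a Hugging Face sentiment pipeline (the checkpoint is the library's default English sentiment model, not one specific to this project) :

```python
from transformers import pipeline

# Default English sentiment checkpoint (DistilBERT fine-tuned on SST-2).
sentiment = pipeline("sentiment-analysis")

review_text = df.sample(1, random_state=0)["text"].iloc[0]
print(sentiment(review_text))  # returns [{'label': 'POSITIVE'|'NEGATIVE', 'score': ...}]
```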

We can see that the pre-trained model makes better predictions than our model, but it can also be misled by the poor data.

Topic Modeling

We want to be able to identify the topics of the reviews in order to understand why clients were satisfied or not.

We are going to use two different techniques to identify the topics of the reviews :

Latent Semantic Analysis (LSA)

We are going to use the Latent Semantic Analysis (LSA) technique to identify the topics of the reviews.

We are going to compare the results of the different vectorizers and tokenizers.
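LSA amounts to a truncated SVD of the document-term matrix; a minimal sketch reusing the TF-IDF matrix built above (the number of topics is illustrative) :

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=10, random_state=42)
svd.fit(X)  # X: the TF-IDF matrix from the vectorizer above

# Show the top words of each latent topic.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:8]
    print(f"Topic {i}:", ", ".join(terms[j] for j in top))
```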

API dataset

First, let's compare the topics on the API dataset.

As we can see, the more complex the vectorizers and tokenizers, the more meaningful the topics.

Here, we can identify the following topics :

Academic dataset

Now let's observe the topics on the Academic dataset.

As we can see, the topics seem well defined.

Here, we can identify the following topics :

Latent Dirichlet Allocation (LDA)

We are now going to use the Latent Dirichlet Allocation (LDA) technique to identify the topics of the reviews.

We are going to use the most complex vectorizer and tokenizer, in order to obtain the most meaningful topics.

This technique has a hyperparameter that defines the number of topics to be identified, so we want to find the best number of topics. For this, we are first going to compute the perplexity and coherence of the model for different numbers of topics.
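A sketch of that search with Gensim (the ranges and hyperparameters are illustrative) :

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Tokenized reviews, reusing the tokenizer defined earlier.
docs = [tokenize(text) for text in df["text"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

for num_topics in range(5, 70, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=42)
    perplexity = lda.log_perplexity(corpus)
    coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(num_topics, perplexity, coherence)
```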

API dataset

First, let's run the LDA model on the API dataset.

We can see that the optimal number of topics seems to be 35. Beyond 60 topics, artifacts appear as we approach the maximum number of topics the corpus can support. Now let's observe the topics.

With this technique, we can identify the following topics :

This technique is quite efficient and helps us to identify the topics of the reviews.

Academic dataset

Now let's compare with the Academic dataset.

Here, we can identify the following topics :

Again, this technique is quite efficient and helps us identify the topics of the reviews.

Word Embedding

The bag-of-words representation that we have used until now doesn't capture the similarity between words. Word embeddings address this by representing each word as a vector learned from its context, so that synonyms have similar representations.

Word2Vec

Let's use the Word2Vec model to represent words as vectors.
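A minimal training sketch with Gensim, reusing the tokenized reviews (hyperparameters are illustrative) :

```python
from gensim.models import Word2Vec

# docs: the tokenized reviews built in the LDA step above.
w2v = Word2Vec(sentences=docs, vector_size=100, window=5,
               min_count=2, workers=4, epochs=20)

# Nearest neighbours of "burger" in the embedding space.
print(w2v.wv.most_similar("burger", topn=5))
```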

We want to find a word that would be similar to the word "burger". Our model is not very good at finding similar words, but we can see that the word "burger" is similar to the words "meat", "cheese", "fry".

We can see that the similar words are more relevant with a better dataset.

If we zoom in on a zone, we can see that similar words are close to each other.
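The zoomable map can be produced by projecting the embeddings to 2-D; a sketch with scikit-learn's t-SNE (any projection such as PCA or UMAP would work similarly) :

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = w2v.wv.index_to_key[:300]        # 300 most frequent words
coords = TSNE(n_components=2, random_state=42,
              perplexity=30).fit_transform(w2v.wv[words])

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.show()
```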

FastText

Let's use the FastText model to represent words as vectors.
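The training sketch mirrors Word2Vec; FastText additionally builds vectors from character n-grams, so it can embed out-of-vocabulary or misspelled words :

```python
from gensim.models import FastText

ft = FastText(sentences=docs, vector_size=100, window=5,
              min_count=2, epochs=20)

print(ft.wv.most_similar("burger", topn=5))
# Subword information also gives a vector for unseen spellings:
print(ft.wv.most_similar("burgerz", topn=5))
```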

We want to find a word that would be similar to the word "burger". Our model is not very good at finding similar words.

We can see that the similar words are more relevant with a better dataset.

Again, if we zoom in on a zone, we can see that similar words are close to each other.