Comparing Azure Tools for Sentiment Analysis

Imagine you are the head of Public Relations for a famous company. You want to prevent all the "bad buzz" that could affect the image of your company. To achieve this, you would need to be able to detect NEGATIVE messages on the Internet in order to act before the word spreads.

Sentiment Analysis is one of the most classic NLP problems : given a piece of text, would you say it is rather POSITIVE or NEGATIVE ?

Sentiment Analysis

It seems almost natural to a human mind to classify simple sentences :

"I love my friends because they make me happy everyday!" 👍

"My dog died today, I'm so sad..." 👎

But not all sentences are so "simple".

"OMG this is soooooooo sick ! Shut up and take my money XD" 🤔

There are multiple challenges that can make this task much more difficult :

language : the given sentence could be in any language, potentially one you don't understand
language quality : even if you know the language, the sentence could be written in a barely intelligible way (with spelling, conjugation, grammar, syntax errors, ...)
language technique : even in perfectly well-written English, the author could use a rhetorical device to imply a different meaning than the literal sense of the words (humor, derision, irony, sarcasm, ...)
context : taking a sentence out of its context can completely change its meaning
subjectivity : different people will interpret the same sentence differently depending on their personal way of thinking

In this article, we are going to cover different Azure services that we can use to predict the sentiment of tweets.

Spoiler : Each Azure service has its own purpose and offers more or less simplicity at the cost of control over the underlying prediction model.

All the code is available in Air Paradis : Detect bad buzz with deep learning.

Exploratory Data Analysis

Complete code available in notebook.ipynb

In this section, we are going to perform an EDA to understand the text and target variables.

The data we are going to use is Kaggle - Sentiment140 dataset :

text : 1.6 million tweets
- low language quality : many Twitter specific words ("RT", @username, #hashtags, urls, slang, ... )
target : binary categorical variable representing the sentiment of the tweet
- 0 = NEGATIVE
- 4 = POSITIVE
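
As a minimal sketch of how this file can be read, the snippet below parses Sentiment140-style rows and maps the 0 / 4 targets to labels. It assumes the usual Kaggle layout (no header row, six columns in the order target, ids, date, flag, user, text); the `load_tweets` helper and the two sample rows are purely illustrative and are not taken from the project notebooks.

```python
import csv
import io

# Column order assumed from the Kaggle Sentiment140 description (the file has no header row).
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
LABELS = {"0": "NEGATIVE", "4": "POSITIVE"}


def load_tweets(handle):
    """Read (text, label) pairs from an open Sentiment140-style CSV file."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    return [(row["text"], LABELS[row["target"]]) for row in reader]


# Tiny in-memory sample standing in for the real 1.6 million row file.
sample = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","alice","so tired of work"\n'
    '"4","2","Mon Apr 06","NO_QUERY","bob","thank you all!"\n'
)
tweets = load_tweets(sample)
print(tweets)
```

With the real file, the same function can be given a handle opened with `open(path, encoding="latin-1", newline="")`, assuming the file is Latin-1 encoded as commonly reported for this dataset.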

Target variable

Let's have a look at how the target variable is distributed.

The target variable is perfectly balanced :

Target variable distribution

Text variable

Let's have a look at what the text variable looks like.

Examples :

"@SexyLexy54321 I dont wanna look like a clown!! lol I dont have yellow." -- @LucasLover321

"@Yveeeee And try to get me autographs, okay? " -- @sarahroters

"goodnight to everyone live at other side of world it's sunny in here =]" -- @dizaynBAZ

Text length

NEGATIVE tweets are slightly (not significantly) longer than POSITIVE tweets.

In both classes, there are two modes :

~45 characters and ~138 characters (close to the 140-character limit in force at the time the data was extracted) :

Text length distribution

~7 words and ~20 words :

Text word count distribution
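
The length statistics above can be reproduced with a few lines of plain Python. This is only a sketch : `length_summary` is a hypothetical helper, and it counts words by splitting on whitespace, which may differ from the tokenization used in the notebook.

```python
from collections import defaultdict
from statistics import mean


def length_summary(pairs):
    """Average character and word counts per sentiment label."""
    chars, words = defaultdict(list), defaultdict(list)
    for text, label in pairs:
        chars[label].append(len(text))
        words[label].append(len(text.split()))  # whitespace tokenization (assumption)
    return {
        label: {"chars": mean(chars[label]), "words": mean(words[label])}
        for label in chars
    }


# Illustrative pairs, not real rows from the dataset.
pairs = [("a b c", "NEGATIVE"), ("hello", "POSITIVE"), ("one two", "NEGATIVE")]
summary = length_summary(pairs)
print(summary)
```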

Words importance

Let's see what words are most important in the text variable.

After cleaning the text (lowercase, stopwords, SpaCy lemmatization), we can see the most common words (Tf-Idf weighted) in the dataset :

Text word count distribution
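
To make the Tf-Idf weighting concrete, here is a simplified sketch in plain Python. It is not the notebook's pipeline : there is no SpaCy lemmatization, the stopword list is a tiny illustrative one, and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` is one common variant chosen as an assumption.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"i", "the", "a", "to", "my", "is", "and", "it", "you", "so"}


def clean(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (no lemmatization here)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_totals(texts):
    """Sum the Tf-Idf weight of every term over all documents."""
    docs = [clean(t) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            totals[term] += tf * idf
    return totals


totals = tfidf_totals(["I love my dog", "I love my cat", "work is so boring"])
print(totals.most_common(3))
```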

Topic modeling

Let's see what topics (group of words frequently found together) are important in the text variable.

Running an LSA (Latent Semantic Analysis) on the cleaned text, we can identify topics :

Topics

Running a simple Logistic Regression, we can measure the importance of each topic towards the target variable :

Topics importance

We can see that the most important topics are :

NEGATIVE topics :
- topic #3 : "work"
- topic #6 : "miss"
- topic #10 : "want", "get", "home", "sleep"
POSITIVE topics :
- topic #2 : "thank"
- topic #7 : "love"
- topic #4 : "work", "good", "morning", "thank"
- topic #8 : "go", "love", "sleep", "bed"

Protocol

We are going to split our dataset into a train dataset and a test dataset, and compare the classification results according to different binary classification metrics :

Confusion Matrix : common way of presenting True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) predictions.
Precision : measures how many observations predicted as positive are in fact positive.
Recall or Sensitivity : measures how many of all the positive observations were classified as positive.
Specificity : measures how many of all the negative observations were classified as negative.
Accuracy : measures how many observations, both positive and negative, were correctly classified.
F1-score: combines Precision and Recall into one metric.
Average Precision (AP) : average of precision scores calculated for each recall threshold.
ROC AUC : area under the ROC curve, which summarizes the tradeoff between True Positive Rate (TPR) and False Positive Rate (FPR) across all classification thresholds.
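
The threshold-based metrics in this list all derive from the four cells of the confusion matrix. The sketch below shows the formulas; the counts are hypothetical and chosen only for illustration, and the function does not guard against empty classes (a division by zero would be raised). Average Precision and ROC AUC are not covered here because they need the predicted scores, not just the counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the confusion-matrix metrics used in this article."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, not results from the experiments below.
metrics = binary_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name} : {value:.4f}")
```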

AI as a Service

Complete code available in 3_azure_sentiment_analysis.ipynb

In this section, we are going to evaluate Azure's AIaaS fully-managed cloud service : Azure Cognitive Services - Sentiment Analysis API.

Before using Azure's Sentiment Analysis API, we need to create a Language resource with the standard (S) pricing tier, as explained in the Quickstart: Sentiment analysis and opinion mining.

Data preparation

Using Azure's Sentiment Analysis API does not require any data preparation. We just need to send the text we want to analyze to the API, and it will return the most likely sentiment label (POSITIVE, NEGATIVE or NEUTRAL), as well as confidence scores for each label.
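
A call to the service could look like the sketch below, which uses the `azure-ai-textanalytics` Python package. Several things are assumptions on our side : the rule that collapses the three confidence scores into the two classes of our dataset, the batch size of 10 documents per request, and the helper names. The SDK is imported inside `analyze` so that the helpers can be tried without it.

```python
def to_binary_label(positive, negative):
    """Collapse the service's scores to our two classes (assumed rule)."""
    return "POSITIVE" if positive >= negative else "NEGATIVE"


def batches(items, size=10):
    """Yield successive chunks; 10 documents per request is an assumed limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def analyze(texts, endpoint, key):
    """Send the texts to the Language resource and return one label per text."""
    # Imported lazily so the helpers above work without the SDK installed.
    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    labels = []
    for batch in batches(texts):
        for doc in client.analyze_sentiment(documents=batch):
            if doc.is_error:
                labels.append(None)  # keep positions aligned with the input
                continue
            scores = doc.confidence_scores
            labels.append(to_binary_label(scores.positive, scores.negative))
    return labels


print(to_binary_label(positive=0.7, negative=0.1))
```

The endpoint and key come from the Language resource created earlier; how they are stored (environment variables, a key vault, ...) is left to the reader.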

Model selection

Azure's fully managed Cognitive Service is a black box. It uses Microsoft's best AI models to perform the analysis, but we have no control over it.

The best information we can get is from Azure's documentation, especially Transparency note for Sentiment Analysis.

Model training

The underlying model is pre-trained and we can't train or fine-tune it ourselves.

Classification results

AIaaS results

We only tested the model on 10,000 tweets in order to limit the cost of this experiment.

Accuracy : 0.714400
F1 : 0.729135
Precision : 0.693362
Recall or Sensitivity : 0.768800
Specificity : 0.660000
Average Precision : 0.74
ROC AUC : 0.77

Pros

no Data Science or Machine Learning experience required
always using Microsoft's up-to-date state-of-the-art model
very easy to set-up and use
no additional costs (model selection, training, deployment, ...)
very cheap for small projects (cf. Cognitive Service for Language pricing)
possible to make use of additional features like Opinion Mining to improve the understanding of the text's mixed sentiments

Cons

no control over the model
the model is not well-balanced (training another classification model on top of the confidence scores could prevent this bias)
cost can become high for large projects (cf. Cognitive Service for Language pricing)
not suitable for critical or highly confidential data (though the model can be deployed on-premises : Install and run Sentiment Analysis containers)
requires an HTTP call to the API, which introduces latency and potential security risks (though the API can be deployed on-premises : Install and run Sentiment Analysis containers)

Automated ML

Complete code available in 6_azureml_automated_ml.ipynb

In this section, we are going to evaluate AzureML Studio's Automated ML.

Before using the service, we need to create a Workspace, as explained in the Tutorial: Train a classification model with no-code AutoML in the Azure Machine Learning studio.

Data preparation

Using Azure's Automated ML service does not require any data preparation. The data just has to be imported in the Workspace as a Dataset.

Model selection

This is where the magic actually happens.

The Automated ML service will automatically build, train and optimize hyper-parameters of many Feature Engineering methods and classification models.

For this experiment, we chose to use the following options :

Deep Learning Featurization : Enabled (requires GPU capability)
- this option is specific to text pre-processing and will integrate a BERT model to extract the embeddings of the words in the text (cf. BERT integration in automated ML)
Primary metric : AUC weighted
Training job time (hours) : 10 hours (in order to limit the cost of this experiment)
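
For readers who prefer code over the Studio UI, the options above map, as far as we can tell, to the following `AutoMLConfig` settings of the AzureML SDK v1 (`azureml-train-automl`). This is a sketch under assumptions : the `target` label column name and both helper functions are hypothetical, and the SDK is imported lazily so that the settings can be inspected without it.

```python
def automl_settings(label_column="target", hours=10):
    """Settings mirroring the options listed above (SDK v1 parameter names)."""
    return {
        "task": "classification",
        "primary_metric": "AUC_weighted",
        "label_column_name": label_column,  # hypothetical column name
        "experiment_timeout_hours": hours,
        "enable_dnn": True,  # Deep Learning Featurization (BERT), needs GPU compute
    }


def build_config(training_data, compute_target):
    """Create the AutoML configuration for a registered Dataset and a compute target."""
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(
        training_data=training_data,
        compute_target=compute_target,
        **automl_settings(),
    )


print(automl_settings())
```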

Model training

Each model created by the Automated ML service is trained automatically, nothing to do here.

Classification results

The service has tested and compared multiple algorithms before selecting the best one :

AzureML - AutomatedML - 10h on GPU - models

The best model is a LightGBM with MaxAbsScaler, with a fine-tuned BERT model :

Best Model

Confusion Matrix	Precision Recall Curve (AP = 0.942)	ROC Curve (AUC = 0.942)

Accuracy : 0.867137
F1 : 0.867608
Precision : 0.870689
Recall or Sensitivity : 0.864549
Specificity : 0.869763
Average Precision : 0.942
ROC AUC : 0.942

Pros

the classification results are very good
the model is very well balanced
the model is actually fitted to the domain data
no Data Science or Machine Learning experience required, but you must be familiar with using cloud services
limited cost : once the best model has been identified, re-training it can be quite fast and inexpensive

Cons

the AutoML experiment can be expensive (but controlled) : you need to pay for the training and evaluation of many models before the best one is identified
once the best model has been identified, you need to deploy it to be able to use it in production, which requires Cloud Infrastructure skills

Designer

Complete code available in 7_azureml_designer.ipynb

In this section, we are going to evaluate AzureML Studio's Designer.

Before using the service, we need to create a Workspace, as explained in the Tutorial: Designer - train a no-code regression model.

This is what our pipeline looks like :

AzureML Designer - Pipeline

Data preparation

Using the Designer's UI, we created multiple data pre-processing steps :

Edit Metadata : columns renaming
Partition and Sample : data sampling to reduce the dataset size
Preprocess Text : text cleaning (special characters removal, stopwords, lowercase, lemmatization, ...)
Split Data : data splitting into train and test datasets

At this stage, we will compare two text feature extraction methods :

Feature Hashing : simple conversion of text tokens into a numeric representation
Extract N-Gram Features : takes into account sequences of consecutive tokens
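
To illustrate the difference between the two methods, here is a toy version of each in plain Python. It only shows the principle : the Designer components have their own implementation and options, and the MD5-based bucket assignment and the 16 buckets below are stand-ins chosen for this sketch.

```python
import hashlib


def ngrams(tokens, n=2):
    """All runs of 1..n consecutive tokens, joined with a space."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


def hash_features(tokens, n_buckets=16):
    """Count tokens in a fixed-size vector indexed by a stable hash of each token."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector


tokens = "not good at all".split()
print(ngrams(tokens))         # unigrams and bigrams keep "not good" together
print(hash_features(tokens))  # fixed length, but word order is lost
```

The bigram "not good" is a feature of its own in the first representation, whereas feature hashing on single tokens only sees "not" and "good" independently, which may partly explain the gap between the two models.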

Model selection

In this experiment, we will only use the simple Two-Class Logistic Regression.

Model training

We simply use the Train Model component to train our models on the train dataset.

Classification results

The models are scored thanks to the Score Model component, and the results are displayed thanks to the Evaluate Model component.

The test dataset goes through the same text pre-processing and vectorization steps as the training dataset, before being used to test the model.

Model	AP	ROC AUC
Feature Hashing	0.663	0.726
N-Gram Features	0.723	0.811

We can see that the N-Gram Features model performs better than the Feature Hashing model.

Accuracy : 0.730469
F1 : 0.734819
Precision : 0.723147
Recall or Sensitivity : 0.746875
Specificity : 0.714063
Average Precision : 0.723
ROC AUC : 0.811

The results here are not really relevant to our article, since we didn't design a very performant model. The goal was to demonstrate the use of the Designer's UI.

We could have improved the results by :

using more data for training (change the sampling rate)
testing different models (cf. How to select algorithms for Azure Machine Learning)
tuning the hyper-parameters of the model (cf. Tune Model Hyperparameters)

Pros

the model is actually fitted to the domain data
the results of each step are cached to be reused in a future run (if the previous steps are unchanged)
- this accelerates next runs and saves on compute time/money
it is possible to view (part of) the results of each step after a run
- this helps understand what is actually happening during a step
no coding skills required, but Data Science and Machine Learning experience is necessary to design a performant model, and you still need to be familiar with cloud services

Cons

you need the same amount of trial and error to find the best model as with classic Machine Learning
there is no way to easily version your pipeline
you need to be familiar with drag-and-drop pipeline designing UIs
once you are satisfied with your model, you need to deploy it to be able to use it in production, which requires Cloud Infrastructure skills

Notebooks

Complete code available in 9_azureml_notebooks.ipynb

In this section, we are going to evaluate AzureML Studio's Notebooks.

Before using the service, we need to create a Workspace, as explained in the Tutorial: Train and deploy an image classification model with an example Jupyter Notebook.

In this experiment, we will build, train, deploy and test a custom Deep Neural Network (DNN) to expose a REST API for our tweets sentiment prediction.

The code deployed in the Notebooks environment consists of :

main.ipynb : this is the main Notebook where our data is prepared, our model is built, trained, deployed and tested
- Prepare : prepare the data for our model
- Train : we use the best model from 8_keras_neural_networks.ipynb : Stacked Bidirectional-LSTM layers on Embedded text
- Deploy : we deploy the model in an ACI (Azure Container Instances) container, which will expose a REST API to query our model for inference
- Test : we run a POST query to check that our model works
score.py : this is the code deployed in the ACI for inference
- init() : load the registered model
- run(raw_data) : process data sent to the REST API and predict the sentiment with the loaded model
conda_dependencies.yml : this defines the dependencies that must be installed in the Inference environment
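
As an illustration of the `init()` / `run(raw_data)` contract, a scoring script could be sketched as follows. This is not the project's actual score.py : the `model` sub-folder, the `{"data": [...]}` payload shape, the `(n, 1)` string input shape and the 0.5 decision threshold are all assumptions. TensorFlow is imported inside `init()` so that `run` can be exercised locally with a stub.

```python
import json
import os

# Filled by init(); "predict" maps a list of strings to POSITIVE probabilities.
STATE = {"predict": None}


def init():
    """Load the registered model once, when the container starts."""
    import tensorflow as tf

    # AZUREML_MODEL_DIR is set by the AzureML inference runtime;
    # the "model" sub-folder is an assumption about how the model was saved.
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    model = tf.keras.models.load_model(model_path)

    def predict(texts):
        batch = tf.constant([[text] for text in texts])  # assumed input shape (n, 1)
        return model.predict(batch).ravel().tolist()

    STATE["predict"] = predict


def run(raw_data):
    """Parse the JSON payload and return one prediction per input text."""
    try:
        texts = json.loads(raw_data)["data"]
        scores = STATE["predict"](texts)
        return [
            {"text": text, "label": "POSITIVE" if score >= 0.5 else "NEGATIVE", "score": score}
            for text, score in zip(texts, scores)
        ]
    except Exception as error:
        return {"error": str(error)}
```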

This is what our model looks like :

AzureML Notebooks - Model

Data preparation

This part is implemented in the main.ipynb notebook :

the data is loaded from the Dataset in the Workspace thanks to the azureml library
no need for data preparation, since the text_vectorization and embedding layers of our DNN will do the job.
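
A model of this kind (text vectorization, embedding, then stacked Bidirectional-LSTM layers) could be declared with Keras roughly as below. This is a sketch, not the architecture from 8_keras_neural_networks.ipynb : every hyper-parameter in `ModelConfig` is a placeholder value, and the optimizer and loss are common defaults that we assume rather than know.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Placeholder hyper-parameters, not the values used in the notebook.
    max_tokens: int = 20000
    sequence_length: int = 50
    embedding_dim: int = 100
    lstm_units: int = 64


def build_model(train_texts, config=None):
    """Build a raw-text-in, probability-out classifier."""
    import tensorflow as tf  # imported lazily; only needed when the model is built

    config = config or ModelConfig()
    layers = tf.keras.layers

    # The vectorizer learns its vocabulary from the training texts.
    vectorizer = layers.TextVectorization(
        max_tokens=config.max_tokens,
        output_sequence_length=config.sequence_length,
    )
    vectorizer.adapt(train_texts)

    inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
    x = vectorizer(inputs)
    x = layers.Embedding(config.max_tokens, config.embedding_dim)(x)
    x = layers.Bidirectional(layers.LSTM(config.lstm_units, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(config.lstm_units))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Because the vectorization layer is part of the model, the deployed service can receive raw strings, which is why no separate preparation step is needed.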

Model selection

In this experiment, we don't do any model selection. The selected model is the best of several Artificial Neural Network (ANN) models compared in the 8_keras_neural_networks.ipynb notebook.

Model training

We simply train our model on the train dataset with the Keras fit() method.

We log the training run with MLflow (cf. Track ML models with MLflow and Azure Machine Learning). This allows us to view the evolution of the metrics across training epochs in AzureML Studio :
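
One simple way to send per-epoch metrics to MLflow is sketched below. The `log_history` helper is hypothetical; it takes the logging function as a parameter so it can be tried without MLflow, and in a real run `mlflow.log_metric` would be passed instead of the `print`-based stand-in. MLflow's autologging is another option that we do not cover here.

```python
def log_history(history, log_metric):
    """Send each per-epoch value of a Keras-style history dict to a metric logger."""
    for name, values in history.items():
        for epoch, value in enumerate(values):
            log_metric(name, float(value), step=epoch)


# Illustrative values shaped like `model.fit(...).history`, not real training results.
history = {"loss": [0.60, 0.45], "val_accuracy": [0.78, 0.81]}
log_history(history, lambda name, value, step: print(step, name, value))

# In an AzureML notebook, assuming `ws` is the Workspace and `model` the Keras model:
#   import mlflow
#   mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
#   with mlflow.start_run():
#       result = model.fit(...)
#       log_history(result.history, mlflow.log_metric)
```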

Notebooks train metrics

Model deployment

Once trained, registering our model in our Workspace with MLflow also allows us to easily deploy our model as a REST API in Azure (cf. Deploy MLflow models as Azure web services)

To achieve that, we use the azureml library to :

create an Environment from our conda_dependencies.yml file
fetch our trained model from our Workspace
create an ACI web service
create an Inference configuration from our score.py file
actually deploy our model in the created ACI environment, with the given Inference configuration.

Once the inference environment is started, we can send requests to the endpoint. The run(raw_data) function will process the input text and predict its sentiment with our model.
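
The deployment steps and the test request described above can be sketched with the AzureML SDK v1 (`azureml-core`) and the standard library. The model and service names, the ACI sizing and the `{"data": [...]}` payload shape are assumptions made for this example; the payload must match whatever run(raw_data) actually expects.

```python
import json
import urllib.request


def deploy(workspace, model_name="sentiment-model", service_name="sentiment-api"):
    """Deploy a registered model to ACI and return the scoring URI."""
    # Imported lazily so the request helpers below work without the SDK installed.
    from azureml.core import Environment, Model
    from azureml.core.model import InferenceConfig
    from azureml.core.webservice import AciWebservice

    environment = Environment.from_conda_specification(
        name="inference-env", file_path="conda_dependencies.yml"
    )
    model = Model(workspace, name=model_name)
    deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)
    inference_config = InferenceConfig(entry_script="score.py", environment=environment)
    service = Model.deploy(
        workspace=workspace,
        name=service_name,
        models=[model],
        inference_config=inference_config,
        deployment_config=deployment_config,
    )
    service.wait_for_deployment(show_output=True)
    return service.scoring_uri


def build_request(scoring_uri, texts):
    """Build the POST request carrying the texts as JSON."""
    body = json.dumps({"data": texts}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri, data=body, headers={"Content-Type": "application/json"}
    )


def predict(scoring_uri, texts):
    """Send the texts to the deployed endpoint and decode the JSON answer."""
    with urllib.request.urlopen(build_request(scoring_uri, texts)) as response:
        return json.loads(response.read())
```

A typical session would call `deploy(ws)` once, then `predict(uri, ["I love my friends"])` as many times as needed.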

Classification results

The performance of this model is computed in the 8_keras_neural_networks.ipynb notebook.

Notebooks results

Accuracy : 0.827450
F1 : 0.825740
Precision : 0.834005
Recall or Sensitivity : 0.817638
Specificity : 0.837263
Average Precision : 0.910
ROC AUC : 0.910

The results here are not really relevant to our article, even if they are quite good. The goal was to demonstrate the use of AzureML Notebooks and how to deploy a model in production.

Pros

the model is actually fitted to the domain data
you can easily version your code, and a peer can easily review it
it is possible to view the metrics evolution during training
this method offers the same flexibility, control and developer experience as coding in JupyterLab, with the added benefits of :
- being able to adapt the available resources (add more CPU, GPU, disk or memory)
- being fully integrated in Azure, which makes it easy to deploy models (cf. Deploy machine learning models to Azure) and view experiment results in AzureML Studio

Cons

this doesn't find the best model for you
you need to have Data Science and Machine Learning experience to build your model
once you are satisfied with your model, you need to deploy it to be able to use it in production, which requires Cloud Infrastructure skills

Conclusion

In our context, the best course of action was to use Automated ML to build a very efficient model, and deploy it in production with AzureML Notebooks.

In this article, we have seen :

how to very easily set up an AI service using Azure Cognitive Services with zero technical knowledge
how to very easily create a very performant prediction model using AzureML Automated ML with no Data Science or Machine Learning knowledge (but a good understanding of AzureML Studio and AutoML)
how to build a data processing and model training and evaluation pipeline using AzureML Designer with no coding skills (but a good knowledge of Data Science and Machine Learning)
how to develop, train and deploy in production a custom model using AzureML Notebooks

Each AzureML service has its own purpose and offers more or less simplicity at the cost of control over the model.