AzureML Studio : Designer

In this notebook, we will use AzureML Studio's Designer to build our data processing pipeline.

The experiment is visible in AzureML Studio: oc-p7-automated-ml

AzureML Designer - Pipeline

We will compare this pre-trained local model to the baseline model from 1_baseline.ipynb.

Text preprocessing

Before training our models, the data is prepared as follows:

Text vectorization

We need to represent the text as a vector of numbers.

Feature Hashing

In this version, we will use the Feature Hashing module to extract features from the text, with the following parameters:
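As a rough illustration of the hashing trick behind this step, here is a minimal standard-library sketch. It is not the Designer module itself: the tokenizer, the md5 hash, and the `n_buckets` value are assumptions chosen for readability, not the pipeline's settings.

```python
import hashlib
import re


def hash_features(text, n_buckets=16):
    """Map a text to a fixed-length count vector using the hashing trick."""
    vector = [0] * n_buckets
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        # md5 serves only as a stable hash here; Python's built-in hash()
        # is salted per process and would not be reproducible.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector


vec = hash_features("I love this movie, I really love it")
print(len(vec), sum(vec))
```

No vocabulary is stored: the vector length is fixed in advance, and distinct tokens may collide in the same bucket, which is the trade-off of this approach.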

N-Gram Features

In this version, we will use the Extract N-Gram Features from Text module to extract features from the text, with the following parameters:

This creates a vocabulary that is specific to our training data; the same vocabulary is then reused to vectorize the test data.
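The idea of fitting a vocabulary on the training data and reusing it at test time can be sketched in plain Python. The helper names (`ngrams`, `fit_vocabulary`, `transform`) and the choice of unigrams plus bigrams are hypothetical, for illustration only; the Designer module has its own options and weighting schemes.

```python
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max found in a text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fit_vocabulary(train_texts, n_max=2):
    """Build the n-gram -> column index mapping from the training texts only."""
    grams = sorted({g for text in train_texts for g in ngrams(text, n_max)})
    return {gram: index for index, gram in enumerate(grams)}


def transform(text, vocabulary, n_max=2):
    """Count the n-grams of a text, ignoring any n-gram absent from the vocabulary."""
    vector = [0] * len(vocabulary)
    for gram, count in Counter(ngrams(text, n_max)).items():
        if gram in vocabulary:
            vector[vocabulary[gram]] = count
    return vector


train_texts = ["good movie", "bad movie"]
vocabulary = fit_vocabulary(train_texts)

# "film" and "good good" never appeared in training, so they are dropped.
test_vector = transform("good good film", vocabulary)
print(vocabulary)
print(test_vector)
```

Because unseen n-grams are silently ignored, the test vectors always have the same columns as the training vectors, which is what the trained model expects.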

Model training

We train a Two-Class Logistic Regression model with the following parameters:
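To make concrete what a two-class logistic regression learns from count vectors, here is a small sketch using plain batch gradient descent. It is an assumption-laden toy: the Designer module may use a different optimizer and regularization, and the `learning_rate`, `epochs`, and toy data below are illustrative values, not the pipeline's parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(row, weights, bias):
    """Probability of the POSITIVE class for one feature vector."""
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = predict_proba(row, weights, bias) - label
            grad_w = [g + error * v for g, v in zip(grad_w, row)]
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Toy count vectors: column 0 counts "good", column 1 counts "bad"; label 1 = POSITIVE.
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X, y)
print(weights, bias)
```

The learned weight is positive for the "good" column and negative for the "bad" column, so a message is scored by the weighted sum of its feature counts.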

Results

The test dataset goes through the same text pre-processing and vectorization steps as the training dataset, before being used to test the model.

| Model | Confusion Matrix | AP | Precision Recall Curve | ROC AUC | ROC Curve |
| --- | --- | --- | --- | --- | --- |
| Feature Hashing | (figure) | 0.663 | (figure) | 0.726 | (figure) |
| N-Gram Features | (figure) | 0.723 | (figure) | 0.811 | (figure) |

The N-Gram Features model outperforms the Feature Hashing model on both metrics: AP is 0.723 versus 0.663, and ROC AUC is 0.811 versus 0.726.
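For readers unfamiliar with the two metrics, the sketch below computes average precision (AP) and ROC AUC from scratch on made-up scores. The data is purely illustrative and the exact AP definition used by the Designer may differ slightly; this version assumes there are no tied scores and at least one example of each class.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, label in zip(scores, y_true) if label == 1]
    negatives = [s for s, label in zip(scores, y_true) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Illustrative labels (1 = POSITIVE) and predicted scores, not pipeline outputs.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

ap = average_precision(y_true, scores)
auc = roc_auc(y_true, scores)
print(round(ap, 3), auc)  # 0.833 0.75
```

Both metrics depend only on how the model ranks the messages, not on the decision threshold, which makes them suitable for comparing the two vectorization variants.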

Performance on the dataset is similar to that of our baseline model:

Unlike our baseline model, this model is fairly balanced, with only a slight bias towards the POSITIVE class. It predicted 6.8% more POSITIVE messages (3305) than NEGATIVE messages (3095), compared with 35% for the baseline, which is a relative reduction of 81%.
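The bias figures can be checked with a few lines of arithmetic. The counts and the 35% baseline value are taken from the paragraph above; the variable names are only illustrative.

```python
positive, negative = 3305, 3095   # predicted counts for this model
baseline_excess = 0.35            # surplus of POSITIVE predictions for the baseline

# Relative surplus of POSITIVE over NEGATIVE predictions.
excess = positive / negative - 1

# Change of that surplus relative to the baseline.
relative_change = excess / baseline_excess - 1

print(f"{excess:.1%}")           # 6.8%
print(f"{relative_change:.0%}")  # -81%
```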