In this notebook, we will use the AzureML Studio Designer to design our data processing pipeline.
The experiment is visible in AzureML Studio: oc-p7-automated-ml
We will compare this pre-trained local model to the baseline model from 1_baseline.ipynb.
Before training our models, the data is prepared as follows:
We need to represent the text as a vector of numbers.
In the first version, we use the Feature Hashing module to extract features from the text, with the following parameters:
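As an illustration of the feature-hashing idea, here is a local scikit-learn sketch (not the Designer module itself; the hashing bitsize and other parameters below are assumed, not the module's actual configuration):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Sketch of feature hashing with scikit-learn (assumed parameters,
# not the Designer module's actual configuration).
# Each token is hashed into one of 2**10 = 1024 fixed columns,
# so no vocabulary needs to be stored or fitted.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vectorizer.transform(["great movie", "terrible plot"])
print(X.shape)  # (2, 1024)
```

Because the mapping is a fixed hash function, the same vectorizer can be applied to unseen text without any fitting step.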
In the second version, we use the Extract N-Gram Features from Text module to extract features from the text, with the following parameters:
This creates a vocabulary that is specific to our training data and that will be used for testing our model.
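The train-then-reuse vocabulary behaviour can be sketched locally with scikit-learn (toy data and assumed parameters, not the Designer module's actual configuration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
train_texts = ["good movie", "bad movie", "good acting"]
test_texts = ["bad acting"]

# The n-gram vocabulary is learned from the training data only.
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_train = vectorizer.fit_transform(train_texts)   # vocabulary built here
X_test = vectorizer.transform(test_texts)         # same vocabulary reused
print(sorted(vectorizer.vocabulary_))
```

N-grams present only in the test set are simply ignored, which is exactly why the same fitted vocabulary must be carried over to the scoring branch of the pipeline.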
We train a Two-Class Logistic Regression model with the following parameters:
The test dataset goes through the same text pre-processing and vectorization steps as the training dataset, before being used to test the model.
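This train/score flow can be sketched locally with a scikit-learn pipeline, which guarantees that the test data passes through exactly the same vectorization fitted on the training data (toy data; the Designer modules' actual parameters are not reproduced here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data for illustration only.
train_texts = ["love it", "hate it", "love this", "hate this"]
train_labels = [1, 0, 1, 0]

# The pipeline applies the fitted vectorizer to any new text
# before scoring, mirroring the Designer's shared preprocessing branch.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(["love that"]))
```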
Model | Confusion Matrix | AP | Precision Recall Curve | ROC AUC | ROC Curve
---|---|---|---|---|---
Feature Hashing | ![]() | 0.663 | ![]() | 0.726 | ![]()
N-Gram Features | ![]() | 0.723 | ![]() | 0.811 | ![]()
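The AP and ROC AUC columns can be reproduced from scored probabilities with scikit-learn; the snippet below uses made-up scores purely to show the computation (the table values themselves come from AzureML's Evaluate Model module):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Illustrative labels and predicted probabilities -- not the real model output.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]

ap = average_precision_score(y_true, y_score)   # area under the PR curve
auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
print(ap, auc)
```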
We can see that the N-Gram Features model performs better than the Feature Hashing model (AP 0.723 vs 0.663, ROC AUC 0.811 vs 0.726).
Its performance on this dataset is similar to our baseline model's.
Unlike our baseline model, this model is quite balanced, only slightly biased towards the POSITIVE class: it predicted 6.8% more POSITIVE (3305) than NEGATIVE (3095) messages, compared with 35% for the baseline, an 81% reduction in imbalance.
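The imbalance figures above can be checked with quick arithmetic on the predicted class counts (the 35% baseline figure is taken from 1_baseline.ipynb):

```python
# Predicted class counts read from the confusion matrix.
pos, neg = 3305, 3095
imbalance = (pos - neg) / neg            # relative excess of POSITIVE predictions
baseline_imbalance = 0.35                # same measure for the baseline model

print(f"imbalance: {imbalance:.1%}")                           # ~6.8%
print(f"reduction: {1 - imbalance / baseline_imbalance:.0%}")  # ~81%
```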