Custom Models: Neural Networks with Keras

In this notebook, we will use the Keras library to implement different Artificial Neural Network (ANN) models.

We will compare these models to the baseline model from 1_baseline.ipynb.

Load project modules and data

We will use basic Python packages, as well as TensorFlow and Keras, to build our Neural Network models.

Basic FFNN (Feed Forward Neural Network) model

FFNN on simple word count vectors

In this model, we will use a simple word count vector as our input, with no text preprocessing. The Neural Network architecture will be a simple feed-forward network with two hidden layers.
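A minimal Keras sketch of such a network might look like the following; the vocabulary size and layer widths are placeholder assumptions, not the values used in this notebook.

```python
import tensorflow as tf

# Assumed sizes for illustration only.
VOCAB_SIZE = 10_000

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(VOCAB_SIZE,)),      # one count per vocabulary word
    tf.keras.layers.Dense(64, activation="relu"),    # first hidden layer
    tf.keras.layers.Dense(64, activation="relu"),    # second hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # POSITIVE / NEGATIVE probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```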

The performance on the train dataset is better than on the test dataset, which indicates that our model has slightly over-fitted.

The performance on the test dataset is much better than our baseline model's:

Our model is very well balanced: it predicted only 0.6% (baseline = 35%, -98%) more NEGATIVE (160499) messages than POSITIVE (159501).

FFNN on SpaCy embedded documents

In this model, we use SpaCy to perform the text preprocessing (lemmatization and vectorization). The document vector represents the average of the SpaCy vectors of the tokens in the document.
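As a toy illustration of this averaging (the way spaCy's `Doc.vector` averages token vectors), with made-up 4-dimensional token vectors:

```python
import numpy as np

# Made-up token vectors for a two-token document.
token_vectors = np.array([
    [1.0, 0.0, 2.0, 0.0],  # vector for token 1
    [3.0, 2.0, 0.0, 4.0],  # vector for token 2
])

# The document vector is the element-wise mean of the token vectors.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # [2. 1. 1. 2.]
```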

The Neural Network architecture is the same as before.

The performances on the train and test datasets are similar, which indicates that our model has not over-fitted.

The performance on the test dataset is similar to our previous model's (slightly worse). The model is still balanced, but slightly less so than before.

FFNN on Doc2Vec embedded documents

In this model, we train a Gensim Doc2Vec document embedding model to perform the text preprocessing (lemmatization and vectorization).

The Neural Network architecture is the same as before.

The performances on the train and test datasets are similar, which indicates that our model has not over-fitted.

The performance on the test dataset is better than our baseline model's, but not as good as our previous models':

Our model is quite well balanced: it predicted only 3.2% (baseline = 35%, -91%) more POSITIVE (162531) messages than NEGATIVE (157469).

FFNN models with Embedding layer

In the following models, we will add an embedding layer to our architecture. The goal of such a layer is to represent words as vectors so that similar words have similar vectors and semantic relationships between words are preserved (e.g., "Paris" is to "France" as "London" is to "England").

The input of an embedding layer is an encoded version of the text data. As text preprocessing, we are going to test two encoding methods.

The Neural Network architecture after the embedding layer remains the same as before.
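A sketch of placing an Embedding layer in front of the same dense head might look like this; vocabulary size, sequence length, and embedding dimension are placeholder assumptions.

```python
import tensorflow as tf

# Assumed sizes for illustration only.
VOCAB_SIZE, SEQ_LEN, EMB_DIM = 10_000, 50, 32

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN,)),         # sequence of word ids
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),  # word id -> learned vector
    tf.keras.layers.Flatten(),                       # flatten for the dense head
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```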

Embedding layer on simple encoded text

In this model, we will use a basic encoding method: each word is converted to a number.
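A toy word-to-integer encoder illustrating this idea; reserving index 0 for padding and index 1 for out-of-vocabulary words is a common convention, assumed here.

```python
def build_vocab(texts):
    """Map each distinct word to an integer id; 0 = padding, 1 = unknown."""
    vocab = {"<pad>": 0, "<oov>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Convert a text to a list of word ids, mapping unseen words to <oov>."""
    return [vocab.get(w, vocab["<oov>"]) for w in text.lower().split()]

vocab = build_vocab(["I love this", "I hate this"])
print(encode("I love pizza", vocab))  # [2, 3, 1]
```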

The performance on the train dataset is better than on the test dataset, which indicates that our model has slightly over-fitted.

The performance on the test dataset is similar to our previous models.

This model is also more biased than the previous ones.

Embedding layer on BERT-encoded text

In this model, we will use a more sophisticated encoding method: we will use the BERT tokenizer to encode the text. This method performs sub-word tokenization.
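To illustrate sub-word tokenization, here is a toy greedy longest-match tokenizer in the spirit of BERT's WordPiece algorithm (not the real BERT tokenizer); the tiny vocabulary is made up for the example.

```python
# Made-up sub-word vocabulary; "##" marks a word-internal piece.
VOCAB = {"un", "##believ", "##able", "believ", "able"}

def wordpiece(word):
    """Greedily match the longest known sub-word piece, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no sub-word matched this position
        start = end
    return pieces

print(wordpiece("unbelievable"))  # ['un', '##believ', '##able']
```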

The performance on the train dataset is better than on the test dataset, which indicates that our model has slightly over-fitted.

The performance on the test dataset is similar to our previous models.

One good thing: this model is almost perfectly balanced! It predicted only 10 more NEGATIVE (160005) messages than POSITIVE (159995).

RNN (Recurrent Neural Network) models

In the following models, we will add RNN or (Bidirectional-)LSTM layers to our architecture. The goal of such layers is to carry information from the previous steps of the sequence forward while processing the next step.

The Neural Network architecture after the recurrent layer remains the same as before (with the Embedding layer).

Simple RNN on Embedded text

In this model, we add a simple RNN to our model.

Compressed (left) and unfolded (right) basic recurrent neural network.
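A Keras sketch of this step might look like the following; the SimpleRNN width and the embedding sizes are placeholder assumptions.

```python
import tensorflow as tf

# Assumed sizes for illustration only.
VOCAB_SIZE, SEQ_LEN, EMB_DIM = 10_000, 50, 32

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),
    tf.keras.layers.SimpleRNN(64),                 # recurrent layer over the sequence
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```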

The performances on the train and test datasets are similar, which indicates that our model has not over-fitted.

The performance on the test dataset is slightly better than our previous models':

Our model is much less biased than our baseline, but still leans towards the NEGATIVE class: it predicted 16% (baseline = 35%, -54%) more NEGATIVE (171614) messages than POSITIVE (148386).

LSTM on Embedded text

In this model, we add an LSTM (Long Short-Term Memory) layer instead of the RNN layer.

Long short-term memory unit
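Swapping the recurrent layer is a one-line change from the SimpleRNN sketch above; the sizes remain placeholder assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,)),
    tf.keras.layers.Embedding(10_000, 32),
    tf.keras.layers.LSTM(64),                      # LSTM instead of SimpleRNN
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```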

The performance on the train dataset is better than on the test dataset, which indicates that our model has slightly over-fitted.

The performance on the test dataset is slightly better than our previous models':

Our model is much less biased than our baseline, but still leans towards the POSITIVE class: it predicted 11% (baseline = 35%, -69%) more POSITIVE (168543) messages than NEGATIVE (151457).

Bidirectional LSTM on embedded text

In this model, we add a Bidirectional-LSTM layer instead of the RNN layer. The goal of such a layer is to use information from past (backward) and future (forward) states simultaneously.

Structure of RNN and BRNN
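In Keras this is expressed by wrapping the LSTM in a Bidirectional layer, as in the sketch below; all sizes are still placeholder assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,)),
    tf.keras.layers.Embedding(10_000, 32),
    # Runs one LSTM forward and one backward, concatenating their outputs.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```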

The performance on the train dataset is better than on the test dataset, which indicates that our model has slightly over-fitted.

The performance on the test dataset is slightly better than our previous models':

Our model is almost unbiased! It still slightly leans towards the NEGATIVE class: it predicted only 4.1% (baseline = 35%, -88%) more NEGATIVE (163235) messages than POSITIVE (156765).

Stacked Bidirectional-LSTM layers on Embedded text

In this model, we add a second Bidirectional-LSTM layer.

Stacked Long Short-Term Memory architecture
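To stack recurrent layers in Keras, the lower layer must return its full output sequence rather than only its final state; a sketch with the same placeholder sizes:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,)),
    tf.keras.layers.Embedding(10_000, 32),
    # return_sequences=True passes the whole sequence to the next layer.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```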

The performance on the train dataset is better than on the test dataset, which indicates that our model has slightly over-fitted.

The performance on the test dataset is slightly better than our previous models':

Our model is almost unbiased! It still slightly leans towards the NEGATIVE class: it predicted only 4.0% (baseline = 35%, -89%) more NEGATIVE (163140) messages than POSITIVE (156860).