"Air Paradis" is an airline company who's marketing department wants to be able to detect quickly "bad buzz" on social networks, to be able to anticipate and address issues as fast as possible. They need an AI API that can detect "bad buzz" and predict the reason for it.
The goal here is to evaluate different approaches to detecting "bad buzz".
After exploring our dataset, we will compare the different approaches.
The helper functions and project-specific code are placed in ../src/.
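For reference, the imports in the next cell assume a layout along these lines (a sketch inferred from the import statements; the actual tree may contain more modules):

```
../src/
├── data/
│   └── helpers.py           # data loading and memory helpers
└── visualization/
    └── helpers.py           # plotting wrappers
```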
We will use the Python programming language, presenting the code and results in this JupyterLab notebook.
We will use the usual libraries for data exploration, modeling and visualisation:
We will also use libraries specific to the goals of this project:
# Standard library modules
import os
import pickle
import sys

# Make the project's helper code in ../src importable
src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
    sys.path.append(src_path)

# Import custom helper libraries
import data.helpers as data_helpers
import visualization.helpers as viz_helpers

# Maths modules
import pandas as pd
from scipy.stats import f_oneway

# Viz modules
import plotly.express as px
import plotly.io as pio

# Render for export
pio.renderers.default = "notebook"
We are going to load the data and analyse the distribution of each variable.
Let's download the data from Kaggle's "Sentiment140 dataset with 1.6 million tweets" dataset page.
# Download and unzip CSV files
!cd .. && make dataset && cd notebooks
>>> Downloading and extracting data files... Data files already downloaded. >>> OK.
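The make dataset target is defined in the project's Makefile and not reproduced here. As a rough, hypothetical equivalent using the official kaggle package (assuming the kazanova/sentiment140 dataset slug and a configured ~/.kaggle/kaggle.json API token):

```python
# Hypothetical equivalent of `make dataset`, using the official kaggle package
# (assumes ~/.kaggle/kaggle.json credentials and the kazanova/sentiment140 slug)
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files(
    "kazanova/sentiment140",
    path="../data/raw",
    unzip=True,  # extracts training.1600000.processed.noemoticon.csv
)
```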
Now we can load the data.
# Load data from CSV
df = pd.read_csv(
    os.path.join("..", "data", "raw", "training.1600000.processed.noemoticon.csv"),
    names=["target", "id", "date", "flag", "user", "text"],
    encoding="latin-1",  # the raw Sentiment140 CSV is not UTF-8 encoded
)
# Reduce memory usage
df = data_helpers.reduce_dataframe_memory_usage(df)
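reduce_dataframe_memory_usage lives in ../src/data/helpers.py and is not reproduced in this notebook; conceptually, it downcasts each column to the smallest suitable dtype. A minimal sketch of the idea (the name downcast_dataframe is hypothetical, and the actual helper differs, e.g. it produces the nullable Int8/UInt32/string/category dtypes shown by df.info() below):

```python
import pandas as pd

def downcast_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a dtype-downcasting helper; the real helper in ../src differs."""
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer dtype that can hold the values
            downcast = "unsigned" if (series >= 0).all() else "integer"
            df[col] = pd.to_numeric(series, downcast=downcast)
        elif pd.api.types.is_object_dtype(series):
            # Encode low-cardinality text (e.g. `flag`) as category
            if series.nunique() / len(series) < 0.5:
                df[col] = series.astype("category")
            else:
                df[col] = series.astype("string")
    return df
```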
Let's display a few examples, find out how many data points are available, and look at the variables and their distributions.
# Display first few rows
df.head(5)
|   | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
# Display number of rows and column types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   target  1600000 non-null  Int8
 1   id      1600000 non-null  UInt32
 2   date    1600000 non-null  string
 3   flag    1600000 non-null  category
 4   user    1600000 non-null  string
 5   text    1600000 non-null  string
dtypes: Int8(1), UInt32(1), category(1), string(3)
memory usage: 48.8 MB
There are 1,600,000 rows, each composed of 6 columns: target, id, date, flag, user and text.
We are only interested in the target and text variables; the other columns are not useful for our analysis.
# Drop useless columns
df.drop(columns=["id", "date", "flag", "user"], inplace=True)
# Replace target values with labels
df.target = df.target.map(
{
0: "NEGATIVE",
2: "NEUTRAL",
4: "POSITIVE",
}
)
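Note that the Sentiment140 training split only contains the 0 and 4 labels (the describe() output below confirms just two unique target values), so the NEUTRAL mapping is there for completeness. A quick sanity check, not part of the original run:

```python
# Only NEGATIVE and POSITIVE should appear, and any unmapped
# raw value would have become <NA>
df.target.value_counts(dropna=False)
```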
# Display basic statistics
df.describe(include="all")
|   | target | text |
|---|---|---|
| count | 1600000 | 1600000 |
| unique | 2 | 1581466 |
| top | NEGATIVE | isPlayer Has Died! Sorry |
| freq | 800000 | 210 |
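The text column has only 1,581,466 unique values out of 1,600,000 rows, and the most frequent tweet appears 210 times, so roughly 18,500 rows are exact duplicates. An exploratory check (not part of the original pipeline) could list the most repeated texts:

```python
# Exploratory check: show the ten most frequently duplicated tweet texts
df.text.value_counts().head(10)
```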
# Plot target distribution
viz_helpers.histogram(
df, label_x="target", label_colour="target", title="Target distribution"
)
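viz_helpers.histogram is a project wrapper from ../src/visualization/helpers.py. Assuming it delegates to plotly.express, a hypothetical bare-bones equivalent of the call above would be:

```python
# Hypothetical bare-bones equivalent of the viz_helpers.histogram wrapper
fig = px.histogram(df, x="target", color="target", title="Target distribution")
fig.show()
```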