"My Content" is a start-up who's goal is to encourage people to read by recommending relevant content to users.
In this project, we want to create a mobile app that recommends relevant articles to users based on their implicit preferences, their profiles, and the articles' content. This is known as a Recommender System and is a very common challenge for any content-based website (blog, news, audio/video, ...) or service (social network, marketplace, streaming platform, dating platform, ...).
We will compare different models (Content-Based Filtering, Collaborative Filtering, Matrix Factorization, ...) on the Globo.com dataset. Then, we will integrate one model into a mobile app able to recommend relevant articles to users. Finally, we will use Azure Machine Learning and Azure Functions to store the recommendations in Azure CosmosDB and make them available to the users.
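As a minimal illustration of the content-based approach compared below, one can build a user profile from the embeddings of clicked articles and rank candidates by cosine similarity. This is a sketch with hypothetical toy embeddings and click indices, not the project's actual pipeline:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: 5 articles with 4-dimensional embeddings
rng = np.random.default_rng(42)
article_embeddings = rng.normal(size=(5, 4))

# User profile = mean embedding of the articles the user already clicked
clicked = [0, 2]
user_profile = article_embeddings[clicked].mean(axis=0, keepdims=True)

# Score every article by cosine similarity to the profile,
# mask the already-clicked ones, and rank the rest
scores = cosine_similarity(user_profile, article_embeddings)[0]
scores[clicked] = -np.inf
recommended = np.argsort(scores)[::-1]
print(recommended[:3])
```

The "mean of last click / last session / all clicks" variants in the results table differ only in which clicks feed the profile average.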
The goal of Recommender Systems is to suggest relevant content to users, given:
There are three main categories of recommender systems:
notebooks/
: We will use the Python programming language, and present the code and results here in this JupyterLab Notebook file.
We will use the usual libraries for data exploration, modeling and visualisation:
We will also use libraries specific to the goals of this project:
Let's download the data from the Globo.com dataset and look at what it contains.
The dataset is composed of the following files:

`clicks/clicks/`
: contains 385 CSV files

`clicks_hour_*.csv`
: contains one hour of clicks on the website

`articles_embeddings.pickle`
: pickle file containing the embeddings of the articles

`articles_metadata.csv`
: CSV file containing the metadata of the articles

Data profile reports:
To evaluate our models, we left out the last click of each user. We train our models on the remaining clicks. We then predict recommendations (ranking of all articles) for the next article the user will click. Our model score is the average predicted rank of the actual article the user has clicked last.
Type | Library | Model/Algo | Mean Rank (lower is better) | Predict time (s) |
---|---|---|---|---|
Content-based | scikit-learn | cosine similarity with mean of last click | 264 | 17 |
Content-based | scikit-learn | cosine similarity with mean of last session | 252 | 17 |
Content-based | scikit-learn | cosine similarity with mean of all clicks | 216 | 17 |
Collaborative | Surprise | BaselineOnly | 5972 | 1.68 |
Collaborative | Surprise | SVD | 51874 | 1.85 |
Collaborative | Implicit | AlternatingLeastSquares | 3.83 | 0.05 |
Hybrid | LightFM | LightFM | 228 | 0.17 |
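The mean-rank metric used in this table can be sketched as follows; the score matrix and ground-truth last clicks are hypothetical toy values, not outputs of the models above:

```python
import numpy as np

def mean_predicted_rank(score_matrix, last_clicked):
    """Average rank (1 = best) of each user's actually-clicked article."""
    ranks = []
    for scores, true_article in zip(score_matrix, last_clicked):
        order = np.argsort(-scores)  # article ids, best score first
        rank = int(np.where(order == true_article)[0][0]) + 1
        ranks.append(rank)
    return float(np.mean(ranks))

# Two users, three articles: user 0's true article is ranked 1st,
# user 1's is ranked 2nd, so the mean rank is 1.5
scores = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.4],
])
print(mean_predicted_rank(scores, [0, 2]))  # 1.5
```

A lower mean rank means the model places the article the user actually clicked closer to the top of its recommendations.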
## Download raw data
!cd .. && make dataset && cd notebooks
>>> Downloading and extracting data files... Data files already downloaded. >>> OK.
## Import libraries
from datetime import datetime
from pathlib import Path
import pandas as pd
import plotly.express as px
from pandas_profiling import ProfileReport
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tqdm import tqdm
import plotly.io as pio
pio.renderers.default = "notebook"
pd.options.plotting.backend = "plotly"
RAW_DATA_PATH = "../data/raw"
## Load and describe Articles Metadata
articles_metadata = pd.read_csv(
Path(RAW_DATA_PATH, "articles_metadata.csv"),
parse_dates=["created_at_ts"],
date_parser=lambda x: datetime.fromtimestamp(int(x) / 1000),
dtype={
"article_id": "category",
"category_id": "category",
"publisher_id": "category",
"words_count": "int",
},
)
articles_metadata = articles_metadata.astype({"created_at_ts": "datetime64[ns]"})
articles_metadata.describe(include="all", datetime_is_numeric=True)
article_id | category_id | created_at_ts | publisher_id | words_count | |
---|---|---|---|---|---|
count | 364047 | 364047 | 364047 | 364047 | 364047.000000 |
unique | 364047 | 461 | NaN | 1 | NaN |
top | 0 | 281 | NaN | 0 | NaN |
freq | 1 | 12817 | NaN | 364047 | NaN |
mean | NaN | NaN | 2016-09-17 01:25:54.949498624 | NaN | 190.897727 |
min | NaN | NaN | 2006-09-27 13:14:35 | NaN | 0.000000 |
25% | NaN | NaN | 2015-10-15 18:00:43.500000 | NaN | 159.000000 |
50% | NaN | NaN | 2017-03-13 17:27:29 | NaN | 186.000000 |
75% | NaN | NaN | 2017-11-05 15:09:11 | NaN | 218.000000 |
max | NaN | NaN | 2018-03-13 13:12:30 | NaN | 6690.000000 |
std | NaN | NaN | NaN | NaN | 59.502766 |
## Visualize Articles Categories distribution
articles_metadata["category_id"].value_counts().plot(
kind="bar",
labels={
"index": "Category ID",
"value": "Count",
},
color="value",
title="Distribution of categories",
)
## Visualize number of articles per category distribution
articles_metadata["category_id"].value_counts().plot(
kind="box",
x="category_id",
title="Distribution of categories counts",
labels={
"index": "Category ID",
"category_id": "Count",
},
notched=True,
points="suspectedoutliers",
)
## Visualize Articles Creation time distribution
articles_metadata.sample(frac=0.01)["created_at_ts"].plot(
kind="histogram",
title="Distribution of creation time (sampling = 1%)",
labels={
"value": "Creation time",
},
text_auto=True,
marginal="box",
)
## Visualize Articles Word count distribution
articles_metadata["words_count"].sample(frac=0.01).plot(
kind="histogram",
title="Distribution of words count (sampling = 1%)",
labels={
"value": "Words count",
},
text_auto=True,
marginal="box",
)
## Publish Articles Metadata ProfileReport
profile = ProfileReport(
articles_metadata,
title="Pandas Profiling Report",
explorative=True,
minimal=True,
)
profile.to_file(Path("../docs/articles_metadata_profile_report.html"))
## Load and describe Clicks data
clicks = pd.concat(
[
pd.read_csv(
click_file_path,
parse_dates=["session_start", "click_timestamp"],
date_parser=lambda x: datetime.fromtimestamp(int(int(x) / 1000)),
dtype={
"user_id": "category",
"session_id": "category",
"session_size": "int",
"click_article_id": "category",
"click_environment": "category",
"click_deviceGroup": "category",
"click_os": "category",
"click_country": "category",
"click_region": "category",
"click_referrer_type": "category",
},
).replace(
{
"click_environment": {
"1": "1 - Facebook Instant Article",
"2": "2 - Mobile App",
"3": "3 - AMP (Accelerated Mobile Pages)",
"4": "4 - Web",
},
"click_deviceGroup": {
"1": "1 - Tablet",
"2": "2 - TV",
"3": "3 - Empty",
"4": "4 - Mobile",
"5": "5 - Desktop",
},
"click_os": {
"1": "1 - Other",
"2": "2 - iOS",
"3": "3 - Android",
"4": "4 - Windows Phone",
"5": "5 - Windows Mobile",
"6": "6 - Windows",
"7": "7 - Mac OS X",
"8": "8 - Mac OS",
"9": "9 - Samsung",
"10": "10 - FireHbbTV",
"11": "11 - ATV OS X",
"12": "12 - tvOS",
"13": "13 - Chrome OS",
"14": "14 - Debian",
"15": "15 - Symbian OS",
"16": "16 - BlackBerry OS",
"17": "17 - Firefox OS",
"18": "18 - Android",
"19": "19 - Brew MP",
"20": "20 - Chromecast",
"21": "21 - webOS",
"22": "22 - Gentoo",
"23": "23 - Solaris",
},
}
)
for click_file_path in tqdm(
sorted(Path(RAW_DATA_PATH, "clicks/clicks").glob("clicks_hour_*.csv"))
)
],
sort=False,
ignore_index=True,
verify_integrity=True,
)
clicks = clicks.astype(
{"session_start": "datetime64[ns]", "click_timestamp": "datetime64[ns]"}
)
clicks.describe(include="all", datetime_is_numeric=True)
100%|██████████| 385/385 [02:12<00:00, 2.91it/s]
user_id | session_id | session_start | session_size | click_article_id | click_timestamp | click_environment | click_deviceGroup | click_os | click_country | click_region | click_referrer_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2988181 | 2988181 | 2988181 | 2.988181e+06 | 2988181 | 2988181 | 2988181 | 2988181 | 2988181 | 2988181 | 2988181 | 2988181 |
unique | 322897 | 1048594 | NaN | NaN | 46033 | NaN | 3 | 5 | 8 | 11 | 28 | 7 |
top | 5890 | 1507563657895091 | NaN | NaN | 160974 | NaN | 4 - Web | 1 - Tablet | 17 - Firefox OS | 1 | 25 | 2 |
freq | 1232 | 124 | NaN | NaN | 37213 | NaN | 2904478 | 1823162 | 1738138 | 2852406 | 804985 | 1602601 |
mean | NaN | NaN | 2017-10-08 16:17:08.013155328 | 3.901885e+00 | NaN | 2017-10-08 16:51:05.070374400 | NaN | NaN | NaN | NaN | NaN | NaN |
min | NaN | NaN | 2017-10-01 04:37:03 | 2.000000e+00 | NaN | 2017-10-01 05:00:00 | NaN | NaN | NaN | NaN | NaN | NaN |
25% | NaN | NaN | 2017-10-04 15:35:52 | 2.000000e+00 | NaN | 2017-10-04 16:20:52 | NaN | NaN | NaN | NaN | NaN | NaN |
50% | NaN | NaN | 2017-10-08 22:09:00 | 3.000000e+00 | NaN | 2017-10-08 22:35:30 | NaN | NaN | NaN | NaN | NaN | NaN |
75% | NaN | NaN | 2017-10-11 21:16:54 | 4.000000e+00 | NaN | 2017-10-11 21:43:24 | NaN | NaN | NaN | NaN | NaN | NaN |
max | NaN | NaN | 2017-10-17 05:36:19 | 1.240000e+02 | NaN | 2017-11-13 21:04:14 | NaN | NaN | NaN | NaN | NaN | NaN |
std | NaN | NaN | NaN | 3.929941e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
## Visualize Click sessions distribution over time
clicks.sample(frac=0.01)["session_start"].plot(
kind="histogram",
title="Distribution of sessions (sampling = 1%)",
labels={
"value": "Sessions",
},
text_auto=True,
marginal="box",
)
## Visualize Clicks distribution over time
clicks.sample(frac=0.01)["click_timestamp"].plot(
kind="histogram",
title="Distribution of clicks (sampling = 1%)",
labels={
"value": "Clicks",
},
text_auto=True,
marginal="box",
)
## Visualize Clicks Sessions size
clicks.sample(frac=0.01)["session_size"].plot(
kind="histogram",
title="Distribution of session sizes (sampling = 1%)",
labels={
"value": "Session size",
},
text_auto=True,
marginal="box",
)
## Visualize number of clicked articles per user (user engagement)
clicks.sample(frac=0.01).groupby("user_id").agg(
    COUNT_unique_article=("click_article_id", "nunique"),
).plot(
kind="histogram",
title="Distribution of number of clicked articles (sampling = 1%)",
labels={
"value": "Number of clicked articles",
},
text_auto=True,
marginal="box",
)
## Visualize click context : user environment
fig = px.parallel_categories(
clicks.sample(frac=0.01),
dimensions=["click_environment", "click_deviceGroup", "click_os"],
title="Distribution of Environment x Device Group x OS (sampling = 1%)",
labels={
"click_environment": "Environment",
"click_deviceGroup": "Device group",
"click_os": "OS",
},
)
fig.show()
## Visualize click context : user geolocation
fig = px.parallel_categories(
clicks.sample(frac=0.01),
dimensions=["click_country", "click_region"],
title="Distribution of Country x Region (sampling = 1%)",
labels={
"click_country": "Country",
"click_region": "Region",
},
)
fig.show()
## Visualize click context : user referrer
clicks.sample(frac=0.01)["click_referrer_type"].plot(
kind="histogram",
title="Distribution of referrer types (sampling = 1%)",
labels={
"value": "Referrer type",
},
category_orders={
"value": [str(i) for i in range(1, 8)],
},
text_auto=True,
)
## Publish Clicks Metadata ProfileReport
profile = ProfileReport(
clicks, title="Pandas Profiling Report", explorative=True, minimal=True
)
profile.to_file(Path("../docs/clicks_profile_report.html"))
## Load and describe Articles Embeddings data
articles_embeddings = pd.read_pickle(Path(RAW_DATA_PATH, "articles_embeddings.pickle"))
articles = pd.DataFrame(
articles_embeddings,
columns=["embedding_" + str(i) for i in range(articles_embeddings.shape[1])],
)
articles["words_count"] = articles_metadata["words_count"]
articles["category_id"] = articles_metadata["category_id"]
articles["article_id"] = articles_metadata["article_id"]
articles.describe(include="all", datetime_is_numeric=True)
articles_sample = articles.sample(frac=0.01)
## Visualize Articles Embeddings in 2D PCA
pca = PCA(n_components=2)
articles_pca = pca.fit_transform(
articles_sample[
["embedding_" + str(i) for i in range(articles_embeddings.shape[1])]
]
)
# Plot the data in the PCA space
fig = px.scatter(
x=articles_pca[:, 0],
y=articles_pca[:, 1],
color=articles_sample["category_id"],
symbol=articles_sample["category_id"],
title="PCA 2D",
width=1200,
height=800,
)
fig.show()
## Visualize Articles Embeddings in 2D t-SNE
tsne = TSNE(n_components=2)
articles_tsne = tsne.fit_transform(
articles_sample[
["embedding_" + str(i) for i in range(articles_embeddings.shape[1])]
]
)
# Plot the data in the t-SNE space
fig = px.scatter(
x=articles_tsne[:, 0],
y=articles_tsne[:, 1],
color=articles_sample["category_id"],
symbol=articles_sample["category_id"],
title="t-SNE 2D",
width=1200,
height=800,
)
fig.show()