Fly Me is a travel agency offering all-inclusive booking services for private and professional customers.
In this project, we want to create a chatbot that helps users book their trips. This task involves several challenges: language understanding (identifying the user's intent), entity extraction (identifying the relevant information), and user experience (interacting in a relevant and appropriate manner).
We will use the Azure Frames Dataset to train a bot to understand the user's intent during a dialog and to identify the relevant entities.
In our case, the intent is the booking of a trip, and the entities we are looking for are:
- or_city (origin city)
- dst_city (destination city)
- str_date (start date)
- end_date (end date)
- budget

In this notebook, we will simply perform an Exploratory Data Analysis of our dataset.
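As a side illustration of the intent and entities listed above, the target of the bot can be pictured as a small record with one optional field per entity. This is only a sketch: the `BookingRequest` class and the `missing_entities` helper are hypothetical names introduced here, not part of the project code.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BookingRequest:
    """Hypothetical container for the five entities of a 'book' intent."""

    or_city: Optional[str] = None   # origin city
    dst_city: Optional[str] = None  # destination city
    str_date: Optional[str] = None  # start date
    end_date: Optional[str] = None  # end date
    budget: Optional[str] = None    # budget, kept as text (an assumption)


def missing_entities(request: BookingRequest) -> List[str]:
    """Return the names of the entities that are still unknown."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]


request = BookingRequest(or_city="Caprica", dst_city="Atlantis", budget="1700.0")
print(missing_entities(request))  # ['str_date', 'end_date']
```

A bot built along these lines could keep asking questions until `missing_entities` returns an empty list.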
This is the project architecture in production:
## Download and extract dataset files
!cd .. && make dataset && cd notebooks
>>> Downloading and saving data files... Data files already downloaded. >>> OK.
## Import and configure libraries
import json
import warnings
from pathlib import Path
import modin.pandas as pd
import pandas
import plotly.io as pio
from pandas_profiling import ProfileReport
from tqdm import tqdm
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
pio.renderers.default = "notebook"
pd.options.plotting.backend = "plotly"
# Set constants
DATA_PATH = Path("../data")
FRAMES_JSON_PATH = Path(DATA_PATH, "raw/frames.json")
## Load the dataset into a pandas DataFrame (in memory)
raw_data = pd.read_json(FRAMES_JSON_PATH)
raw_data.describe(include="all")
| | user_id | turns | wizard_id | id | labels |
---|---|---|---|---|---|
count | 1369 | 1369 | 1369 | 1369 | 1369 |
unique | 11 | 1369 | 12 | 1369 | 16 |
top | U22K1SX9N | [{'text': 'I'd like to book a trip to Atlantis... | U21T9NMKM | e2c0fc6c-2134-4891-8353-ef16d8412c9a | {'userSurveyRating': 5.0, 'wizardSurveyTaskSuc... |
freq | 345 | 1 | 301 | 1 | 929 |
The dataset is composed of 1369 annotated dialogs, each made up of several turns, between a human user trying to book a trip and a wizard playing the role of the bot.
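To make this structure concrete, here is a minimal sketch that counts the turns of each author in one dialog. The sample record is hand-written to mimic the fields seen in this notebook (`id`, `user_id`, `wizard_id`, `labels`, `turns`, and per-turn `author`, `text`, `labels`); its values are illustrative, not taken from the dataset, and `turns_per_author` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict

# Hand-written sample mimicking one record of frames.json (illustrative values).
sample_dialog = {
    "id": "dialog-0",
    "user_id": "user-0",
    "wizard_id": "wizard-0",
    "labels": {"userSurveyRating": 5.0, "wizardSurveyTaskSuccessful": True},
    "turns": [
        {"author": "user", "text": "I'd like to book a trip.", "labels": {"frames": []}},
        {"author": "wizard", "text": "Where would you like to go?", "labels": {"frames": []}},
        {"author": "user", "text": "To Atlantis.", "labels": {"frames": []}},
    ],
}


def turns_per_author(dialog: Dict) -> Dict[str, int]:
    """Count how many turns each author contributed to a dialog."""
    return dict(Counter(turn["author"] for turn in dialog["turns"]))


print(turns_per_author(sample_dialog))  # {'user': 2, 'wizard': 1}
```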
frames = raw_data[["id", "wizard_id", "user_id"]]
frames[["userSurveyRating", "wizardSurveyTaskSuccessful"]] = [
    [x["userSurveyRating"], x["wizardSurveyTaskSuccessful"]]
    for x in raw_data.labels
]
frames = frames.astype(
    {"userSurveyRating": "float", "wizardSurveyTaskSuccessful": "bool"}
)
frames.describe(include="all")
| | id | wizard_id | user_id | userSurveyRating | wizardSurveyTaskSuccessful |
---|---|---|---|---|---|
count | 1369 | 1369 | 1369 | 1366.000000 | 1369 |
unique | 1369 | 12 | 11 | NaN | 2 |
top | e2c0fc6c-2134-4891-8353-ef16d8412c9a | U21T9NMKM | U22K1SX9N | NaN | True |
freq | 1 | 301 | 345 | NaN | 1287 |
mean | NaN | NaN | NaN | 4.573419 | NaN |
std | NaN | NaN | NaN | 0.839596 | NaN |
min | NaN | NaN | NaN | 1.000000 | NaN |
25% | NaN | NaN | NaN | 4.000000 | NaN |
50% | NaN | NaN | NaN | 5.000000 | NaN |
75% | NaN | NaN | NaN | 5.000000 | NaN |
max | NaN | NaN | NaN | 5.000000 | NaN |
# Each element of raw_data["turns"] is the list of turns of one dialog.
for turn in raw_data["turns"]:
    print()
    print("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
    print()

    # Facts accumulated so far in the current dialog
    known_facts = {}
    for i, frame in enumerate(turn):
        print(f'{i} - { frame["author"] } says : \n"{ frame["text"] }"')

        # Keep the last annotated value of each slot (None if it is negated)
        known_facts.update(
            {
                info_key: info[-1]["val"] if not info[-1]["negated"] else None
                for f in frame["labels"]["frames"]
                for info_key, info in f["info"].items()
            }
        )
        print(f"Known facts : \n{known_facts}")
        print()
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

0 - user says :
"I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700."
Known facts :
{'intent': 'book', 'budget': '1700.0', 'dst_city': 'Atlantis', 'or_city': 'Caprica', 'str_date': 'august 13', 'n_adults': '8'}

1 - wizard says :
"Hi...I checked a few options for you, and unfortunately, we do not currently have any trips that meet this criteria. Would you like to book an alternate travel option?"
Known facts :
{'intent': 'book', 'budget': '1700.0', 'dst_city': 'Atlantis', 'or_city': 'Caprica', 'str_date': 'august 13', 'n_adults': '8', 'NO_RESULT': True}

2 - user says :
"Yes, how about going to Neverland from Caprica on August 13, 2016 for 5 adults. For this trip, my budget would be 1900."
Known facts :
{'intent': 'book', 'budget': '1900.0', 'dst_city': 'Neverland', 'or_city': 'Caprica', 'str_date': 'august 13', 'n_adults': '5', 'NO_RESULT': True}

3 - wizard says :
"I checked the availability for this date and there were no trips available. Would you like to select some alternate dates?"
Known facts :
{'intent': 'book', 'budget': '1900.0', 'dst_city': 'Neverland', 'or_city': 'Caprica', 'str_date': 'august 13', 'n_adults': '5', 'NO_RESULT': True}

4 - user says :
"I have no flexibility for dates... but I can leave from Atlantis rather than Caprica. How about that?"
Known facts :
{'intent': 'book', 'budget': '1700.0', 'dst_city': 'Atlantis', 'or_city': 'Atlantis', 'str_date': 'august 13', 'n_adults': '8', 'NO_RESULT': True, 'flex': False}

5 - wizard says :
"I checked the availability for that date and there were no trips available. Would you like to select some alternate dates?"
Known facts :
{'intent': 'book', 'budget': '1700.0', 'dst_city': 'Atlantis', 'or_city': 'Atlantis', 'str_date': 'august 13', 'n_adults': '8', 'NO_RESULT': True, 'flex': False}

6 - user says :
"I suppose I'll speak with my husband to see if we can choose other dates, and then I'll come back to you.Thanks for your help"
Known facts :
{'intent': 'book', 'budget': '1700.0', 'dst_city': 'Atlantis', 'or_city': 'Atlantis', 'str_date': 'august 13', 'n_adults': '8', 'NO_RESULT': True, 'flex': False}

...
We can see the identified intent and extracted entities for each turn.
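The accumulation of known facts used above can be isolated in a small pure function, which makes the rule easier to test: for each slot, the last annotated value wins, and a negated value is stored as `None`. The sketch below is an illustration under the assumption that a turn has the `labels -> frames -> info` layout used in this notebook; `update_known_facts` is a hypothetical helper and the two sample turns are hand-written.

```python
from typing import Any, Dict


def update_known_facts(known_facts: Dict[str, Any], turn: Dict) -> Dict[str, Any]:
    """Return a copy of known_facts updated with the annotations of one turn.

    For every slot, only the last annotated value is kept; a negated value
    is stored as None.
    """
    updated = dict(known_facts)
    for frame in turn["labels"]["frames"]:
        for slot, values in frame["info"].items():
            last = values[-1]
            updated[slot] = None if last["negated"] else last["val"]
    return updated


# Hand-written turns (illustrative, not copied from the dataset).
turn_1 = {"labels": {"frames": [{"info": {
    "dst_city": [{"val": "Atlantis", "negated": False}],
    "budget": [{"val": "1700.0", "negated": False}],
}}]}}
turn_2 = {"labels": {"frames": [{"info": {
    "dst_city": [{"val": "Atlantis", "negated": True}],
}}]}}

facts = update_known_facts({}, turn_1)
facts = update_known_facts(facts, turn_2)
print(facts)  # {'dst_city': None, 'budget': '1700.0'}
```

Returning a new dictionary instead of mutating the input keeps the "before" and "after" states available side by side.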
if Path(DATA_PATH, "processed/turns.csv").exists():
    turns = pd.read_csv(Path(DATA_PATH, "processed/turns.csv"))
else:
    # Collect one dict per user turn and build the DataFrame in one go:
    # DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
    rows = []
    for turn in tqdm(raw_data["turns"]):
        known_facts = {}
        for frame in turn:
            # Only user turns are kept; wizard turns are skipped
            if frame["author"] == "wizard":
                continue

            turn_dict = {
                "text": frame["text"],
            }
            # Facts known before this user turn
            turn_dict.update(
                {f"old_{key}": value for key, value in known_facts.items()}
            )
            known_facts.update(
                {
                    info_key: info[-1]["val"]
                    if not info[-1]["negated"]
                    else None
                    for f in frame["labels"]["frames"]
                    for info_key, info in f["info"].items()
                }
            )
            # Facts known after this user turn
            turn_dict.update(
                {f"new_{key}": value for key, value in known_facts.items()}
            )
            rows.append(turn_dict)

    turns = pd.DataFrame(rows)
    turns.to_csv(Path(DATA_PATH, "processed/turns.csv"), index=False)

turns
| | text | new_intent | new_budget | new_dst_city | new_or_city | new_str_date | new_n_adults | old_intent | old_budget | old_dst_city | ... | new_count_seat | old_count_seat | new_dst_city_ok | old_dst_city_ok | new_impl_anaphora | old_impl_anaphora | new_str_date_ok | new_end_date_ok | old_str_date_ok | old_end_date_ok |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I'd like to book a trip to Atlantis from Capri... | book | 1700.0 | Atlantis | Caprica | august 13 | 8 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Yes, how about going to Neverland from Caprica... | book | 1900.0 | Neverland | Caprica | august 13 | 5 | book | 1700.0 | Atlantis | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | I have no flexibility for dates... but I can l... | book | 1700.0 | Atlantis | Atlantis | august 13 | 8 | book | 1900.0 | Neverland | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | I suppose I'll speak with my husband to see if... | book | 1700.0 | Atlantis | Atlantis | august 13 | 8 | book | 1700.0 | Atlantis | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Hello, I am looking to book a vacation from Go... | book | 2100.0 | Mos Eisley | Gotham City | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10402 | 5 adults and 7 kids! Yup, the lot of us. We wa... | book | 32800.0 | -1 | Tampa | NaN | 5 | book | NaN | -1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10403 | Oh yes! Between September 12 and 26! | book | 32800.0 | -1 | Tampa | september 12 | 5 | book | 32800.0 | -1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10404 | That sounds amazing, and it's within those dat... | book | 32800.0 | Queenstown | Tampa | september 12 | 5 | book | 32800.0 | -1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10405 | Ok perfect, book me! | book | 32800.0 | Queenstown | Tampa | september 12 | 5 | book | 32800.0 | Queenstown | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10406 | Thanks! | book | 32800.0 | Queenstown | Tampa | september 12 | 5 | book | 32800.0 | Queenstown | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10407 rows x 120 columns
turns.describe(include="all")
| | text | new_intent | new_budget | new_dst_city | new_or_city | new_str_date | new_n_adults | old_intent | old_budget | old_dst_city | ... | new_count_seat | old_count_seat | new_dst_city_ok | old_dst_city_ok | new_impl_anaphora | old_impl_anaphora | new_str_date_ok | new_end_date_ok | old_str_date_ok | old_end_date_ok |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10407 | 9362 | 6229 | 9631 | 9620 | 7430 | 5570 | 8078 | 5255 | 8307 | ... | 7 | 6 | 8 | 7 | 5 | 4 | 3 | 3 | 2 | 2 |
unique | 9695 | 1 | 228 | 392 | 339 | 155 | 57 | 1 | 225 | 382 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
top | Thanks! | book | -1 | Punta Cana | -1 | -1 | 1 | book | -1 | -1 | ... | two | two | True | True | category | category | True | True | True | True |
freq | 73 | 9362 | 1704 | 283 | 174 | 655 | 2462 | 8078 | 1469 | 257 | ... | 7 | 6 | 8 | 7 | 5 | 4 | 3 | 3 | 2 | 2 |
4 rows x 120 columns
columns = ["text"] + [
f"{prefix}_{key}"
for key in ["or_city", "dst_city", "str_date", "end_date", "budget"]
for prefix in ["old", "new"]
]
data = turns[columns]
data
| | text | old_or_city | new_or_city | old_dst_city | new_dst_city | old_str_date | new_str_date | old_end_date | new_end_date | old_budget | new_budget |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | I'd like to book a trip to Atlantis from Capri... | NaN | Caprica | NaN | Atlantis | NaN | august 13 | NaN | NaN | NaN | 1700.0 |
1 | Yes, how about going to Neverland from Caprica... | Caprica | Caprica | Atlantis | Neverland | august 13 | august 13 | NaN | NaN | 1700.0 | 1900.0 |
2 | I have no flexibility for dates... but I can l... | Caprica | Atlantis | Neverland | Atlantis | august 13 | august 13 | NaN | NaN | 1900.0 | 1700.0 |
3 | I suppose I'll speak with my husband to see if... | Atlantis | Atlantis | Atlantis | Atlantis | august 13 | august 13 | NaN | NaN | 1700.0 | 1700.0 |
4 | Hello, I am looking to book a vacation from Go... | NaN | Gotham City | NaN | Mos Eisley | NaN | NaN | NaN | NaN | NaN | 2100.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10402 | 5 adults and 7 kids! Yup, the lot of us. We wa... | Tampa | Tampa | -1 | -1 | NaN | NaN | NaN | NaN | NaN | 32800.0 |
10403 | Oh yes! Between September 12 and 26! | Tampa | Tampa | -1 | -1 | NaN | september 12 | NaN | 26 | 32800.0 | 32800.0 |
10404 | That sounds amazing, and it's within those dat... | Tampa | Tampa | -1 | Queenstown | september 12 | september 12 | 26 | 26 | 32800.0 | 32800.0 |
10405 | Ok perfect, book me! | Tampa | Tampa | Queenstown | Queenstown | september 12 | september 12 | 26 | 25 | 32800.0 | 32800.0 |
10406 | Thanks! | Tampa | Tampa | Queenstown | Queenstown | september 12 | september 12 | 25 | 25 | 32800.0 | 32800.0 |
10407 rows x 11 columns
data.describe(include="all")
| | text | old_or_city | new_or_city | old_dst_city | new_dst_city | old_str_date | new_str_date | old_end_date | new_end_date | old_budget | new_budget |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 10407 | 8287 | 9620 | 8307 | 9631 | 6287 | 7430 | 4787 | 5734 | 5255 | 6229 |
unique | 9695 | 332 | 339 | 382 | 392 | 151 | 155 | 129 | 131 | 225 | 228 |
top | Thanks! | -1 | -1 | -1 | Punta Cana | -1 | -1 | -1 | -1 | -1 | -1 |
freq | 73 | 158 | 174 | 257 | 283 | 567 | 655 | 344 | 404 | 1469 | 1704 |
For each user turn, we can see what the bot has inferred given what it previously knew and what the user said.
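One way to read these `old_*` / `new_*` pairs is to list, for a given user turn, which entities changed. The sketch below works on a plain dictionary so that it stays independent of pandas; `changed_entities` and `is_missing` are hypothetical helpers, and the sample row reuses the values of row 1 of the table above, with the budget assumed to be stored as text.

```python
import math
from typing import Any, Dict, List

ENTITIES = ["or_city", "dst_city", "str_date", "end_date", "budget"]


def is_missing(value: Any) -> bool:
    """Treat None and NaN as 'no value'."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def changed_entities(row: Dict[str, Any]) -> List[str]:
    """List the entities whose value differs between old_* and new_*."""
    changed = []
    for entity in ENTITIES:
        old, new = row.get(f"old_{entity}"), row.get(f"new_{entity}")
        if is_missing(old) and is_missing(new):
            continue  # unknown before and after: nothing changed
        if is_missing(old) or is_missing(new) or old != new:
            changed.append(entity)
    return changed


# Values of row 1 of the table above (budget type is an assumption).
row = {
    "text": "Yes, how about going to Neverland from Caprica...",
    "old_or_city": "Caprica", "new_or_city": "Caprica",
    "old_dst_city": "Atlantis", "new_dst_city": "Neverland",
    "old_str_date": "august 13", "new_str_date": "august 13",
    "old_end_date": float("nan"), "new_end_date": float("nan"),
    "old_budget": "1700.0", "new_budget": "1900.0",
}
print(changed_entities(row))  # ['dst_city', 'budget']
```

Applied row by row, such a helper would show which entities each user utterance actually updates.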
## Publish the dataset ProfileReport
profile = ProfileReport(
    pandas.DataFrame(data),
    title="Pandas Profiling Report",
    explorative=True,
    minimal=True,
)
profile.to_file(Path("../docs/profile_report.html"))
The dataset profile report is available online: dataset profile report