This repository is the solution for a technical test given by a company during the interview process for a Senior Data Engineer position.
The test has two parts:
- we need to answer a set of questions about the data presented below. The questions are listed in challenge/QUESTIONS.md;
- we need to propose an architecture for a generalization of this challenge. More details in challenge/ARCHITECTURE.md.
The company has developed a chatbot that registers a person's favourite Pokémon and some basic information.
We have extracted 50,000 user conversations with the chatbot and stored them in a CSV file separated by `;`. This file is saved in a public S3 bucket and can be accessed at `s3://---REDACTED---`.[^1]
Each line of that file is a message with the following information (a sketch of how this schema could be read with PySpark follows the list):
- `Timestamp`: date & time of the message in Unix epoch;
- `MessageId`: unique `uuid` for each message;
- `ConversationId`: unique `uuid` for each conversation;
- `UserId`: an identifier for the user involved in the conversation;
- `MessageText`: the message text;
- `Channel`: channel where the message was sent;
- `BotId`: an identifier for the bot involved in the conversation;
- `Source`: whether the message was sent by a `bot` or a `user`.
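As a rough illustration, this is how the file could be loaded with PySpark. The column types, and the idea of parsing `Timestamp` as epoch seconds, are assumptions on our part rather than something dictated by the repository; check them against challenge/DATA.md.

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("pokemon-chatbot").getOrCreate()

# Assumed schema; the real column types should be checked against challenge/DATA.md.
schema = T.StructType([
    T.StructField("Timestamp", T.LongType()),         # Unix epoch
    T.StructField("MessageId", T.StringType()),       # uuid
    T.StructField("ConversationId", T.StringType()),  # uuid
    T.StructField("UserId", T.StringType()),
    T.StructField("MessageText", T.StringType()),
    T.StructField("Channel", T.StringType()),
    T.StructField("BotId", T.StringType()),
    T.StructField("Source", T.StringType()),          # "bot" or "user"
])

# "path/to/messages.csv" is a placeholder; the real location comes from S3_FILE_PATH (see below).
messages = (
    spark.read
    .option("header", True)
    .option("sep", ";")
    .schema(schema)
    .csv("path/to/messages.csv")
    .withColumn("MessageTime", F.from_unixtime("Timestamp").cast("timestamp"))
)
```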
More information about the data can be found in the document challenge/DATA.md.
For non-Portuguese speakers, here are the chatbot questions translated into English (a sketch of how these prompts could be mapped to fields follows the table):
| Portuguese | English |
|---|---|
| Quais são seus pokemons favoritos? | What are your favourite Pokémon? |
| Qual seu nome? | What is your name? |
| Qual a sua cidade? | Where do you live? |
| Qual a sua idade? | How old are you? |
| Olá, eu sou o robô de cadastro de pokemon favorito, vamos começar | Hello, I'm the Favourite Pokémon Chatbot, let's begin! |
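The analysis needs to associate each bot prompt with the field it collects from the user. A minimal sketch of such a mapping is shown below; both the dictionary name and the idea of matching these exact strings against `MessageText` are assumptions, since the real matching logic lives in the notebooks.

```python
# Hypothetical mapping from bot prompt (as it appears in MessageText) to the field it collects.
PROMPT_TO_FIELD = {
    "Quais são seus pokemons favoritos?": "favourite_pokemon",
    "Qual seu nome?": "name",
    "Qual a sua cidade?": "city",
    "Qual a sua idade?": "age",
}
```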
The final answer for each question can be seen in the answers/ANSWERS.md file. The proposed architecture can be seen in the answers/ARCHITECTURE.md file.
If you wish to see how we obtained the answers, you need to run two Python notebooks, both located in the `notebook` folder:
- the `data_treatment` notebook, which should run first, since it downloads the data and does some necessary data cleaning. You need a `.env` file in this directory (see the section .env File);
- the `analysis` notebook, which makes use of the cleaned dataset produced by the `data_treatment` notebook and is where we do all the calculations needed to answer the questions (a sketch of this hand-off is shown below).
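The hand-off between the two notebooks could look like the sketch below. The intermediate Parquet path is a placeholder chosen for illustration, and the actual cleaning steps are of course more involved than shown here.

```python
# End of the data_treatment notebook: persist the cleaned dataset.
# "messages" is the DataFrame from the read sketch above; the cleaning steps are omitted,
# and "data/cleaned_messages" is a placeholder path, not the one used in the repository.
messages.write.mode("overwrite").parquet("data/cleaned_messages")

# Start of the analysis notebook: reload the cleaned dataset.
cleaned = spark.read.parquet("data/cleaned_messages")
```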
To run these notebooks, we made use of a local Spark cluster and a local JupyterLab server. This infrastructure requires `docker` and can be set up by running `make spark` (if you are running Linux) or the equivalent command `docker-compose up -d --build`.
Once the containers start, you can go to `localhost:8889` and you will see the JupyterLab interface.
Then, just open the `data_treatment` notebook and it should run seamlessly.
One final and important observation: to run the `data_treatment` notebook, you need a `.env` file with a variable named `S3_FILE_PATH` in it (see the `.env.example` file). We use this environment variable to load the data. The `.env` file was (obviously) omitted from this repository, since it contains the original location of the file.
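A minimal sketch of how the variable could be consumed, assuming the notebook reads it with `python-dotenv` (the actual loading code lives in the `data_treatment` notebook):

```python
import os

from dotenv import load_dotenv

# Read S3_FILE_PATH from the .env file into the environment.
load_dotenv()
s3_file_path = os.environ["S3_FILE_PATH"]

# Reuse the reader from the schema sketch above, now pointing at the real file.
messages = (
    spark.read
    .option("header", True)
    .option("sep", ";")
    .schema(schema)
    .csv(s3_file_path)
)
```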
We made two deliberate choices when building the general solution for this challenge:
- the first one is obvious: we chose to use Apache Spark as the processing engine;
- we chose to use `SQL` as much as we could.

Since Spark is cloud-agnostic, our solution can run on `AWS`, `GCP`, other clouds, or even on an on-premises cluster. Also, since `SQL` is a well-defined language, our work can be easily translated to different data platforms or data warehouses (e.g. BigQuery, Snowflake and, obviously, Databricks). This means that we could use dbt as a framework, for example.
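To make the "use SQL as much as we could" choice concrete, here is a small sketch of the style: register the cleaned DataFrame as a temporary view and answer questions with plain Spark SQL. The query below (user messages per channel) is an illustration we made up, not one of the challenge questions.

```python
# "cleaned" is the DataFrame produced by the data_treatment notebook (see the sketches above).
cleaned.createOrReplaceTempView("messages")

# An illustrative query: how many user messages were sent on each channel.
messages_per_channel = spark.sql("""
    SELECT Channel,
           COUNT(*) AS user_messages
    FROM messages
    WHERE Source = 'user'
    GROUP BY Channel
    ORDER BY user_messages DESC
""")
messages_per_channel.show()
```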
As one can note, you need both `docker` and `docker-compose`.
[^1]: Redacted to protect the company's identity.