Bungalow Take Home Project for Analytics Engineer Role (V1. 2022-03-03)

Welcome to the Bungalow Takehome Challenge for Data Analytics! This is a barebones repo to get you started.

What to build

A common task for analytics engineers at Bungalow is modelling data from internal datasets, storing it, and making it available to downstream teams such as finance and product, and ultimately to the entire organization. This challenge gives you a brief snapshot of what a common workload may entail. Of course, this could become a big task, so to save you time we did some of the heavy lifting, such as the setup and some scaffolding of the environment.

For this challenge we will collect the current weather data from OpenWeatherMap. The free API will work for this assignment. You shouldn’t pay for the API key.
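As a rough illustration of what a current-weather call involves, here is a minimal sketch that builds a request URL and flattens a response into one record. It assumes the public `data/2.5/weather` endpoint and its `id`, `name`, `dt`, and `main` fields; the function names and the sample payload values are hypothetical and are not taken from the provided fetcher.

```python
from urllib.parse import urlencode

# Assumed endpoint for the free current-weather API.
API_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_request_url(city, api_key, units="metric"):
    """Build the request URL for one city (hypothetical helper)."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{API_URL}?{query}"


def parse_current_weather(payload):
    """Keep only the fields this sketch cares about from one API response."""
    return {
        "city_id": payload["id"],
        "city_name": payload["name"],
        "observed_at": payload["dt"],  # Unix timestamp, UTC
        "temperature": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
    }


# Made-up payload shaped like an API response, so no network call is needed.
sample_payload = {
    "id": 1,
    "name": "Toronto",
    "dt": 1646312400,
    "main": {"temp": 21.5, "humidity": 40},
}

url = build_request_url("Toronto,CA", "demo-key")
reading = parse_current_weather(sample_payload)
print(url)
print(reading)
```

In a real run, the JSON returned by an HTTP GET on the URL would be passed to `parse_current_weather` in place of the sample payload.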

Please install Docker Desktop on your laptop. It will contain the environment that we would need for the next steps.

The Docker Compose file defines two applications and the simple setup they require:

  • Airflow: To run your additions to the boilerplate DAGs.

  • Postgres: To maintain your tables. (You can swap it for any other database of your choice, e.g. SQLite or MySQL.)

Below are the steps in the data flow diagram:

  • The fetcher.py script, which represents the Fetcher DAG, retrieves the data from the current weather API.

  • The fetcher script processes and cleans the data, then stores it in the Postgres database with relationships, integrity, performance, and extendability in mind. We made a basic version of the fetcher Python script to save you time. You can improve the fetcher if you need to. For example, you will need to add 10 more cities to the list of cities for the next step.

  • The transformer.py script, which represents the Transformer DAG, transforms the data from the previous step to prepare some derived dataset tables. You can implement the transformations in either Python or SQL. This is the main part of your takehome challenge.

  • The Transformer writes the datasets back to Postgres.

  • The downstream customer(s) read both the original and the derived tables. They will execute historical queries to run analytics and data science models.
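The storage step above can be sketched as follows. This is only one possible layout, shown with Python's built-in `sqlite3` so it runs without Docker; the table names, column names, and the `store_reading` helper are assumptions for illustration, not the schema expected by the repository.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical two-table layout: one row per city, one row per reading.
conn.executescript(
    """
    CREATE TABLE city (
        city_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        state   TEXT
    );
    CREATE TABLE weather_reading (
        city_id     INTEGER NOT NULL REFERENCES city (city_id),
        observed_at INTEGER NOT NULL,  -- Unix timestamp, UTC
        temperature REAL NOT NULL,
        humidity    INTEGER NOT NULL,
        PRIMARY KEY (city_id, observed_at)
    );
    """
)


def store_reading(conn, reading):
    """Insert a reading; re-running the same fetch does not duplicate rows."""
    conn.execute(
        "INSERT OR IGNORE INTO city (city_id, name) VALUES (?, ?)",
        (reading["city_id"], reading["city_name"]),
    )
    conn.execute(
        "INSERT OR IGNORE INTO weather_reading "
        "(city_id, observed_at, temperature, humidity) VALUES (?, ?, ?, ?)",
        (
            reading["city_id"],
            reading["observed_at"],
            reading["temperature"],
            reading["humidity"],
        ),
    )
    conn.commit()


# Made-up sample reading, stored twice to show the insert is idempotent.
sample = {
    "city_id": 1,
    "city_name": "Toronto",
    "observed_at": 1646312400,
    "temperature": 21.5,
    "humidity": 40,
}
store_reading(conn, sample)
store_reading(conn, sample)

reading_count = conn.execute("SELECT COUNT(*) FROM weather_reading").fetchone()[0]
city_count = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
print(reading_count, city_count)
```

The composite primary key on `(city_id, observed_at)` is the design choice doing the work here: it keeps repeated DAG runs from storing the same observation twice. In Postgres, `INSERT ... ON CONFLICT DO NOTHING` plays the role of SQLite's `INSERT OR IGNORE`.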

This project is meant to be flexible so that you can showcase your decision-making capabilities and your overall technical experience.

Note: If you are uncomfortable with Docker, Postgres or Airflow, please feel free to remove or replace them. They are meant to save you time. As long as you can achieve the outcome, feel free to use any additional tooling, programming language (e.g. Java or Scala) and approach you see fit. We will ask questions about your decision-making in the follow-up conversation.

We are more interested in seeing your thought process and approach to solving the problem!

Deliverables

We will expect to see the following items in your GitHub pull request:

  • Your Python code for data fetcher and transformer.

    • In the transformer, please create data models for:
      • The hottest cities in your city list per day
      • Top 7 hottest days per city in each calendar year
      • An UPSERT dataset that keeps the latest weather information per city
      • The least humid city per state
      • Moving average of the temperature per city for 5 readings
  • The data model SQL and your design for its data modelling

  • Readme file with your notes
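Two of the data models listed above, the 5-reading moving average and the UPSERT of the latest reading per city, can be sketched with a window function and an `ON CONFLICT` clause. The example below is a sketch under stated assumptions, not a reference solution: it uses Python's built-in `sqlite3` (which accepts the same `ON CONFLICT ... DO UPDATE` and window syntax used here as Postgres, given SQLite 3.25 or newer), and the table names, column names, and sample temperatures are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE weather_reading (
        city_id     INTEGER NOT NULL,
        observed_at INTEGER NOT NULL,
        temperature REAL NOT NULL,
        PRIMARY KEY (city_id, observed_at)
    );
    CREATE TABLE latest_weather (
        city_id     INTEGER PRIMARY KEY,
        observed_at INTEGER NOT NULL,
        temperature REAL NOT NULL
    );
    """
)

# Six made-up readings for a single hypothetical city with id 1.
temps = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
rows = [(1, ts, temp) for ts, temp in enumerate(temps, start=1)]
conn.executemany("INSERT INTO weather_reading VALUES (?, ?, ?)", rows)

# Moving average over the current reading and the four before it, per city.
moving_avg = conn.execute(
    """
    SELECT city_id,
           observed_at,
           AVG(temperature) OVER (
               PARTITION BY city_id
               ORDER BY observed_at
               ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
           ) AS avg_temp_5
    FROM weather_reading
    ORDER BY city_id, observed_at
    """
).fetchall()

# UPSERT: keep one row per city and overwrite it only with newer readings.
UPSERT_SQL = """
    INSERT INTO latest_weather (city_id, observed_at, temperature)
    VALUES (?, ?, ?)
    ON CONFLICT (city_id) DO UPDATE SET
        observed_at = excluded.observed_at,
        temperature = excluded.temperature
    WHERE excluded.observed_at > latest_weather.observed_at
"""
for row in rows:
    conn.execute(UPSERT_SQL, row)
# A late-arriving older reading must not replace the newer one.
conn.execute(UPSERT_SQL, (1, 3, 99.0))
conn.commit()

latest = conn.execute("SELECT * FROM latest_weather").fetchall()
print(moving_avg[-1])
print(latest)
```

The first four rows of `moving_avg` average fewer than five readings because the window is still filling; whether to keep or drop those rows is a modelling decision worth noting in your Readme.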

Evaluation

We will use this project as our basis for our evaluation of your overall fit for a data engineering role from a technical viewpoint.

To do this, we will review your code with an eye for the following:

  • Readability and usability

  • Data processing and relational modelling

  • Python and SQL know-how

Time expectations

We know you are busy and likely have other commitments in your life, so we don't want to take too much of your time. We don't expect you to spend more than 2 hours working on this project. That being said, if you choose to put more or less time into it for whatever reason, that is your choice.

Feel free to indicate in your notes below if you worked on this for a different amount of time and we will keep that in mind while evaluating the project. You can also provide us with additional context if you would like to.

Additionally, we have left a spot below for your notes. If you have ideas for pieces that you would have done differently, or additional things you would have implemented if you had more time, you can indicate those in your notes as well, and we will use them as part of the evaluation.

Public forks

We encourage you to try this project without looking at the solutions others may have posted. This will give the most honest representation of your abilities and skills. However, we also recognize that day-to-day programming often involves looking at solutions others have provided and iterating on them. Being able to pick out the best parts and truly understand them well enough to make good choices about what to copy and what to pass on by is a skill in and of itself. As such, if you do end up referencing someone else's work and building upon it, we ask that you note that as a comment. Provide a link to the source so we can see the original work and any modifications that you chose to make.

Challenge instructions

Fork this repository and clone it to your local environment.

  • Prepare your environment with Python and any other tools you may need. Docker can do it for you.
    • To run the docker-compose, you need to run the following commands:
      # Create your own .env file from our sample and edit the .env file with the OpenWeatherMap API key
      cp env.sample .env
      # Initializing the folders and the non-root user for Airflow
      mkdir -p  ./logs ./plugins
      echo -e "AIRFLOW_UID=$(id -u)" >> .env
      # Initializing airflow database
      docker-compose up airflow-init
      # Running the docker-compose
      docker-compose up 
      # You can see the Airflow UI in http://localhost:8080 with username/password: airflow
    • If you run into any problems with the environment, please refer to here.
  • Fill in the TODOs in the repository. There are currently fewer than 5 TODOs, but you can go above and beyond.
    • Any problems with the DAGs? They are taken from here. Please take a look at the rest of the tutorial if needed.
    • You can read about the Postgres operator here.
    • To keep it simple, let's use the Airflow database for the storage of your dataset
  • Write down your notes in the Readme.md file.
  • Complete the challenge and push back to the repo.
    • If you have any questions at any step, please reach out to your recruiter. A member of the engineering team will be involved to support you, as if you were working for Bungalow.
  • Note: If you are using Apple hardware with an M1 processor, there is a common challenge with Docker. You can read more about it here.

Your notes (Readme.md)

@TODO: Add any additional notes / documentation in this file.

Time spent

Give us a rough estimate of the time you spent working on this. If you spent time learning in order to do this project please feel free to let us know that too. This makes sure that we are evaluating your work fairly and in context. It also gives us the opportunity to learn and adjust our process if needed.

Assumptions

Did you find yourself needing to make assumptions to finish this? If so, what were they and how did they impact your design/code?

Next steps

Provide us with some notes about what you would do next if you had more time. Are there additional features that you would want to add? Specific improvements to your code you would make?

Instructions to the evaluator

Provide any end-user documentation you think is necessary and useful here.