This repository contains links to the projects I submitted for the Udacity Data Engineering Nanodegree Program. Check the program's syllabus for more details.
## Project 1a: Data Modeling with Postgres (Relational Database)
Model user activity data to create a relational database and ETL pipeline in PostgreSQL for a music streaming app.
- Installed PostgreSQL and configured a new user and database.
- Designed an optimized star schema (a fact table plus dimension tables) with normalized, constraint-checked tables for song play analysis queries.
- Built an ETL pipeline that extracts data from `.json` files, transforms incorrect values, and inserts the records into the tables with the `psycopg2` Python package (see the sketch after this list).
- Ran tests to verify the database creation.
- Created example queries and expected results.
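A minimal sketch of the insert step, assuming a local database; the table name, columns, connection details, and file path are illustrative, not the project's exact schema:

```python
import json
import psycopg2

# Connection details are assumptions; adjust to your local setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Hypothetical `songs` dimension table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id varchar PRIMARY KEY,
        title varchar,
        artist_id varchar,
        year int,
        duration numeric
    );
""")

song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
"""

# Load one raw record from a JSON file (path is illustrative).
with open("data/song_data/sample.json") as f:
    record = json.load(f)

cur.execute(song_table_insert, (
    record["song_id"], record["title"], record["artist_id"],
    record["year"], record["duration"],
))
conn.commit()
conn.close()
```

The `ON CONFLICT ... DO NOTHING` clause keeps the load idempotent if the same file is processed twice.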
## Project 1b: Data Modeling with Apache Cassandra (NoSQL Database)
Model event data to create a non-relational database and ETL pipeline in Apache Cassandra for a music streaming app.
- Installed Apache Cassandra.
- Designed denormalized tables optimized for the intended queries.
- Built an ETL pipeline that extracts data from `.csv` files, transforms incorrect values, and inserts the records into the tables with the `cassandra` Python package (see the sketch after this list).
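A minimal sketch of the Cassandra side, assuming a local cluster; the keyspace, table, and CSV columns are illustrative stand-ins for the project's event data:

```python
import csv
from cassandra.cluster import Cluster

# Connect to a local Cassandra instance; keyspace name is an assumption.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# Hypothetical table keyed for a "songs played in a given session" query.
session.execute("""
    CREATE TABLE IF NOT EXISTS session_songs (
        session_id int,
        item_in_session int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

insert = session.prepare("""
    INSERT INTO session_songs (session_id, item_in_session, artist, song, length)
    VALUES (?, ?, ?, ?, ?)
""")

# File name and column names are illustrative.
with open("event_datafile.csv", newline="") as f:
    for row in csv.DictReader(f):
        session.execute(insert, (
            int(row["sessionId"]), int(row["itemInSession"]),
            row["artist"], row["song"], float(row["length"]),
        ))

cluster.shutdown()
```

The partition key (`session_id`) and clustering column (`item_in_session`) are chosen so the table serves one specific query efficiently, which is the point of Cassandra's query-first modeling.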
## Project 2: Data Warehouse on AWS with Amazon Redshift
Build a Data Warehouse and an ETL pipeline that extracts data from Amazon S3, stages it in Amazon Redshift, and transforms it into a set of dimensional tables for the analytics team.
- Created an IAM role that gives Redshift read-only access to S3.
- Created a security group to allow access to Redshift from a specific IP address.
- Programmatically created a Redshift cluster and attached the previous policies using the `boto3` Python package (see the sketch after this list).
- Copied data from S3 into staging tables in Redshift.
- Transformed the data with SQL (PostgreSQL dialect) to build a set of dimensional analytics tables in Redshift.
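A minimal `boto3` sketch of the cluster-creation step; the region, node type, database name, credentials, and role ARN are placeholder assumptions:

```python
import boto3

# Region and identifiers below are assumptions for illustration.
redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    ClusterIdentifier="sparkify-cluster",
    MasterUsername="admin",
    MasterUserPassword="Passw0rd!",  # use a secrets manager in practice
    IamRoles=["arn:aws:iam::123456789012:role/redshiftS3ReadOnly"],  # hypothetical role ARN
)

# Block until the cluster is available, then read its endpoint.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="sparkify-cluster")
props = redshift.describe_clusters(ClusterIdentifier="sparkify-cluster")["Clusters"][0]
print(props["Endpoint"]["Address"])
```

Once the cluster is up, the staging step is a Redshift `COPY <table> FROM 's3://...' IAM_ROLE '<arn>'` statement issued over an ordinary PostgreSQL connection to the endpoint above.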
## Project 3: Data Lake on AWS S3 using Apache Spark
Build a Data Lake and an ETL pipeline in Apache Spark that loads data from S3, processes it into analytics tables, and writes them back to S3.
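A minimal PySpark sketch of this load-process-write pattern; the bucket paths, columns, and output layout are assumptions:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparkify-data-lake")
         .getOrCreate())

# Load raw song data from S3 (path is illustrative).
songs = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")

# Build a deduplicated songs analytics table.
songs_table = (songs
               .select("song_id", "title", "artist_id", "year", "duration")
               .dropDuplicates(["song_id"]))

# Write back to S3 as partitioned Parquet (output bucket is hypothetical).
(songs_table.write
 .mode("overwrite")
 .partitionBy("year", "artist_id")
 .parquet("s3a://output-bucket/songs/"))
```

Partitioning the Parquet output by `year` and `artist_id` lets downstream queries prune files instead of scanning the whole table.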
## Project 4: Data Pipelines with Airflow
Improve the company's data infrastructure by creating and automating a set of data pipelines with Apache Airflow, and by monitoring and debugging production pipelines.
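A minimal Airflow DAG skeleton showing the shape of such a pipeline (assumes Airflow 2.4+; the schedule and task names are placeholders standing in for the project's stage, load, and quality-check operators):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Owner, retries, dates, and schedule are assumptions for illustration.
default_args = {
    "owner": "sparkify",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

with DAG(
    dag_id="sparkify_etl",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Placeholder tasks standing in for the real operators.
    begin = EmptyOperator(task_id="begin_execution")
    stage_events = EmptyOperator(task_id="stage_events_to_redshift")
    load_fact = EmptyOperator(task_id="load_songplays_fact_table")
    quality_checks = EmptyOperator(task_id="run_data_quality_checks")
    end = EmptyOperator(task_id="stop_execution")

    begin >> stage_events >> load_fact >> quality_checks >> end
```

Declaring the dependencies with `>>` gives Airflow the task graph it needs to schedule, retry, and surface failures per task, which is what makes production monitoring and debugging tractable.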