PROJECT: Built an ETL data pipeline that extracts job data from an API, transforms it into a single CSV file based on business requirements, and then loads the transformed data into an Amazon S3 bucket
A data engineer is required to build a data pipeline on Amazon EC2 that transforms job data from an API into a job.csv file for analysis by the data analyst, stores the transformed data in S3, and generates log files for error tracking. API for job data: https://www.themuse.com/developers/api/v2
Download the data from the API. The transformed data should include publication date, job name, job type, job location (i.e. city and country) and company name. Store the data in S3 for use by the data analyst.
The pipeline produces shell scripts, a Python script and the job.csv file uploaded to S3:
Shell scripts: The shell scripts control every operation, i.e. setting up the virtual environment, configuring logging and running the Python script.
Python script: The Python script transforms the data and uploads it to the S3 bucket.
job.csv: The final transformed data file based on the business requirements.
The required data is retrieved from the API by querying jobs from the first 50 pages, e.g. https://www.themuse.com/api/public/jobs?page=50
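A minimal sketch of this paginated fetch (the page numbering and the `results` key in the payload are assumptions based on the API description; error handling is kept simple so failures surface in the logs):

```python
import requests

API_URL = "https://www.themuse.com/api/public/jobs"  # public Muse jobs endpoint
PAGES = 50  # first 50 pages, per the business requirement


def fetch_jobs(pages: int = PAGES) -> list[dict]:
    """Request each page of the jobs API and collect the raw job records."""
    jobs = []
    for page in range(1, pages + 1):
        response = requests.get(API_URL, params={"page": page}, timeout=30)
        response.raise_for_status()  # surface HTTP errors so the shell script can log them
        jobs.extend(response.json().get("results", []))  # assumes records live under "results"
    return jobs
```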
The diagram shows the folder structure for the project and how the shell scripts create a virtual environment containing the dependencies listed in the requirements.txt file. The run.sh shell script activates the virtual environment and runs the run.py Python script, which connects to the API, transforms the data using pandas and then uploads the transformed job.csv file to the S3 bucket for the data analyst.
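A sketch of the pandas transformation described above; the nested field names (`name`, `type`, `locations`, `company`) reflect my reading of the Muse payload and may need adjusting:

```python
import pandas as pd


def transform_jobs(raw_jobs: list[dict]) -> pd.DataFrame:
    """Keep only the columns required by the business requirements."""
    records = []
    for job in raw_jobs:
        # each location entry looks like "City, Country" in the Muse payload (assumption)
        locations = "; ".join(loc.get("name", "") for loc in job.get("locations", []))
        records.append(
            {
                "publication_date": job.get("publication_date"),
                "job_name": job.get("name"),
                "job_type": job.get("type"),
                "job_location": locations,
                "company_name": job.get("company", {}).get("name"),
            }
        )
    return pd.DataFrame(records)


# df = transform_jobs(fetch_jobs())
# df.to_csv("job.csv", index=False)
```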
- Create an AWS account, log in to the AWS console, and create an Amazon Elastic Compute Cloud (EC2) instance (Ubuntu) and an S3 bucket with a directory to store the transformed CSV file, ensuring the EC2 instance and the S3 bucket are created in the same region. Ensure the EC2 instance is attached to the default Amazon VPC and default subnet so it has internet access through the default Internet Gateway.
- SSH into the EC2 instance (ensuring my IP is allowed to access the instance through the security group) via VS Code (using Remote Explorer) and create the project structure containing the shell scripts (init.sh, run.sh), the Python script (run.py), a .env file (to store access keys for my AWS account), a .config.toml file (containing the configuration for the specific S3 bucket directory and the API URL), a .gitignore file to ignore specific files (.env and the virtual environment folder created by running init.sh), and requirements.txt (containing all library dependencies required for the project). A rough sketch of how run.py might read these config files is shown below.
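This is only illustrative: the key names below are assumptions, and `tomllib` needs Python 3.11+ (the third-party `toml` package works on older versions):

```python
import os
import tomllib  # Python 3.11+; use the third-party "toml" package on older versions

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # load AWS credentials from the .env file into the environment

# illustrative environment variable names; boto3 also reads these automatically
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]

# illustrative .config.toml layout with an [api] and an [s3] section
with open(".config.toml", "rb") as f:
    config = tomllib.load(f)

API_URL = config["api"]["url"]        # e.g. https://www.themuse.com/api/public/jobs
S3_BUCKET = config["s3"]["bucket"]    # target bucket name
S3_PREFIX = config["s3"]["prefix"]    # directory (key prefix) for job.csv
```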
- Write the code for the shell scripts (init.sh, run.sh). The init.sh script installs the required libraries, creates the virtual environment, installs the dependencies for the virtual environment from the requirements.txt file and creates a log folder. I then run init.sh using `./init.sh` as shown below, which creates the virtual environment folder project_venv and the log folder.
- Write the code for the Python script `run.py`, which is initialized by the `run.sh` shell script. The run.sh shell script also creates log files and prints whether the Python script was executed successfully or not. When I run the run.sh script using `./run.sh`, the Python script (run.py) sends requests to the API, stores the payload, transforms the payload into the required data partitions based on the business requirements, combines the data partitions into a single file (job.csv) using the pandas library and then uploads job.csv to the S3 bucket. The run.sh shell script creates a log file each time it is executed and, if successful, prints a success message as shown below.
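The upload step in run.py might look roughly like this with boto3 (the bucket name and key in the commented call are placeholders; credentials are picked up from the environment populated by the .env file):

```python
import boto3


def upload_to_s3(file_path: str, bucket: str, key: str) -> None:
    """Upload the transformed job.csv to the configured S3 bucket directory."""
    # boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
    s3 = boto3.client("s3")
    s3.upload_file(file_path, bucket, key)


# placeholder bucket/key; the real values come from .config.toml
# upload_to_s3("job.csv", "my-job-data-bucket", "jobs/job.csv")
```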
- Check the Amazon S3 bucket to confirm the job.csv file has been successfully uploaded and view a copy in the GitHub repository.
Note: (1) The pipeline can be automated with a cron job if needed (an example crontab entry is shown below). (2) The log folder contains all log files generated while working on the project.
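For example, a crontab entry along these lines (the project path is a placeholder) would run the pipeline every day at midnight:

```cron
# run the pipeline daily at 00:00 and append cron output to a log file (placeholder paths)
0 0 * * * cd /home/ubuntu/project && ./run.sh >> logs/cron.log 2>&1
```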
- LinkedIn: @omolewajoshua
- Github: @joshua-omolewa
Give a ⭐️ if this project helped you!