Bigdata-Processing-AWS

Introduction

In this project, we process a dataset from Kaggle (Amazon Book Ratings.csv). The dataset is large (3 GB, 1 million records), so we process it with AWS EMR (Elastic MapReduce).

Architecture

Implementation Steps

Load the source data from Kaggle into an S3 bucket
Create an EMR cluster running on EC2 instances
Create an EC2 key pair and connect to the cluster over SSH from your terminal
Create the script Amazon_Book_Review.py
Run the script with spark-submit
After processing, the filtered data is written to the S3 bucket
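
As a rough illustration of the script and spark-submit steps, here is a minimal sketch of what a script like Amazon_Book_Review.py could look like. It is not the actual script from this repository: the filter rule, the `rating` column name, the `min_rating` threshold, and the `s3_uri` helper are all assumptions made for the example, and the bucket and key names are placeholders.

```python
"""Hypothetical sketch of a Spark job that filters a ratings CSV stored on S3."""
import sys


def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI from a bucket name and an object key (helper assumed for this sketch)."""
    return f"s3://{bucket.strip('/')}/{key.lstrip('/')}"


def main(input_uri: str, output_uri: str, min_rating: float = 4.0) -> None:
    # Imported inside the function so the helper above works without Spark installed;
    # on an EMR cluster created with Spark, pyspark is already available.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("AmazonBookReview").getOrCreate()

    # Read the CSV from S3; this assumes the file has a header row.
    df = spark.read.csv(input_uri, header=True, inferSchema=True)

    # Assumed filter: keep rows whose (hypothetical) "rating" column is >= min_rating.
    filtered = df.filter(F.col("rating") >= min_rating)

    # Write the filtered rows back to S3 as CSV, replacing any previous output.
    filtered.write.mode("overwrite").csv(output_uri, header=True)
    spark.stop()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Expected invocation: spark-submit Amazon_Book_Review.py <input s3 uri> <output s3 uri>
    main(sys.argv[1], sys.argv[2])
```

From the SSH session on the cluster's primary node, a script of this shape would be run with spark-submit followed by the script name and the two S3 URIs.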

Without installing Spark or Hadoop ourselves, we can process big data using AWS Elastic MapReduce (EMR).
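
Because EMR provides Spark on the cluster, a job can also be submitted from any machine with AWS credentials, without an SSH session. The sketch below shows one way this might be done with boto3, as an alternative to the SSH and spark-submit steps listed above rather than the method used in this project. The cluster ID, region, S3 URIs, and the function names `spark_step` and `submit` are placeholders assumed for the example.

```python
"""Hypothetical sketch: submit a Spark script to a running EMR cluster as a step."""


def spark_step(script_uri: str, input_uri: str, output_uri: str,
               name: str = "Amazon book review filter") -> dict:
    """Describe an EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_uri, input_uri, output_uri,
            ],
        },
    }


def submit(cluster_id: str, step: dict, region: str = "us-east-1") -> str:
    """Send the step to the cluster and return its step ID (requires boto3 and AWS credentials)."""
    import boto3  # imported here so spark_step can be used without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Example step definition; every URI below is a placeholder.
step = spark_step(
    "s3://my-bucket/scripts/Amazon_Book_Review.py",
    "s3://my-bucket/input/Amazon Book Ratings.csv",
    "s3://my-bucket/output/",
)
```

Calling `submit("<cluster id>", step)` would queue the job; the script must already be uploaded to S3 at the given URI for cluster deploy mode to find it.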


vekr1518/Bigdata-Processing-AWS
