This is an archive repo for the Big Data Computing project.

hassanteymoori/Income-classification

Income classification

This is the final project for the Big Data Computing course (2020-2021) at Sapienza University.

Project Status: [Done]

Project Intro/Objective

The goal of this project is to predict whether a person earns more than 50K a year (a binary classification task).
I trained several supervised ML algorithms and compared their performance. The course requires PySpark with MLlib rather than plain Python with scikit-learn. I then ran model selection to choose the best classifier for predicting whether a person earns more than 50K a year.
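As a rough illustration of what model selection means for this task, the sketch below ranks candidate classifiers by F1 score computed from confusion-matrix counts. It is plain Python rather than PySpark so that it runs without a Spark session, and it is not the code used in this repo. The names `Confusion`, `f1_score`, `select_best`, `model_a`, and `model_b` are hypothetical, and the counts are made-up placeholders, not results from this project.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Confusion:
    """Confusion-matrix counts for the positive class (income > 50K)."""
    tp: int
    fp: int
    fn: int
    tn: int


def f1_score(c: Confusion) -> float:
    """Harmonic mean of precision and recall; 0.0 when either is undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def select_best(results: Dict[str, Confusion]) -> str:
    """Return the name of the candidate with the highest F1 score."""
    return max(results, key=lambda name: f1_score(results[name]))


# Hypothetical counts, for illustration only (not this project's results).
candidates = {
    "model_a": Confusion(tp=80, fp=20, fn=20, tn=880),
    "model_b": Confusion(tp=60, fp=5, fn=40, tn=895),
}
best = select_best(candidates)
```

F1 is used here only as one reasonable choice for an imbalanced target; accuracy or area under the ROC curve could be swapped in by replacing `f1_score`.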

Methods Used

  • Machine Learning
  • Data Visualization
  • Predictive Modeling
  • MLlib
  • PySpark

Technologies

  • Python
  • Pandas, Jupyter
  • NumPy
  • PySpark

Project Description and dataset

I used the Income classification dataset, which is publicly available on Kaggle. The dataset contains more than 40k entries and 15 columns. Preprocessing covered cleaning, imputing, encoding, balancing, and scaling. Because the dataset contains many categorical features, encoding increased the number of features to more than 100. Feature engineering was therefore applied to improve performance.
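To make two of these steps concrete, here is a minimal plain-Python sketch (not the PySpark code from this repo). It shows why one-hot encoding widens the feature space, and one possible way to balance classes by weighting them inversely to their frequency. The README does not say which balancing method was actually used, so the weighting scheme is an assumption. The helper names, column names, and toy rows are all hypothetical.

```python
from collections import Counter


def one_hot_width(rows, categorical_cols):
    """Count the columns produced by one-hot encoding: one per distinct value."""
    return sum(len({row[col] for row in rows}) for col in categorical_cols)


def balancing_weights(labels):
    """Weight each class inversely to its frequency so classes contribute equally."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Toy rows with hypothetical column names and values (not the real dataset).
rows = [
    {"workclass": "Private", "education": "Bachelors", "label": "<=50K"},
    {"workclass": "Private", "education": "Masters", "label": ">50K"},
    {"workclass": "State-gov", "education": "HS-grad", "label": "<=50K"},
    {"workclass": "Self-emp", "education": "Bachelors", "label": "<=50K"},
]

# Two categorical columns with 3 distinct values each become 6 binary columns.
width = one_hot_width(rows, ["workclass", "education"])
# The minority class gets the larger weight.
weights = balancing_weights([r["label"] for r in rows])
```

With many categorical columns, each contributing several distinct values, the same arithmetic explains how 15 raw columns can grow past 100 encoded features.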

Needs of this project

  • data exploration/descriptive statistics
  • data processing/cleaning
  • statistical modeling
  • writeup/reporting
  • MLlib learning
  • PySpark workflow
  • big-data concepts

Outcome

The results and performance metrics can be found in the outcome directory.