Income classification

This is the final project for the Big Data Computing course (2020-2021) at Sapienza University.

Project Status: [Done]

Project Intro/Objective

The goal of this project is to predict whether a person earns more than 50K a year (a binary classification task).
I trained several supervised ML algorithms and compared their performance. The course required PySpark with MLlib rather than plain Python with scikit-learn. I then performed model selection to choose the best classifier for predicting whether a person earns more than 50K a year.

Methods Used

  • Machine Learning
  • Data Visualization
  • Predictive Modeling
  • MLlib
  • PySpark

Technologies

  • Python
  • Pandas, Jupyter
  • NumPy
  • PySpark

Project Description and dataset

I used the Income classification dataset, which is publicly available on the Kaggle website. The dataset contains more than 40k entries and 15 columns. The data went through pre-processing, cleaning, imputation, encoding, balancing and scaling. Because the dataset contains many categorical features, encoding increased the number of features to more than 100. Feature engineering was therefore applied to improve performance.
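To illustrate why encoding inflates the feature count, here is a minimal plain-Python sketch. The rows and column names are made-up toy values, and it assumes one output column per distinct category (Spark's `OneHotEncoder` drops the last category by default, so it would produce one column fewer per feature).

```python
# Toy rows with hypothetical columns; not taken from the real dataset.
rows = [
    {"age": 39, "workclass": "State-gov", "education": "Bachelors"},
    {"age": 50, "workclass": "Self-emp", "education": "Bachelors"},
    {"age": 38, "workclass": "Private", "education": "HS-grad"},
    {"age": 53, "workclass": "Private", "education": "11th"},
]
categorical = ["workclass", "education"]
numeric = ["age"]


def one_hot_width(rows, categorical, numeric):
    """Return (total feature count, per-column widths) after one-hot encoding."""
    # Each categorical column expands to one column per distinct value.
    widths = {col: len({row[col] for row in rows}) for col in categorical}
    # Numeric columns keep a width of one each.
    return len(numeric) + sum(widths.values()), widths


total, widths = one_hot_width(rows, categorical, numeric)
print(total, widths)
```

Three raw columns already become seven encoded features here; with many high-cardinality categorical columns the total grows in the same way.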

Needs of this project

  • data exploration/descriptive statistics
  • data processing/cleaning
  • statistical modeling
  • writeup/reporting
  • MLlib learning
  • PySpark workflow
  • big-data concepts

Outcome

The results and performance metrics can be found in the outcome directory.