Income classification

This is the final project for the Big-data computing (2020-2021) course at Sapienza University.

Project Status: [Done]

Project Intro/Objective

The goal of this project is to predict whether a person makes over 50K a year (a binary classification task).
I trained several supervised ML algorithms and compared their performance. The project is expected to use PySpark with MLlib instead of plain Python with scikit-learn. I then performed model selection to choose the best classifier for predicting whether a person makes over 50K a year.

Methods Used

Machine Learning
Data Visualization
Predictive Modeling
MLlib
PySpark

Technologies

Python
Pandas, Jupyter
Numpy
PySpark

Project Description and dataset

I used the Income classification dataset for this project, which is publicly available on Kaggle. The dataset contains more than 40k entries and 15 columns. Preprocessing covered cleaning, imputing, encoding, balancing and scaling. Because the dataset contains many categorical features, encoding increased the number of features to more than 100. Feature engineering was therefore applied to improve performance.

Needs of this project

data exploration/descriptive statistics
data processing/cleaning
statistical modeling
writeup/reporting
MLlib learning
PySpark workflow
big-data concepts

Outcome

The results and performance metrics can be found in the outcome directory.
