This is the final project for the Big Data Computing (2020-2021) course at Sapienza University.
The goal is a binary classification task: predicting whether a person earns more than 50K a year.
I trained several supervised ML algorithms and compared their performance. The project is required to use PySpark with MLlib rather than plain Python with scikit-learn. After comparing the candidates, I selected the best-performing classifier for predicting whether a person earns more than 50K a year.
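The model selection step can be illustrated with a small, dependency-free sketch. It is not the project's actual code: the labels, the predictions, and the names `model_a` and `model_b` are toy placeholders, and the F1 score is an assumed comparison metric, since the metric used in the project is not stated here. In the project itself the scoring would presumably go through PySpark's evaluators (for example `MulticlassClassificationEvaluator`) rather than hand-written functions.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class (here: income over 50K)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(y_true, predictions):
    """Score every candidate on the same held-out labels and pick the top one."""
    scores = {name: f1_score(y_true, y_pred) for name, y_pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy held-out labels (1 = over 50K) and predictions from two hypothetical models.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
predictions = {
    "model_a": [1, 0, 0, 0, 0, 1, 0, 1],
    "model_b": [1, 0, 0, 1, 0, 1, 1, 0],
}

best_name, scores = select_best(y_true, predictions)
print(best_name, scores)
```

All candidates are scored on the same held-out labels so that the comparison is fair. A class-sensitive metric such as F1 is a reasonable choice when far fewer people fall in the over-50K class than in the other.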
- Machine Learning
- Data Visualization
- Predictive Modeling
- MLlib
- PySpark
- Python
- Pandas, Jupyter
- NumPy
I used the Income classification dataset for this project, which is publicly available on Kaggle. The dataset contains more than 40k entries and 15 columns. It went through several pre-processing steps: cleaning, imputation, encoding, balancing, and scaling. Because the dataset contains many categorical features, encoding increased the number of features to more than 100. Feature engineering was therefore applied to improve performance.
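To see why encoding inflates the feature count, here is a minimal sketch of one-hot encoding in plain Python. The rows, column names, and category values are toy assumptions rather than the real dataset, and plain Python is used only to keep the illustration dependency-free. In the project the same step would go through PySpark's `StringIndexer` and `OneHotEncoder`.

```python
# Toy rows with hypothetical column names and values (not the real data).
rows = [
    {"age": 39, "workclass": "State-gov", "education": "Bachelors"},
    {"age": 50, "workclass": "Self-emp", "education": "Bachelors"},
    {"age": 38, "workclass": "Private", "education": "HS-grad"},
    {"age": 53, "workclass": "Private", "education": "11th"},
]
categorical_cols = ["workclass", "education"]
numeric_cols = ["age"]


def fit_categories(rows, cols):
    """Collect the sorted distinct values of each categorical column."""
    return {c: sorted({r[c] for r in rows}) for c in cols}


def encode_row(row, categories, numeric_cols):
    """Numeric values first, then one 0/1 indicator per category value."""
    vector = [float(row[c]) for c in numeric_cols]
    for col, values in categories.items():
        vector.extend(1.0 if row[col] == v else 0.0 for v in values)
    return vector


categories = fit_categories(rows, categorical_cols)
encoded = [encode_row(r, categories, numeric_cols) for r in rows]

# Each categorical column contributes one feature per distinct value.
print(len(rows[0]), "raw columns ->", len(encoded[0]), "encoded features")
```

Each categorical column is replaced by as many indicator features as it has distinct values, so a handful of high-cardinality columns is enough to push the total past 100. Note that Spark's `OneHotEncoder` drops the last category by default, so its output has one feature fewer per column than this sketch.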
- data exploration/descriptive statistics
- data processing/cleaning
- statistical modeling
- writeup/reporting
- MLlib learning
- PySpark workflow
- big-data concepts
The results and performance metrics can be found in the outcome directory.