This is a data science project built on a dataset of over 170,000 songs. We aim to create a playlist of songs with similar characteristics but mixed popularity, to improve the likelihood that a user discovers a new song they will enjoy.
To accomplish this goal, our project investigates the following three questions in order:
- Can we predict whether a song is popular or not based on its attributes?
- Can we predict whether a song is relatively more popular than another based on their attributes?
- Can we suggest to a user a new song based on the current song they are listening to?
- To access the full detailed report, review final_report.pdf
- To access the code notebook with a preview on GitHub, review SongPredictorFinal.ipynb
- To access the code notebook with a preview in a browser, review SongPredictorFinal.html
- To access the dataset, review cleaned_spotify.csv
(I recognize that the facecolor of the plot axes may blend into the background in dark mode. I will fix this in a future iteration of this README.md. For a closer look, please review the provided report/notebook.)
To answer question 1, we used a voting ensemble classifier with a random forest classifier base estimator with default parameters to predict whether a song is popular or not popular. The model's F1 score was 0.878.
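As a rough illustration of the setup described above, the sketch below wraps a default random forest in scikit-learn's `VotingClassifier` and scores it with F1. The synthetic data from `make_classification`, the train/test split, and the `voting="hard"` setting are assumptions made for this example; the actual features, labels, and any additional base estimators are in the notebook, so the score here will not match 0.878.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for song attributes and a binary "popular" label
# (assumption: the real project uses cleaned_spotify.csv instead).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Voting ensemble with a default-parameter random forest as base estimator.
# Further estimators could be appended to this list.
clf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0))],
    voting="hard",
)
clf.fit(X_train, y_train)

# F1 balances precision and recall on the "popular" class.
y_pred = clf.predict(X_test)
f1 = f1_score(y_test, y_pred)
print(f"F1 on held-out data: {f1:.3f}")
```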
To answer question 2, we arrived at a gradient boosting regressor with a learning rate of 0.07 and a max depth of 10 to produce a pairwise ranking accuracy of 0.82. The feature importance is given below.
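The README does not define pairwise ranking accuracy, so the sketch below shows one common interpretation: the fraction of song pairs whose predicted popularity order matches their true order, skipping pairs with tied true popularity. The function name `pairwise_ranking_accuracy` and the sample values are hypothetical; in practice the predictions would come from the fitted regressor.

```python
from itertools import combinations


def pairwise_ranking_accuracy(y_true, y_pred):
    """Fraction of pairs ordered the same way by y_pred as by y_true.

    Pairs with equal true values carry no ordering and are skipped.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(y_true)), 2):
        true_diff = y_true[i] - y_true[j]
        if true_diff == 0:
            continue
        pred_diff = y_pred[i] - y_pred[j]
        total += 1
        # Same sign means the model ranks the pair in the right order.
        if true_diff * pred_diff > 0:
            correct += 1
    return correct / total if total else 0.0


# Hypothetical popularity scores for four songs.
true_popularity = [80, 55, 30, 10]
predicted_popularity = [70, 60, 20, 25]
accuracy = pairwise_ranking_accuracy(true_popularity, predicted_popularity)
print(f"Pairwise ranking accuracy: {accuracy:.3f}")
```

In this toy case only the last two songs are ordered incorrectly, so five of the six pairs count as correct.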
To answer question 3, we reduced the dimensionality of the dataset to 2 dimensions with PCA and leveraged k-means clustering to cluster similar songs. From the elbow method, we determined that 4 clusters were most effective at describing the dataset.
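The pipeline above can be sketched roughly as follows, using scikit-learn. The synthetic data from `make_blobs`, the standardization step, and the range of cluster counts tried are assumptions for this example; the final step, which lists songs sharing a cluster with the current song, is one plausible way to turn clusters into suggestions and may differ from what the notebook does.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric song attributes (assumption).
X, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=0)

# Standardize so no single attribute dominates, then project to 2 dimensions.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Elbow method: record inertia (within-cluster sum of squares) for each k
# and look for the point where adding clusters stops helping much.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    inertias[k] = km.inertia_

# Cluster with the chosen k (4 in this project).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_2d)

# Candidate suggestions: other songs in the same cluster as the current song.
current = 0
candidates = [i for i, lab in enumerate(labels)
              if lab == labels[current] and i != current]
print(f"{len(candidates)} candidate songs share a cluster with song {current}")
```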
Contributors' names
- Eric Li
- Wanqin Chen
- Yuwei Wang
Dataset and supplemental material