The dataset used throughout the assignment is Human Stress Detection in and through Sleep.
To get insights into the data:
- EDA was conducted;
- Different models were used to predict the level of stress as a variable dependent on the sleep-characteristic features;
- A Color Decision Tree was constructed to better visualize the data in the dataset.
- Python libraries used for data visualisation and analysis:
  - scikit-learn for fitting the KNeighbors Classifier, Logistic Regression, and other models;
  - NumPy and pandas for modelling the data;
  - plotting libraries for visualizing the data.
All the graphs used below are reproducible with the code in the dedicated notebooks.
Refer to the EDA notebook.
The dataset has 8 predictors (features) and one dependent variable, stress level (from 0 to 4), which is later predicted with different models.
We can see that none of the predictors have outliers and the data divides almost perfectly into the 5 corresponding stress levels! Hmm... Suspicious! 🤔
The correlation between the features is also just too good to be true!
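For instance, the pairwise-correlation check can be reproduced with a few lines. A minimal sketch: the CSV name follows the Kaggle release of this dataset (adjust if yours differs), and seaborn/matplotlib are my choice of tooling, not necessarily what the EDA notebook uses:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset: 8 predictors plus the stress-level column
df = pd.read_csv('SaYoPillow.csv')

# Heatmap of pairwise correlations between all numeric columns
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature correlations')
plt.tight_layout()
plt.show()
```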
Refer to the modelling notebook.
Now, let's look at the results of the models' predictions.
However, first we need some data preprocessing.
- First, to use binary classification models, we need to collapse the 5 stress-level classes into two: we'll treat level 0 as "no stress", and levels 1-4 as samples collected while the subject was experiencing some level of stress.
- However, this leaves us with a class-imbalance problem... To fix it, we'll oversample the data using the SMOTE algorithm! (A sketch of both steps follows this list.)
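Here is a minimal sketch of both preprocessing steps. The frame and column names (`df`, `stress_level`) are assumptions for illustration (the noise cell below works on a frame called `gen`, so the notebook's names may differ); imblearn's `SMOTE` is the standard implementation of the algorithm mentioned above:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Binarize the target: level 0 -> no stress (0), levels 1-4 -> stress (1)
X = df.drop(columns=['stress_level'])
y = (df['stress_level'] > 0).astype(int)

# Oversample the minority class so both classes are equally represented
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(y_res.value_counts())  # both classes now have the same count
```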
Now everything looks good and we are ready to fit some models... Or not?
The data looks too perfect, and there has been speculation about it being artificially generated...
Real data collected from users is supposed to be noisy, so let's add some noise to it!
```python
# Example code from a Jupyter Notebook cell
import numpy as np

# Per-feature standard deviations for the Gaussian noise
# (the commented-out means are kept for reference only):
# random_mean = [60, 19.8, 94.5, 9.5, 92, 81, 5.3, 59.9]
random_std = [18, 4, 3.6, 4.45, 3.9, 13.6, 3.19, 9.8]

# Pick 254 random rows of the dataframe `gen` to perturb
gen_upd = gen.sample(254).index
for col, std in zip(gen.columns, random_std):
    if col == 'eye_movement':
        continue  # leave this feature untouched
    # Zero-mean Gaussian noise scaled by the feature's std
    a = std * np.random.randn(254)
    gen.loc[gen_upd, col] = gen.loc[gen_upd, col] + a
```
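One caveat: the noise is drawn fresh on every run, so the perturbed dataset differs each time; seeding NumPy first (e.g. `np.random.seed(42)`) makes the modification reproducible.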
I guess it looks more natural now, and there is overlap in the predictors!
After this last modification, we fit the models...
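A minimal sketch of that fitting loop, continuing from the resampled `X_res`/`y_res` above (the split parameters, `random_state`, and model settings are my assumptions, and the noise-injection step is omitted for brevity):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

models = {
    'KNN': KNeighborsClassifier(),
    'NB': GaussianNB(),
    'LR': LogisticRegression(max_iter=1000),
    'GB': GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC from predicted probabilities, per-class metrics from labels
    scores = model.predict_proba(X_test)[:, 1]
    print(f'{name}: AUC = {roc_auc_score(y_test, scores):.4f}')
    print(classification_report(y_test, model.predict(X_test)))
```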
Model | Area Under the Curve (AUC) | Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|---|---|
KNN | 0.9900 | 0 | 0.99 | 0.99 | 0.99 | 93 |
  |   | 1 | 0.99 | 0.99 | 0.99 | 109 |
NB | 0.8791 | 0 | 0.79 | 0.98 | 0.88 | 93 |
  |   | 1 | 0.98 | 0.78 | 0.87 | 109 |
LR | 0.9954 | 0 | 0.99 | 1.00 | 0.99 | 93 |
  |   | 1 | 1.00 | 0.99 | 1.00 | 109 |
GB | 1.0000 | 0 | 1.00 | 1.00 | 1.00 | 93 |
  |   | 1 | 1.00 | 1.00 | 1.00 | 109 |
Model | Accuracy | Macro-Avg Precision | Macro-Avg Recall | Macro-Avg F1-score | Support |
---|---|---|---|---|---|
KNN | 0.99 | 0.99 | 0.99 | 0.99 | 202 |
NB | 0.87 | 0.88 | 0.88 | 0.87 | 202 |
LR | 1.00 | 0.99 | 1.00 | 1.00 | 202 |
GB | 1.00 | 1.00 | 1.00 | 1.00 | 202 |
Well, the results are still suspiciously good!
DISCLAIMER: Color Decision Tree as a concept is not my invention and I do not claim any credit for it.
Refer to the Color DT notebook.
Color tree is a Decision Tree with a few modifications (a rough sklearn approximation of the node colouring follows this list):
- Colors of the nodes indicate which class's data samples dominate the node;
- Colors of the arrows indicate whether the proportion of data samples of the corresponding class has increased after the partition;
- Double arrows additionally indicate that the Gini Index has increased after the partition.
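The Color DT itself is a custom visualization, but the node-colouring part alone can be approximated with plain scikit-learn: `export_graphviz(..., filled=True)` shades each node by its dominant class (it does not colour arrows or draw double arrows). A rough sketch, reusing `X_train`/`y_train` from the modelling sketch above:

```python
import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# filled=True colors each node by the class that dominates it
dot = export_graphviz(tree, out_file=None,
                      feature_names=list(X_train.columns),
                      class_names=['no stress', 'stress'],
                      filled=True, rounded=True)
graphviz.Source(dot)  # renders inline in a Jupyter notebook
```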
Here is one example of such a tree: