Aarhus-Psychiatry-Research/psycop-train-test-split

psycop-train-test-split


Code for splitting data derived from the PSYCOP project into train/test splits.

If each combination of `stratification_cols` is treated as a stratum, some strata may have only one member. For example, there may be only one patient who has both `lung_cancer` and `schizophrenia`.

But we don't need each combination to be balanced; we only need each category to be balanced individually.

For example, suppose we have two binary variables, cancer and schizophrenia:

Unsplit

| patient_id | cancer | schizophrenia |
|------------|--------|---------------|
| 1          | 1      | 1             |
| 2          | 1      | 0             |
| 3          | 0      | 1             |

The combination cancer + schizophrenia, for example, has only one individual, so it cannot be represented in both train and test. When the combinations are used as strata, sklearn raises:

ValueError: The least populated class in y has only 1 member, which is too few. 
The minimum number of groups for any class cannot be less than 2.
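The error can be reproduced with a minimal sketch like the one below. It assumes the combinations are turned into one label per patient and passed to `train_test_split` via `stratify`; the variable names and the `test_size` value are illustrative and not taken from this repository.

```python
from sklearn.model_selection import train_test_split

# The three patients from the "Unsplit" table.
patient_ids = [1, 2, 3]
cancer = [1, 1, 0]
schizophrenia = [1, 0, 1]

# Assumption: one stratum label per patient, built from the combination
# of both columns, e.g. "1_1" for cancer + schizophrenia.
strata = [f"{c}_{s}" for c, s in zip(cancer, schizophrenia)]

error_message = None
try:
    # Every combination occurs only once, so stratifying on it fails.
    train_test_split(patient_ids, test_size=0.5, stratify=strata, random_state=0)
except ValueError as err:
    error_message = str(err)

print(error_message)
```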

But we don't care about that; we just want cancer and schizophrenia each to be balanced. A split like this would work well for us:

Train

| patient_id | cancer | schizophrenia |
|------------|--------|---------------|
| 1          | 1      | 1             |

Test

| patient_id | cancer | schizophrenia |
|------------|--------|---------------|
| 2          | 1      | 0             |
| 3          | 0      | 1             |

This is balanced, because in both train and test there's exactly one individual with cancer, and one with schizophrenia.
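The balance claim can be checked with a small sketch. It assumes "balanced" means that each column has the same number of positive patients in train and in test, which is an interpretation of the example rather than the criterion this repository necessarily implements. The helper names `positives_per_column` and `is_balanced` are hypothetical.

```python
from itertools import combinations

# Patient table from the example: patient_id -> (cancer, schizophrenia)
patients = {1: (1, 1), 2: (1, 0), 3: (0, 1)}
columns = ("cancer", "schizophrenia")


def positives_per_column(ids):
    """Count how many of the given patients are positive in each column."""
    return {
        col: sum(patients[pid][i] for pid in ids)
        for i, col in enumerate(columns)
    }


def is_balanced(train_ids, test_ids):
    """Assumed criterion: equal positive counts per column in both splits."""
    return positives_per_column(train_ids) == positives_per_column(test_ids)


# The split proposed in the text.
train, test = (1,), (2, 3)
print(positives_per_column(train))
print(positives_per_column(test))
print(is_balanced(train, test))

# Brute force over every non-empty train/test partition of the three patients.
all_ids = tuple(patients)
balanced_splits = []
for size in range(1, len(all_ids)):
    for train_ids in combinations(all_ids, size):
        test_ids = tuple(pid for pid in all_ids if pid not in train_ids)
        if is_balanced(train_ids, test_ids):
            balanced_splits.append((train_ids, test_ids))
print(balanced_splits)
```

Under this criterion, the only balanced partitions of the three patients are the split shown above and its mirror image with train and test swapped.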
