psycop-train-test-split

Code for splitting data derived from the PSYCOP project into train and test sets.

If each combination of values in stratification_cols is treated as a stratum, some strata may have only one member. For example, there may be only one patient with both lung_cancer and schizophrenia.

But we don't care about each combination being balanced; we only care about each category being balanced individually.

E.g. if we have two binary variables, cancer and schizophrenia:

Unsplit

patient_id	cancer	schizophrenia
1	1	1
2	1	0
3	0	1
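
As a rough illustration of the difference, the sketch below counts the same three patients in two ways: once per combination of values and once per column. It uses only the standard library, and the variable names (`patients`, `combination_counts`, `column_counts`) are illustrative assumptions, not part of this package.

```python
from collections import Counter

# The three patients from the "Unsplit" table, as plain dicts.
patients = [
    {"patient_id": 1, "cancer": 1, "schizophrenia": 1},
    {"patient_id": 2, "cancer": 1, "schizophrenia": 0},
    {"patient_id": 3, "cancer": 0, "schizophrenia": 1},
]
stratification_cols = ["cancer", "schizophrenia"]

# View 1: every combination of column values is its own stratum.
combination_counts = Counter(
    tuple(patient[col] for col in stratification_cols) for patient in patients
)

# View 2: each column is counted on its own (number of positives).
column_counts = {
    col: sum(patient[col] for patient in patients) for col in stratification_cols
}

print(combination_counts)  # size of each combined stratum
print(column_counts)       # positives per individual column
```

Counted per combination, the (1, 1) stratum holds a single patient, while counted per column both cancer and schizophrenia have two positives, which is enough to place one in each split.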

The combination cancer + schizophrenia has only one individual, so it can't be balanced across train and test. sklearn then raises:

ValueError: The least populated class in y has only 1 member, which is too few. 
The minimum number of groups for any class cannot be less than 2.

But we don't care about that; we just want cancer and schizophrenia each to be balanced. A split like this would work well for us:

Train

patient_id	cancer	schizophrenia
1	1	1

Test

patient_id	cancer	schizophrenia
2	1	0
3	0	1

This is balanced because both train and test contain exactly one individual with cancer and exactly one with schizophrenia.
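
The balance claim can be checked with a small sketch like the one below. The helper `positives_per_column` is hypothetical (it is not a function from this package), and it compares raw counts per column, mirroring the sentence above; on real data with splits of different sizes, comparing proportions would probably be more appropriate.

```python
def positives_per_column(rows, cols):
    """Count how many rows are positive (== 1) in each column."""
    return {col: sum(row[col] for row in rows) for col in cols}


# The Train and Test tables from the example.
train = [{"patient_id": 1, "cancer": 1, "schizophrenia": 1}]
test = [
    {"patient_id": 2, "cancer": 1, "schizophrenia": 0},
    {"patient_id": 3, "cancer": 0, "schizophrenia": 1},
]
cols = ["cancer", "schizophrenia"]

train_counts = positives_per_column(train, cols)
test_counts = positives_per_column(test, cols)

# Balanced in the sense used here: the same number of positives
# per column in both splits.
is_balanced = train_counts == test_counts
print(train_counts, test_counts, is_balanced)
```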
