Fall 2021 Final Project for CM226 - Machine Learning in Bioinformatics
- The Cancer Genome Atlas (TCGA) PanCanAtlas RNA-seq data from the National Cancer Institute Genomic Data Commons.
- These data consisted of 11,069 samples with 20,531 measured genes.
Preprocessing:
- Tumors that were measured from multiple sites were removed.
- The data were normalised.
- This resulted in a final TCGA PanCanAtlas gene expression matrix of 11,060 samples (covering 33 different cancer types) by 16,148 genes.
- The data were split into 90% training and 10% testing partitions, such that each partition contained a relatively equal representation of each cancer type.
- The models were built using the scikit-learn (sklearn) library.
- The VAE model is inspired by Tybalt's implementation.
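
The 90%/10% split with balanced cancer-type representation described above could be reproduced along these lines. This is only a minimal sketch, not the project's actual code: it assumes the split was done with scikit-learn's `train_test_split` using `stratify`, and the matrix `X`, the labels `y`, and their sizes are synthetic placeholders standing in for the real expression matrix and cancer-type labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix (samples x genes).
# Sizes are illustrative only and much smaller than the real data.
n_samples, n_genes, n_types = 1000, 50, 6
X = rng.random((n_samples, n_genes))
y = rng.integers(0, n_types, size=n_samples)  # hypothetical cancer-type labels

# Hold out 10% for testing; stratify=y keeps the per-type proportions
# approximately equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

# Compare the fraction of each cancer type in the two partitions.
train_frac = np.bincount(y_train, minlength=n_types) / len(y_train)
test_frac = np.bincount(y_test, minlength=n_types) / len(y_test)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("train fractions:", np.round(train_frac, 3))
print("test fractions: ", np.round(test_frac, 3))
```

Stratifying on the label matters here because some cancer types are likely to have far fewer samples than others; without it, a purely random 10% hold-out could under-represent or miss the rarer types.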
- Rushi Bhatt
- Ronak Kaoshik
- Shruti Mohanty
See also the list of contributors who participated in this project.
This project is licensed under the MIT License. See the LICENSE.md file for details.