Characterization and predictive role of human-specific genes in Acute Lymphoblastic Leukemia

Table of Contents

Authors

Project developed by:

All authors contributed equally.

Overview

  • Analyzing data related to Acute Lymphoblastic Leukemia (ALL)
  • Utilizing bioinformatics techniques
  • Seeking insights into underlying mechanisms
  • Work in progress

Dataset

| Dataset | Origin (bone marrow or blood) | Cells | Samples | Healthy samples | Tumor samples | Age group |
|---|---|---|---|---|---|---|
| GSE84445 | Blood | CD4_CD8_T | 20 (from 10 donors, divided evenly between CD4 and CD8) | 20 | 0 | Adult |
| GSE133499 | Both | All | 42 | 0 | 38 | Pediatric |
| GSE181157 | Both | All | 173 | 0 | 173 | 5 Adult, 168 Pediatric |
| GSE227832 | Both | All | 331 (340 counting duplicated samples) | 10 | 321 | Pediatric |
| Cohort_7_8 | Unknown | All | 108 | 0 | 108 | 12 Pediatric, 2 Not Known, 94 Adult |
| GSE139073 | Bone marrow | Bone marrow stromal cells | 40 | 40 | 0 | Adult |
| GSE115736 | Blood | CD4_CD8_T, B | 18 | 18 | 0 | Unknown |
| GSE228632 | Bone marrow | Bone marrow mononuclear cells: mixed population of single-nucleus cells including monocytes, lymphocytes, and hematopoietic stem and progenitor cells | 65 | 0 | 65 | Pediatric |
| MaSpore.RNASeq | Both | All (unconfirmed) | 377 | 0 | 377 | Pediatric |

Overview

This project investigates the role of human-specific genes in Acute Lymphoblastic Leukemia (ALL). Using differential expression analysis, functional enrichment, and machine-learning classifiers, we aim to extend the current understanding of these genes and their association with ALL. This repository contains the data, methods, and tools used in our study.

Motivation

Background

Humans and chimpanzees diverged approximately 6 million years ago. This evolutionary split has led to rapid genetic alterations in the human lineage, resulting in significant differences, particularly in diet, immune function, and anatomy. These unique genetic features, termed ‘human-specific’ genes, are not entirely understood. Many studies have aimed to identify these genes and understand their association with human-specific diseases.

Acute Lymphoblastic Leukemia (ALL)

ALL is the most common type of leukemia in children, accounting for 80% of pediatric leukemia cases. It involves the malignant transformation of lymphoid progenitor cells, leading to abnormal proliferation and differentiation. This disease is closely linked to genetic aberrations, including complex chromosomal rearrangements.

Study Objective

This study aims to characterize human-specific genes associated with ALL and evaluate their predictive role. By analyzing differential gene expression and performing enrichment analysis, we seek to identify key human-specific genes involved in ALL and develop classifiers to predict tumor subtypes.

Methods

Data Collection

We used 8 datasets (more to come; work in progress) from the Gene Expression Omnibus (GEO), comprising mRNA expression profiles derived from bone marrow and blood cells of 794 patients. Samples were categorized as tumoral ALL samples or controls. Metadata allowed further classification by age (adult and pediatric) and tumor subtype (B, T, PreB, and PreT).

Data Preprocessing

  • Batch Correction: ComBat-seq from the sva package in R was used to correct batch effects in the count matrices.
  • Normalization: Data normalization was performed using the Trimmed Mean of M-values (TMM) method to ensure comparability across samples.
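To make the normalization step concrete, here is a minimal sketch of the idea behind TMM. It is written in Python purely for illustration (the project's scripts are in R and use edgeR), and it is a simplification: the function name `tmm_factor`, the trimming defaults, and the toy counts are assumptions, and the result will not match edgeR exactly.

```python
from math import log2

def tmm_factor(sample, reference, m_trim=0.30, a_trim=0.05):
    """Simplified TMM scaling factor of `sample` relative to `reference`.

    Both arguments are raw count vectors over the same genes.
    Assumes at least two genes are expressed in each library.
    """
    n_s, n_r = sum(sample), sum(reference)
    genes = []  # (M, A, weight) for genes expressed in both libraries
    for y_s, y_r in zip(sample, reference):
        if y_s == 0 or y_r == 0:
            continue
        p_s, p_r = y_s / n_s, y_r / n_r
        m = log2(p_s / p_r)        # log ratio of relative abundances
        a = 0.5 * log2(p_s * p_r)  # average log abundance
        var = (n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r)
        genes.append((m, a, 1.0 / var))
    n = len(genes)
    by_m = sorted(range(n), key=lambda i: genes[i][0])
    by_a = sorted(range(n), key=lambda i: genes[i][1])
    cut_m, cut_a = int(n * m_trim), int(n * a_trim)
    # Keep genes that survive trimming on both M and A
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    total_w = sum(genes[i][2] for i in keep)
    mean_m = sum(genes[i][0] * genes[i][2] for i in keep) / total_w
    return 2 ** mean_m

# Hypothetical toy counts: the second library is the first sequenced twice as deep
reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
deeper = [2 * c for c in reference]
factor = tmm_factor(deeper, reference)
```

Because the second library differs only in sequencing depth, its factor is 1: TMM corrects for composition differences, while depth is handled by the library size itself. The effective library size is the raw library size multiplied by the factor.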

Differential Gene Expression Analysis

  • Tools Used: The edgeR package in R was used to identify differentially expressed genes (DEGs).
  • Filtering Criteria: DEGs were filtered using a p-value threshold of < 0.01. Genes were considered significant when the absolute value of their log-transformed fold change exceeded 1.5.
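The filtering criteria above can be sketched as a simple two-condition filter. This Python snippet is illustrative only and is not the project's R code; the `GeneResult` record, the gene names, and the values are hypothetical, and the use of strict inequalities is an assumption.

```python
from dataclasses import dataclass

@dataclass
class GeneResult:
    gene: str
    log_fc: float   # log-transformed fold change (tumor vs. control)
    p_value: float

def filter_degs(results, p_cutoff=0.01, lfc_cutoff=1.5):
    """Keep genes with p < p_cutoff and |log fold change| > lfc_cutoff."""
    return [r for r in results
            if r.p_value < p_cutoff and abs(r.log_fc) > lfc_cutoff]

# Made-up results for illustration
toy = [
    GeneResult("GENE_A", 2.3, 0.001),    # up, significant
    GeneResult("GENE_B", -1.8, 0.004),   # down, significant
    GeneResult("GENE_C", 0.7, 0.0005),   # fold change too small
    GeneResult("GENE_D", 3.1, 0.2),      # p-value too large
]
deg_names = [r.gene for r in filter_degs(toy)]
```

Taking the absolute value keeps both up- and down-regulated genes, which is what a ±1.5 threshold implies.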

Extraction of Human-Specific Genes

A reference list of human-specific genes, derived from existing literature, was used to extract relevant DEGs from our dataset.
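Conceptually, this extraction is an intersection between the DEG list and the reference list. The Python sketch below is illustrative only; the function name, the case-insensitive matching, and the toy lists are assumptions (the gene symbols are borrowed from the Results section, and `GENE_X` is a placeholder).

```python
def extract_human_specific(deg_names, reference_genes):
    """Return the DEGs that appear in the human-specific reference list.

    Matching ignores case; the order of `deg_names` is preserved.
    """
    reference = {g.upper() for g in reference_genes}
    return [g for g in deg_names if g.upper() in reference]

# Toy inputs for illustration only
reference_list = ["EBF1", "MYO7B", "RAB6C"]
degs = ["GENE_X", "ebf1", "RAB6C"]
human_specific_degs = extract_human_specific(degs, reference_list)
```

In practice, gene identifiers may need to be harmonized (for example, symbols versus Ensembl IDs) before the two lists are intersected.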

Enrichment and Pathway Analysis

  • Functional Enrichment: Enrichment analysis was performed using the clusterProfiler and enrichR packages in R. The analysis focused on the Biological Process (BP) sub-ontology of the Gene Ontology.
  • Pathway Analysis: WikiPathways was used for pathway analysis, identifying key pathways that involve the identified DEGs.
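Over-representation analyses of this kind typically rest on a hypergeometric (one-sided Fisher) test: given a gene universe, how surprising is the overlap between the DEG list and a term's gene set? The Python sketch below shows only that core calculation, with hypothetical names and toy numbers; it does not reproduce the packages' multiple-testing correction or annotation handling.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, term_size, list_size, overlap):
    """P(X >= overlap) when drawing `list_size` genes from a universe
    of `universe_size` genes, `term_size` of which belong to the term."""
    max_k = min(term_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

# Toy example: 10 genes in the universe, 3 annotated to the term,
# 3 DEGs, and all 3 DEGs fall in the term
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

Here the p-value is 1/120, because only one of the 120 possible 3-gene draws consists entirely of term members.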

Classification and Predictive Modeling

  • Methodology: Several classification methods were implemented using R libraries, including ensemble methods, non-parametric methods, and CPU-based deep learning.
  • Training and Validation: The classifiers were trained on differentially expressed human-specific genes. A three-fold cross-validation with two repeats was used, employing the ADASYN algorithm for balanced sampling.
  • Hyperparameter Tuning: Hyperparameters were tuned using Latin hypercube sampling followed by simulated annealing, run for 25 search iterations.
  • Model Evaluation: Post-classification, variable importance was extracted from the models, and features were ranked.
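The resampling scheme described above (three folds, two repeats) can be sketched as follows. This is an illustrative Python version with hypothetical names and toy labels, not the project's R implementation; stratifying by subtype is an assumption, and the oversampling step is only indicated in a comment.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=3, n_repeats=2, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), keeping class
    proportions roughly equal across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal the samples of each class round-robin across folds
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            # Oversampling (e.g. ADASYN) would normally be applied to
            # train_idx only, so synthetic samples never reach the test fold
            yield repeat, fold, train_idx, sorted(test_idx)

labels = ["B"] * 6 + ["T"] * 3   # hypothetical subtype labels
splits = list(repeated_stratified_kfold(labels))
```

With three folds and two repeats, each model configuration is evaluated on six train/test splits.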

Results

Differential Gene Expression

We identified numerous human-specific genes that characterize ALL and its various subtypes (B, T, PreB, and PreT). The differential expression patterns highlighted the involvement of these genes in critical biological processes such as immune response deregulation, differentiation, and splicing.

Enrichment Analysis

Enrichment analysis revealed significant involvement of human-specific genes in pathways related to cancer. The deregulation of immune response and cell differentiation processes was particularly prominent.

Predictive Modeling

We developed a consensus classifier that assigns previously unseen samples to specific tumor subtypes. The classifier showed robust performance:

  • Mean balanced accuracy: approximately 0.89
  • Mean F score: approximately 0.81
  • Mean kappa score: approximately 0.75
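For readers unfamiliar with these metrics, the Python sketch below shows how they can be computed from true and predicted labels. It is illustrative only: the function name and toy labels are hypothetical, and macro-averaging the F score is an assumption, since the averaging scheme is not stated here.

```python
def classification_metrics(y_true, y_pred):
    """Return (balanced_accuracy, macro_f1, cohen_kappa) for multi-class labels."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    recalls, f1_scores, expected = [], [], 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        actual = sum(1 for t in y_true if t == c)
        predicted = sum(1 for p in y_pred if p == c)
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        if actual:
            # Balanced accuracy averages recall over classes present in y_true
            recalls.append(recall)
        # Chance agreement for this class, used by Cohen's kappa
        expected += (actual / n) * (predicted / n)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    return sum(recalls) / len(recalls), sum(f1_scores) / len(f1_scores), kappa

# Toy labels: one B sample is misclassified as T
y_true = ["B", "B", "B", "B", "T", "T", "T", "T"]
y_pred = ["B", "B", "B", "T", "T", "T", "T", "T"]
bal_acc, macro_f1, kappa = classification_metrics(y_true, y_pred)
```

Balanced accuracy and kappa both account for class imbalance, which matters when subtypes are unevenly represented.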

Key Genes Identified

Among the human-specific genes with high importance in our models, we identified:

  • EBF1: A gene related to signal transduction in leukemia.
  • MYO7B: A known proto-oncogenic driver.
  • RAB6C: A member of the RAS oncogene family.

Conclusion

This study successfully highlights a set of human-specific genes associated with ALL, providing a basis for patient characterization by age and subtype. The developed classifier effectively predicts tumor subtypes in unknown samples, demonstrating high accuracy. Identifying significant human-specific genes offers potential biomarkers for ALL subtypes, aiding in targeted therapies and personalized medicine approaches.

We continue to improve and refine the existing pipeline and to expand its analytical capabilities. Stay tuned for updates (work in progress).

Repository Contents

  • data: Includes the study's expression profiles and associated metadata.
  • Code: Contains R scripts for data preprocessing, differential expression analysis, enrichment analysis, and classification.

Supplementary Information

Detailed information on the methodologies and results will be provided in a forthcoming publication (work in progress).

License

This project is licensed under the MIT License. Please take a look at the LICENSE file for more details.

Contact

For any questions or comments, please open an issue on this repository or contact the authors via email.

