SciComp8/NGSOmics_Programming

OIMCS

This repository helps both newcomers and seasoned researchers find computational biology/bioinformatics questions they can work on, along with candidate solutions. Unlike mathematics and statistics, where theorems and definitions are clearly laid out, biology and medicine often involve complex, less organized domains of knowledge. This repo offers a structured, clear overview of the computational concepts and techniques of next-generation sequencing so that you can navigate the omics landscape easily and effectively.

Last updated: 22 Jun 2025

Features

Epigenomics sequencing

Under construction

Analyze ChIP-seq data


Analyze Cut-And-Run data


Analyze bulk ATAC-seq data



Analyze single cell ATAC-seq data



DNA sequencing

Analyze whole genome sequencing data

  • Solve raw read quality control and preprocessing

    • Run fastp to remove reads with low average quality score, trim adapters, and eliminate poly-G tails in Illumina NovaSeq/NextSeq data
    • Run MultiQC to evaluate pre- and post-trimming metrics
    • Validate sample identity using genetically inferred markers (e.g., sex chromosomes, SNP fingerprinting) and file hashing to ensure data integrity
    • Check sequencing coverage (e.g., 30–50× for human genomes), read length uniformity, read quality score distribution, and GC content distribution
  • Solve alignment and variant calling

    • For human studies: adopt and adapt the GRCh38 build from the Genome Reference Consortium, accounting for ambiguous mapping
    • Run bwa-mem2 for short-read alignment and minimap2 for long-read alignment
    • Detect germline single-nucleotide polymorphisms (SNPs) and indels with the Genome Analysis Toolkit (GATK)
    • Detect structural variants (SVs) with VISTA, which optimizes the F1-score of SV calls by combining different high-performing SV callers; or run multiple SV callers (e.g., Manta, DELLY, GRIDSS), and then keep the SV calls they share
    • Detect copy number gains and losses (CNVs) with CNVkit
    • Detect splice-altering intronic variants with SpliceAI
    • Understand VCF, a common variant report file format, and MAF, which aggregates mutation information from VCF
    • Translate a VCF file from its current reference genome build to another build version: run LiftoverVcf (Picard)
    • Common variants are less likely to cause rare or highly penetrant diseases: exclude variants with allele frequency > 1% in the Genome Aggregation Database (gnomAD)
  • Solve genome annotation

    • Annotate genetic variants and assess how they affect genes and proteins, including specific amino acid changes, with GATK-compatible SnpEff
    • Look out for coding, splice-site, and ClinVar pathogenic variants
    • In a clinical setting, prioritize, summarize, and report variants using Human Phenotype Ontology (HPO) terms; for rare diseases, to improve certainty about whether a variant is pathogenic, look for variants that show phenotypic features in common with DECIPHER entries; for cancers, mine the biological consequences and therapeutic/diagnostic/prognostic implications of genetic variants with OncoKB and oncokb-annotator
  • Solve downstream analysis for biomedical insights

    • Assess the relationship between genotype and gene expression with MatrixEQTL, which fits a linear regression with an additive genotype effect or an ANOVA genotype effect
    • Test whether groups of SNPs, often linked to sets of functionally related genes, show a stronger overall association with a phenotype than would be expected by chance with INRICH
    • Infer differentially expressed genes and enriched pathways for the trait-associated SNPs with GIGSEA
  • Solve scalable and reproducible processing

    • Leverage cloud-based computing with Terra and Galaxy
    • Make the WGS analysis workflow reproducible with Python-based Snakemake
    • Run whole-genome association analysis at biobank scale (i.e., thousands of phenotypes across hundreds of thousands of samples) for both quantitative and binary traits with regenie, which is computationally and statistically efficient
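
The coverage check in the quality control step above can be reasoned about with simple arithmetic: expected mean depth is the total number of sequenced bases divided by the genome size. The sketch below is only an illustration of that formula. The function names are hypothetical, and the 3.1 Gb genome size and 150 bp read length are assumed round numbers, not values taken from any specific pipeline.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Assumed round numbers for illustration only.
GENOME_SIZE = 3_100_000_000  # approximate human genome size in bases (assumption)
READ_LENGTH = 150            # a common short-read length (assumption)

if __name__ == "__main__":
    print(mean_coverage(620_000_000, READ_LENGTH, GENOME_SIZE))
    print(reads_needed(30, READ_LENGTH, GENOME_SIZE))
```

This estimate ignores duplicates, unmapped reads, and uneven coverage, so the depth measured after alignment is usually lower than the value computed here.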

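
The gnomAD allele-frequency rule from the variant calling step above (exclude variants with allele frequency > 1%) can be sketched as a plain-text filter over annotated VCF records. This is a minimal sketch, not a replacement for a dedicated VCF tool: it assumes the VCF has already been annotated with a population frequency, and the INFO key name `gnomAD_AF` is a hypothetical placeholder, since the real key depends on the annotation tool used. The example records are synthetic.

```python
from typing import Dict, Iterable, Iterator, Optional, Union

AF_KEY = "gnomAD_AF"  # hypothetical INFO key; depends on how the VCF was annotated
MAX_AF = 0.01         # 1% allele-frequency threshold from the text


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO column into key/value pairs; bare flags map to True."""
    fields: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
        elif item and item != ".":
            fields[item] = True
    return fields


def population_af(vcf_line: str) -> Optional[float]:
    """Return the largest annotated allele frequency, or None if absent."""
    columns = vcf_line.rstrip("\n").split("\t")
    raw = parse_info(columns[7]).get(AF_KEY)
    if not isinstance(raw, str):
        return None
    # Multi-allelic sites list one value per ALT allele, separated by commas.
    values = [float(v) for v in raw.split(",") if v not in ("", ".")]
    return max(values) if values else None


def keep_rare(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and records whose frequency is at most MAX_AF.

    Records without a frequency annotation are kept, on the assumption
    that a variant absent from the database is rare.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        af = population_af(line)
        if af is None or af <= MAX_AF:
            yield line


if __name__ == "__main__":
    # Synthetic records for illustration only.
    records = [
        "##fileformat=VCFv4.2",
        "chr1\t100\t.\tA\tG\t50\tPASS\tgnomAD_AF=0.25",
        "chr1\t200\t.\tC\tT\t50\tPASS\tgnomAD_AF=0.0004",
        "chr1\t300\t.\tG\tA\t50\tPASS\tDP=30",
    ]
    for kept in keep_rare(records):
        print(kept)
```

Keeping unannotated records is a design choice; a stricter pipeline might drop or flag them instead.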

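
The additive genotype model mentioned in the downstream analysis step above regresses expression on genotype dosage (0, 1, or 2 copies of the alternate allele) and tests whether the slope differs from zero. The sketch below fits that regression for a single SNP-gene pair with ordinary least squares in plain Python. It is not the MatrixEQTL API (an R package that runs this kind of test efficiently across many pairs); the function name and the toy values are hypothetical, and covariates are omitted for brevity.

```python
import math
from typing import Sequence, Tuple


def additive_eqtl(
    genotypes: Sequence[float], expression: Sequence[float]
) -> Tuple[float, float]:
    """Fit expression = intercept + slope * genotype; return (slope, t-statistic)."""
    n = len(genotypes)
    if n != len(expression):
        raise ValueError("genotypes and expression must have the same length")
    if n < 3:
        raise ValueError("need at least 3 samples to estimate the residual variance")

    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype has no variation across samples")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))

    slope = sxy / sxx
    intercept = mean_e - slope * mean_g

    # Residual sum of squares and the standard error of the slope (n - 2 d.f.).
    rss = sum((e - intercept - slope * g) ** 2 for g, e in zip(genotypes, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, t_stat


if __name__ == "__main__":
    # Toy values for illustration only.
    dosage = [0, 0, 1, 1, 2, 2]
    expr = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
    slope, t_stat = additive_eqtl(dosage, expr)
    print(f"slope={slope:.3f}, t={t_stat:.2f}")
```

The slope is the estimated change in expression per additional copy of the alternate allele; the t-statistic would be compared against a t distribution with n - 2 degrees of freedom to obtain a p-value.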

RNA sequencing (RNA-seq)

Analyze single cell RNA-seq data



Analyze bulk RNA-seq data



Analyze other omics data

Proteomics

Metabolomics

Spatial transcriptomics

  • Profile the whole transcriptome from formalin-fixed paraffin-embedded (FFPE) tissue sections with Visium
  • Profile RNA targets at subcellular resolution with Xenium
  • Read spatial transcriptomics data with SpatialData
  • Analyze spatial DNA/RNA/protein data at subcellular, single-cell, or multicellular resolution with Giotto Suite



Conceptual lens



Citation

Mary Piper, Meeta Mistry, Jihe Liu, William Gammerdinger, & Radhika Khetani. (2022, January 6). hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5826256.