OIMCS

This repository helps newcomers and seasoned researchers alike find computational biology and bioinformatics questions they can work on, along with candidate solutions. Unlike mathematics and statistics, where theorems and definitions are clearly laid out in well-structured frameworks, biology and medicine often involve complex, less organized domains of knowledge. This repo sets itself apart by offering a structured, clear overview of the computational concepts and techniques of next-generation sequencing, so you can navigate the omics landscape easily and effectively.

Last updated: 22 Jun 2025

Features

Single cell RNA-seq analysis
Bulk RNA-seq analysis
Single cell ATAC-seq analysis
Bulk ATAC-seq analysis
Proteomics | metabolomics | spatial transcriptomics analysis

Epigenomics sequencing

Under construction

Analyze ChIP-seq data

🔝

Analyze Cut-And-Run data

🔝

Analyze bulk ATAC-seq data

Introduction to transcription regulation
Practical guide
Run ENCODE ATAC-seq pipeline to perform alignment, quality assurance, peak calling, and signal track generation
If we want to inspect every step in each analytical phase, or leverage advanced/unique features of other tools that the current pipeline ignores, ...
- For alignment and post-alignment phases, we can ...
  - Use Rsubread or Rbowtie2 to align the fastq files relative to hg19/hg38/hs1
  - Use GenomicAlignments and GenomicRanges to perform post-alignment processing including reading properly paired reads, estimating MapQ scores/insert sizes, reconstructing the full-length fragment, and others
  - Use ATACseqQC to perform comprehensive ATAC-seq quality assurance
- For TSS analysis phase, we can ...
  - Use soGGi to assess the transcriptional start site signal in the nucleosome-free open region
- For peak calling phase, we can ...
  - Use MACS2 and ChIPQC to call peaks in the nucleosome-free open region, and perform quality assurance
  - Or use Genrich to call peaks in the nucleosome-free open region
  - Or use MACS3/MACSr (R wrapper of MACS3) to call peaks in the nucleosome-free open region
  - Use ChIPseeker to annotate peak regions with genomic features
- For functional analysis phase, we can ...
  - Use rGREAT to functionally interpret the peak regions based on the GO database
  - Use GenomicRanges and GenomicAlignments to select and count non-redundant peaks
  - Use DESeq2/DESeq2-based DiffBind and ChIPseeker to analyze differences in peaks with gene annotations across conditions
  - Use clusterProfiler to perform enrichment analysis of differential peak regions
  - However, functional insights gained by peak annotations can hardly illustrate what key regulators shape the transcription mechanism.
- To further infer transcription factors acting in peak regions, we can ...
  - Use MotifDb/JASPAR2022 and seqLogo/ [recommend] ggseqlogo to search and visualize motifs
  - Use motifmatchr (R wrapper of MOODS) to map peaks to motifs, DNA sequences preferred by transcription factors
  - Use chromVAR to analyze differences in motifs across conditions
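
The post-alignment steps above (reconstructing full-length fragments and isolating the nucleosome-free open region) can be sketched in a few lines. This is a minimal illustration, not how ATACseqQC or GenomicAlignments implement it: the function names are hypothetical, the +4/-5 bp Tn5 offset is the customary adjustment, and the length cutoffs (under 100 bp for nucleosome-free, 180 to 247 bp for mono-nucleosome) are commonly used conventions that should be tuned to the fragment size distribution of your own data.

```python
from collections import Counter

# Conventional cutoffs (assumptions, tune per dataset).
NUCLEOSOME_FREE_MAX = 100          # fragments shorter than this: nucleosome-free
MONO_NUCLEOSOME_RANGE = (180, 247)  # inclusive range: mono-nucleosome


def shift_tn5(start, end):
    """Apply the customary Tn5 offset: +4 bp at the fragment start, -5 bp at its end."""
    return start + 4, end - 5


def classify_fragment(length):
    """Label a fragment by its length in base pairs."""
    if length < NUCLEOSOME_FREE_MAX:
        return "nucleosome_free"
    if MONO_NUCLEOSOME_RANGE[0] <= length <= MONO_NUCLEOSOME_RANGE[1]:
        return "mono_nucleosome"
    return "other"


def summarize_fragments(fragments):
    """Count fragment classes for (start, end) pairs in 0-based, half-open coordinates."""
    counts = Counter()
    for start, end in fragments:
        shifted_start, shifted_end = shift_tn5(start, end)
        counts[classify_fragment(shifted_end - shifted_start)] += 1
    return dict(counts)


# Hypothetical fragments reconstructed from properly paired reads.
example_fragments = [(1000, 1080), (5000, 5200), (9000, 9400), (200, 260)]
fragment_summary = summarize_fragments(example_fragments)
print(fragment_summary)
```

Only the fragments labelled `nucleosome_free` would be passed on to TSS analysis and peak calling in the workflow above.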

🔝

Analyze single cell ATAC-seq data

Profile distinct open chromatin regions across the genome at single-cell resolution with Epi ATAC
Epigenomic and transcriptomic signatures of aging and cancer at single-cell resolution
Technical Q & A
Correct batch effect
Transfer cell type labels from single-cell RNA-seq data to separately collected single-cell ATAC-seq data
Find DNA motifs linked to differences in single-cell or bulk chromatin accessibility with chromVAR
Profile somatic mutations with epigenetic alterations at single-cell resolution with GoT–ChA

🔝

DNA sequencing

Analyze whole genome sequencing data

Solve raw read quality control and preprocessing
- Run fastp to remove reads with low average quality score, trim adapters, and eliminate poly-G tails in Illumina NovaSeq/NextSeq data
- Run MultiQC to evaluate pre- and post-trimming metrics
- Validate sample identity using genetically inferred markers (e.g., sex chromosomes, SNP fingerprinting) and file hashing to ensure data integrity
- Check sequencing coverage (e.g., 30–50× for human genomes)/read length uniformity/read quality score distribution/GC content distribution
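
As a rough sanity check for the coverage target above, mean depth can be estimated as total sequenced bases divided by genome size. The sketch below assumes a haploid human genome of about 3.1 Gb and hypothetical function names; it ignores duplicates, unmapped reads, and uneven coverage, so treat the result as an upper bound on usable depth.

```python
# Assumed haploid human genome size in base pairs (roughly 3.1 Gb); adjust per reference.
HUMAN_GENOME_SIZE = 3_100_000_000


def mean_coverage(n_reads, read_length, genome_size=HUMAN_GENOME_SIZE):
    """Expected mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size=HUMAN_GENOME_SIZE):
    """Smallest whole number of reads whose total bases reach the target depth."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


# How many 150 bp reads are needed for 30x, and what depth do they give back?
n_reads_30x = reads_needed(30, 150)
depth_check = mean_coverage(n_reads_30x, 150)
print(n_reads_30x, depth_check)
```

For paired-end data, each read of a pair counts separately, so the number of read pairs is half the number returned by `reads_needed`.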
Solve alignment and variant calling
- For human studies: adopt the GRCh38 build from the Genome Reference Consortium, adapting it as needed to account for ambiguous mapping
- Run bwa-mem2 for short-read alignment and minimap2 for long-read alignment
- Detect single-nucleotide polymorphism (SNP)/indel with Genome Analysis Toolkit (GATK) for germline DNA
- Detect structural variant (SV) with VISTA, which optimizes the F1-score of SV calls by combining different high-performing SV callers; or run multiple SV callers (e.g., Manta, DELLY, GRIDSS), and then infer shared SV calls
- Detect copy number gains and losses (CNVs) with CNVkit
- Detect splice-altering intronic variants with SpliceAI
- Understand VCF, a common variant report file format, and MAF, which aggregates mutation information from VCF
- Translate a VCF file from its current reference genome build to another build version: run LiftoverVcf (Picard)
- Common variants are less likely to cause rare or highly penetrant diseases: exclude variants with allele frequency > 1% in Genome Aggregation Database (gnomAD)
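
The gnomAD frequency filter above can be sketched on plain VCF text lines. This is a minimal sketch under stated assumptions: the INFO key `gnomAD_AF` is hypothetical (the actual key depends on how the VCF was annotated), variants with no gnomAD frequency are kept because absence usually means the variant is rare, and a multi-allelic record is kept if any of its alleles is rare. In practice a dedicated tool such as bcftools is a better fit for large files.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=40;AF=0.02' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        elif entry:
            info[entry] = True  # flag without a value
    return info


def allele_frequencies(raw):
    """Parse a comma-separated frequency value, skipping missing entries like '.'."""
    if not isinstance(raw, str):
        return []
    values = []
    for token in raw.split(","):
        try:
            values.append(float(token))
        except ValueError:
            pass
    return values


def keep_rare_variants(vcf_lines, af_key="gnomAD_AF", max_af=0.01):
    """Keep header lines and records whose population frequency is at most max_af."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        freqs = allele_frequencies(parse_info(fields[7]).get(af_key))
        if not freqs or min(freqs) <= max_af:
            kept.append(line)
    return kept


# Hypothetical records: one common variant, one rare variant, one without annotation.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=40;gnomAD_AF=0.25",
    "chr1\t2000\t.\tC\tT\t60\tPASS\tDP=35;gnomAD_AF=0.0004",
    "chr2\t3000\t.\tG\tA\t55\tPASS\tDP=38",
]
rare_records = keep_rare_variants(example_vcf)
print(len(rare_records))
```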
Solve genome annotation
- Annotate and assess how genetic variants affect genes and proteins, including specific changes to amino acids with GATK-compatible SnpEff
- Look out for coding, splice-site, and ClinVar pathogenic variants
- In a clinical setting, prioritize summary and report of variants using Human Phenotype Ontology (HPO) terms; for rare diseases, to improve certainty about whether a variant is pathogenic, look for the one showing phenotypic features in common with DECIPHER; for cancers, mine the biological consequences and therapeutic/diagnostic/prognostic implications of genetic variants with OncoKB and oncokb-annotator
Solve downstream analysis for biomedical insights
- Assess the relationship between genotype and gene expression with MatrixEQTL, which fits a linear model with either an additive or an ANOVA genotype effect
- Use INRICH to test whether groups of SNPs, often linked to sets of functionally related genes, show a stronger overall association with a phenotype than expected by chance
- Infer differentially expressed genes and enriched pathways for the trait-associated SNPs with GIGSEA
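
To make the additive genotype effect mentioned above concrete: genotypes are coded as 0, 1, or 2 copies of the alternate allele, and expression is regressed on that count. The sketch below is a plain ordinary least squares fit for a single SNP-gene pair with a hypothetical function name. It is not MatrixEQTL's implementation, which handles covariates, test statistics, and millions of pairs efficiently.

```python
def additive_eqtl_fit(genotypes, expression):
    """Least-squares slope and intercept of expression on allele dosage (0/1/2)."""
    if len(genotypes) != len(expression) or len(genotypes) < 2:
        raise ValueError("need two equal-length vectors with at least two samples")
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype has no variation across samples")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx                      # expression change per extra allele copy
    intercept = mean_e - slope * mean_g    # expected expression at dosage 0
    return slope, intercept


# Hypothetical data: six samples, two per genotype class.
dosage = [0, 0, 1, 1, 2, 2]
expr = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, intercept = additive_eqtl_fit(dosage, expr)
print(slope, intercept)
```

The slope is the estimated effect size per allele; a significance test on that slope is what an eQTL tool reports alongside it.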
Solve scalable and reproducible processing
- Leverage cloud-based computing with Terra and Galaxy
- Make WGS analysis workflow reproducible with Python-based Snakemake
- Do whole‑genome association analysis at biobank scale (i.e., thousands of phenotypes across hundreds of thousands of samples) for both quantitative and binary traits, while being computationally and statistically efficient with regenie

🔝

RNA sequencing (RNA-seq)

When was the term 'RNA-seq' first coined? At what point did researchers start RNA-seq before the term 'RNA-seq' was formally introduced?
Besides its common use in understanding differences in gene expression, isoforms, and splicing patterns across tissues or patient groups of interest, what other insights can RNA-seq technology uncover?

Analyze single cell RNA-seq data

Introduction to single cell RNA-seq
Profile single cell transcriptome with Chromium Single Cell Universal 3' Gene Expression
A mini scRNA-seq pipeline | Why apply pipelines?
If given raw bcl files, we convert them to fastq files
As inputs are fastq files, we can ...
- Run FastQC to evaluate sequence quality and content
- Use Trim Galore to trim reads if we spot unexpected low-quality base calls/adaptor contamination
- Re-run FastQC to re-evaluate sequence quality and content
- If single-cell RNA-seq data is generated from the plate-based protocol, we can ...
  - Use STAR to perform alignment and FeatureCounts to generate the count matrix
- Else if single-cell RNA-seq data is generated from the droplet-based protocol, we can ...
  - Use kb-python package to perform pseudo sequence alignment and generate the count matrix
  - Use Cell Ranger pipelines to perform sequence alignment and generate the count matrix
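
For droplet-based data, the count matrix produced in the step above is, conceptually, the number of distinct UMIs observed per cell barcode and gene. The sketch below illustrates only that idea with exact-match deduplication and a hypothetical function name. It is not how kb-python or Cell Ranger work internally; those tools also correct sequencing errors in barcodes and UMIs and filter out empty droplets.

```python
from collections import defaultdict


def count_umis(assignments):
    """Count distinct UMIs per (cell barcode, gene).

    assignments: iterable of (cell_barcode, umi, gene) tuples, one per mapped read.
    Returns a nested dict: barcode -> gene -> UMI count.
    """
    umis_seen = defaultdict(set)
    for barcode, umi, gene in assignments:
        umis_seen[(barcode, gene)].add(umi)  # PCR duplicates collapse here

    counts = defaultdict(dict)
    for (barcode, gene), umis in umis_seen.items():
        counts[barcode][gene] = len(umis)
    return dict(counts)


# Hypothetical read assignments; the first two reads are PCR duplicates.
example_reads = [
    ("AAAC", "UMI1", "GAPDH"),
    ("AAAC", "UMI1", "GAPDH"),
    ("AAAC", "UMI2", "GAPDH"),
    ("AAAC", "UMI3", "ACTB"),
    ("TTTG", "UMI1", "GAPDH"),
]
umi_counts = count_umis(example_reads)
print(umi_counts)
```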
After having the feature-barcode matrices at hand, we can ...
- Use Scanpy workflow to perform quality assurance, cell clustering, marker gene detection for cell identities, and trajectory inference
- Use Seurat workflow to perform quality assurance, cell clustering, and marker gene detection for cell identities case 1 | case 2
  - If we observe factor-specific clustering and want cells of the same cell type to cluster together across one or more confounding factors, we can use canonical correlation analysis or Harmony (suitable for complicated confounding effects) to integrate cells
  - We can leverage SingleR or ScType to partially or fully automate cell-type identification
    - Other options of automating cell-type identification by mapping to references and then transferring labels: scArches, Symphony
- Use Bioconductor packages to perform single cell RNA-Seq data analysis
- Generate pseudobulk, which aggregates the gene expression levels specific to each cell type within an individual
- Perform pseudobulk-based differential gene expression analysis in edgeR or DESeq2
- Use bulk RNAseq-based pathway analysis tools (e.g., clusterProfiler, GSEA, GSVA) or single cell RNAseq-based Pagoda2 to evaluate if a predefined set of genes shows statistically significant and consistent variations between biological conditions
- Run scGen to model perturbation responses; for heterogeneous perturbed cell population, run CellOT
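
The pseudobulk step above amounts to summing raw counts over all cells that share an individual and a cell type. Below is a minimal sketch of that aggregation, assuming each cell is a small dictionary with hypothetical keys `sample`, `cell_type`, and `counts`; with real data this would typically be done on a sparse matrix in Scanpy, Seurat, or Bioconductor rather than on Python dictionaries.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw gene counts per (sample, cell_type) group.

    cells: iterable of dicts with keys 'sample', 'cell_type', and
    'counts' (a gene -> raw count mapping for that cell).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        group = (cell["sample"], cell["cell_type"])
        for gene, count in cell["counts"].items():
            totals[group][gene] += count
    return {group: dict(genes) for group, genes in totals.items()}


# Hypothetical cells from two donors.
example_cells = [
    {"sample": "donor1", "cell_type": "T cell", "counts": {"CD3E": 5, "MS4A1": 0}},
    {"sample": "donor1", "cell_type": "T cell", "counts": {"CD3E": 3, "MS4A1": 1}},
    {"sample": "donor1", "cell_type": "B cell", "counts": {"CD3E": 0, "MS4A1": 7}},
    {"sample": "donor2", "cell_type": "T cell", "counts": {"CD3E": 4}},
]
pseudobulk_counts = pseudobulk(example_cells)
print(pseudobulk_counts)
```

Each resulting group then plays the role of one bulk sample in edgeR or DESeq2, which is why raw counts are summed rather than normalized values.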
Review 2025 single cell genomics day
Learn frontier single-cell RNA-seq analytical progress
Identify single-cell eQTL
Define spatial architecture in single cell data | spacexr
Capture gene expression and chromatin accessibility together in a single cell
Experiment with your data analysis process using COVID-19 RNA-seq data resources

🔝

Analyze bulk RNA-seq data

Can we obtain cell-type-specific gene expression information without using single-cell or single-nucleus RNA-seq, as they are costly in clinical research?
How do RNA molecules get prepared and sequenced using Illumina technology?
- What types of library preparation kits are offered to prepare a complementary DNA (cDNA) library for sequencing?
  - Prior to cDNA library preparation, has enough RNA been extracted? How degraded is the extracted RNA? Here is a more thorough guide on checking RNA integrity
- What types of sequencers are offered to sequence the stable double-stranded cDNA?
How accurate are the transcripts measured by the sequencer?
- Run FastQC or fastp to evaluate sequence quality and content
Which genome regions are transcribed? What are the exact genomic coordinates (of the reference genome) our sequencing reads come from?
- There are many ways to locate reads by aligning them to the reference genome; choose a tool that suits your own study design. The splice-aware genome aligner STAR is strongly recommended.
  - Other splice-aware alignment tool options include Olego, HISAT2, MapSplice, ABMapper, Passion, BLAT, RUM ...
  - Other alignment tools disregarding isoforms include BWA, Bowtie2 ...
- Use Rsubread to align the reads
Use Qualimap to perform quality assurance on the aligned reads
Use MultiQC to harmonize all QC and alignment metadata from FastQC, STAR, Qualimap, and other tools
Use GenomicAlignments for aligned reads to obtain the gene-level or exon-level quantification
Use featureCounts for aligned reads to count the fragments
[Recommend] Use Salmon for unaligned reads to obtain the transcript-level quantification
- Why use unaligned reads? To speed up read counting
- Next step: Use tximport to aggregate transcript-level quantification to the gene level
Perform differential gene expression analysis
Perform principal component analysis, heatmap, and clustering
Perform gene set enrichment analysis
Experiment with your data analysis process using COVID-19 RNA-seq data resources
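
For the Salmon and tximport step above, the core of gene-level summarization is summing transcript-level estimates that map to the same gene. The sketch below shows only that summation, with hypothetical transcript and gene identifiers and a hypothetical function name. It is not tximport itself, which additionally carries transcript length information forward for downstream normalization.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level totals.

    transcript_counts: transcript ID -> estimated count (may be fractional).
    tx2gene: transcript ID -> gene ID. Transcripts missing from it are skipped.
    """
    gene_counts = defaultdict(float)
    skipped = []
    for transcript, count in transcript_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            skipped.append(transcript)
            continue
        gene_counts[gene] += count
    return dict(gene_counts), skipped


# Hypothetical quantification for four transcripts, one without a gene mapping.
example_tx_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 2.0}
example_tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_level, unmapped = summarize_to_gene(example_tx_counts, example_tx2gene)
print(gene_level, unmapped)
```

Reporting the skipped transcripts makes a mismatch between the quantification index and the annotation visible instead of silently dropping counts.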

🔝

Analyze other omics data

Proteomics

A quick start from loading an online spectrum, performing peak quality control, annotating peaks, to visualizing the annotated peaks

Metabolomics

MetaboAnalystR

Spatial transcriptomics

Profile the whole transcriptome from Formalin-Fixed Paraffin-Embedded tissue sections with Visium
Profile subcellular-level RNA targets with Xenium
Read spatial transcriptomics data with SpatialData
Analyze spatial DNA/RNA/protein data in subcellular/single cell/multiple cells with Giotto Suite

🔝

Conceptual lens

🔝

Citation

Mary Piper, Meeta Mistry, Jihe Liu, William Gammerdinger, & Radhika Khetani. (2022, January 6). hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5826256.

