This repository helps both newcomers and seasoned researchers find computational biology/bioinformatics questions to work on, along with candidate solutions. Unlike the well-structured frameworks of mathematics and statistics, where theorems and definitions are clearly laid out, biology and medicine often involve complex, less organized domains of knowledge. This repo offers a structured, clear overview of the computational concepts and techniques of next-generation sequencing, so that you can navigate the omics landscape easily and effectively.
Last updated: 22 Jun 2025
- Single cell RNA-seq analysis
- Bulk RNA-seq analysis
- Single cell ATAC-seq analysis
- Bulk ATAC-seq analysis
- Proteomics | metabolomics | spatial transcriptomics analysis
Under construction
- Introduction to transcription regulation
- Practical guide
- Run ENCODE ATAC-seq pipeline to perform alignment, quality assurance, peak calling, and signal track generation
- If we're interested in inspecting every step in each analytical phase, or even leveraging advanced/unique features of other tools that the current pipeline ignores, ...
- For alignment and post-alignment phases, we can ...
- Use Rsubread or Rbowtie2 to align the fastq files relative to hg19/hg38/hs1
- Use GenomicAlignments and GenomicRanges to perform post-alignment processing including reading properly paired reads, estimating MapQ scores/insert sizes, reconstructing the full-length fragment, and others
- Use ATACseqQC to perform comprehensive ATAC-seq quality assurance
- For TSS analysis phase, we can ...
- Use soGGi to assess the transcriptional start site signal in the nucleosome-free open region
- For peak calling phase, we can ...
- Use MACS2 and ChIPQC to call peaks in the nucleosome-free open region, and perform quality assurance
- Or use Genrich to call peaks in the nucleosome-free open region
- Or use MACS3/MACSr (R wrapper of MACS3) to call peaks in the nucleosome-free open region
- Use ChIPseeker to annotate peak regions with genomic features
- For functional analysis phase, we can ...
- Use rGREAT to functionally interpret the peak regions based on the GO database
- Use GenomicRanges and GenomicAlignments to select and count non-redundant peaks
- Use DESeq2/DESeq2-based DiffBind and ChIPseeker to analyze differences in peaks with gene annotations across conditions
- Use clusterProfiler to perform enrichment analysis of differential peak regions
- However, functional insights gained by peak annotations can hardly illustrate what key regulators shape the transcription mechanism.
- To further infer transcription factors acting in peak regions, we can ...
- Use MotifDb/JASPAR2022 and seqLogo/ggseqlogo (recommended) to search and visualize motifs
- Use motifmatchr (R wrapper of MOODS) to map peaks to motifs, DNA sequences preferred by transcription factors
- Use chromVAR to analyze differences in motifs across conditions
- For alignment and post-alignment phases, we can ...
- Profile distinct open chromatin regions across the genome at single-cell resolution with Epi ATAC
- Epigenomic and transcriptomic signatures of aging and cancer at single-cell resolution
- Technical Q & A
- Correct batch effect
- Transfer cell type labels from single-cell RNA-seq data to separately collected single-cell ATAC-seq data
- Find DNA motifs linked to differences in single-cell or bulk chromatin accessibility with chromVAR
- Profile somatic mutations with epigenetic alterations at single-cell resolution with GoT-ChA
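
The motif-mapping step mentioned above (matching peak sequences to the DNA sequences preferred by transcription factors) can be illustrated with a position weight matrix scan. This is a minimal sketch, not how motifmatchr or MOODS is implemented: the function names, the pseudocount of 0.5, and the uniform background of 0.25 are all assumptions chosen for illustration, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def pwm_from_counts(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    counts maps each base to a list of counts, one per motif position.
    The pseudocount and uniform background are illustrative assumptions.
    """
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in BASES) + 4 * pseudocount
        column = {
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in BASES
        }
        pwm.append(column)
    return pwm

def scan(sequence, pwm):
    """Score every forward-strand window; windows with non-ACGT bases are skipped."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue
        score = sum(column[base] for column, base in zip(pwm, window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

# Toy motif "TGA" built from made-up counts, scanned over a made-up peak sequence.
toy_counts = {"A": [0, 0, 10], "C": [0, 0, 0], "G": [0, 10, 0], "T": [10, 0, 0]}
toy_pwm = pwm_from_counts(toy_counts)
start, score = best_hit("CCTGACC", toy_pwm)
print(start, round(score, 2))
```

In practice a score threshold (or a p-value cutoff, as real motif scanners use) decides which windows count as matches, and the reverse complement is scanned as well.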
- Solve raw read quality control and preprocessing
- Run fastp to remove reads with low average quality score, trim adapters, and eliminate poly-G tails in Illumina NovaSeq/NextSeq data
- Run MultiQC to evaluate pre- and post-trimming metrics
- Validate sample identity using genetically inferred markers (e.g., sex chromosomes, SNP fingerprinting) and file hashing to ensure data integrity
- Check sequencing coverage (e.g., 30–50× for human genomes), read length uniformity, read quality score distribution, and GC content distribution
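
The quality-control checks above come down to a few simple quantities. As a minimal sketch, assuming Phred+33 quality encoding and using hypothetical function names, per-read mean quality, GC fraction, and expected coverage can be computed as follows. fastp and MultiQC report these metrics directly; the code only illustrates what the numbers mean.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def mean_quality(quality_string):
    """Arithmetic mean of the per-base Phred scores of one read."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)

def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C; N bases are ignored."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)

def expected_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

# Illustrative numbers only: 600 million 150 bp reads over a 3 Gb genome.
print(mean_quality("IIII"))
print(gc_fraction("GGCCAATT"))
print(expected_coverage(600_000_000, 150, 3_000_000_000))
```

The coverage formula gives an upper bound on usable depth, since duplicate, unmapped, and trimmed reads reduce the depth actually observed after alignment.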
- Solve alignment and variant calling
- For human study: adopt and adapt GRCh38 build by Genome Reference Consortium, considering ambiguous mapping
- Run bwa-mem2 for short-read alignment and Minimap2 for long-read alignment
- Detect single-nucleotide polymorphism (SNP)/indel with Genome Analysis Toolkit (GATK) for germline DNA
- Detect structural variant (SV) with VISTA, which optimizes the F1-score of SV calls by combining different high-performing SV callers; or run multiple SV callers (e.g., Manta, DELLY, GRIDSS), and then infer shared SV calls
- Detect copy number gains and losses (CNVs) with CNVkit
- Detect splice-altering intronic variants with SpliceAI
- Understand VCF, a common variant report file format, and MAF, which aggregates mutation information from VCF
- Translate a VCF file from its current reference genome build to another build version: run LiftoverVcf (Picard)
- Common variants are less likely to cause rare or highly penetrant diseases: exclude variants with allele frequency > 1% in Genome Aggregation Database (gnomAD)
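
The allele-frequency filter above can be sketched on an annotated VCF. This is a minimal illustration under stated assumptions: the INFO key `gnomAD_AF` is hypothetical (the actual key depends on the annotation tool that added gnomAD frequencies), variants absent from the frequency resource are kept, and a multi-allelic record is kept if any alternate allele is rare.

```python
def parse_info(info_field):
    """Turn an INFO string such as 'DP=35;DB' into {'DP': '35', 'DB': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags carry no value
    return info

def is_rare(vcf_line, af_key="gnomAD_AF", max_af=0.01):
    """True if the record passes the filter (allele frequency <= max_af).

    af_key is a hypothetical INFO key; adjust it to your annotation.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    raw = parse_info(fields[7]).get(af_key)  # column 8 of a VCF is INFO
    if raw is None or raw is True:
        return True  # not observed in the frequency resource: keep
    frequencies = [float(x) for x in raw.split(",") if x != "."]
    if not frequencies:
        return True
    return any(af <= max_af for af in frequencies)

def filter_rare(lines, af_key="gnomAD_AF", max_af=0.01):
    """Yield header lines unchanged and only the records that pass is_rare."""
    for line in lines:
        if line.startswith("#") or is_rare(line, af_key, max_af):
            yield line

# Made-up records for illustration.
example = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\tPASS\tgnomAD_AF=0.25;DP=30",
    "chr1\t200\t.\tC\tT\t50\tPASS\tgnomAD_AF=0.0004;DP=28",
    "chr1\t300\t.\tG\tA\t50\tPASS\tDP=31",
]
kept = list(filter_rare(example))
print(len(kept))
```

For production use, a dedicated tool such as bcftools is a safer choice than hand-parsing, since it handles header definitions and per-allele fields.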
- Solve genome annotation
- Annotate and assess how genetic variants affect genes and proteins, including specific changes to amino acids with GATK-compatible SnpEff
- Look out for coding, splice-site, and ClinVar pathogenic variants
- In a clinical setting, prioritize summary and report of variants using Human Phenotype Ontology (HPO) terms; for rare diseases, to improve certainty about whether a variant is pathogenic, look for the one showing phenotypic features in common with DECIPHER; for cancers, mine the biological consequences and therapeutic/diagnostic/prognostic implications of genetic variants with OncoKB and oncokb-annotator
- Solve downstream analysis for biomedical insights
- Assess the relationship between genotype and gene expression with MatrixEQTL, which fits a linear regression with an additive genotype effect or an ANOVA genotype effect
- Test whether groups of SNPs, often linked to sets of functionally related genes, show a stronger overall association with a phenotype than would be expected by chance with INRICH
- Infer differentially expressed genes and enriched pathways for the trait-associated SNPs with GIGSEA
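
The additive genotype model mentioned above regresses expression on allele dosage (0, 1, or 2 copies of the alternate allele). The following is a minimal sketch of that single-SNP, single-gene regression in plain Python, with a hypothetical function name and made-up data. It omits covariates and p-values, and it is not MatrixEQTL's implementation, which tests many SNP-gene pairs at once with matrix operations.

```python
import math

def additive_eqtl(dosages, expression):
    """Ordinary least squares fit of expression = intercept + slope * dosage.

    Returns (slope, intercept, t_statistic) for the slope.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matching inputs with at least 3 samples")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("genotype does not vary across samples")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    # Residual sum of squares and standard error of the slope (n - 2 degrees of freedom).
    rss = sum((e - (intercept + slope * g)) ** 2 for g, e in zip(dosages, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        t_stat = math.copysign(math.inf, slope) if slope != 0 else 0.0
    else:
        t_stat = slope / std_err
    return slope, intercept, t_stat

# Made-up data: six samples, expression rising with allele dosage.
dosages = [0, 1, 2, 0, 1, 2]
expression = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
slope, intercept, t_stat = additive_eqtl(dosages, expression)
print(round(slope, 3), round(intercept, 3))
```

The slope is the estimated change in expression per additional copy of the allele; the ANOVA model instead treats the three genotype classes as separate groups.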
- Solve scalable and reproducible processing
- Leverage cloud-based computing with Terra and Galaxy
- Make WGS analysis workflow reproducible with Python-based Snakemake
- Do whole-genome association analysis at biobank scale (i.e., thousands of phenotypes across hundreds of thousands of samples) for both quantitative and binary traits, while being computationally and statistically efficient with regenie
- When was the term 'RNA-seq' first coined? At what point did researchers start RNA-seq before the term 'RNA-seq' was formally introduced?
- Besides its common use in understanding gene expression differences, and isoform and splicing patterns across tissues or patient groups of interest, what other insights can RNA-seq technology uncover?
- Introduction to single cell RNA-seq
- Profile single cell transcriptome with Chromium Single Cell Universal 3' Gene Expression
- A mini scRNA-seq pipeline | Why apply pipelines?
- If given raw `bcl` files, we convert them to fastq files
- As inputs are `fastq` files, we can ...
- Run FastQC to evaluate sequence quality and content
- Use Trim Galore to trim reads if we spot unexpected low-quality base calls/adaptor contamination
- Re-run FastQC to re-evaluate sequence quality and content
- If single-cell RNA-seq data is generated from the plate-based protocol, we can ...
- Use STAR to perform alignment and FeatureCounts to generate the count matrix
- Else if single-cell RNA-seq data is generated from the droplet-based protocol, we can ...
- Use kb-python package to perform pseudo sequence alignment and generate the count matrix
- Use Cell Ranger pipelines to perform sequence alignment and generate the count matrix
- After having the `feature-barcode matrices` at hand, we can ...
- Use Scanpy workflow to perform quality assurance, cell clustering, marker gene detection for cell identities, and trajectory inference
- Use Seurat workflow to perform quality assurance, cell clustering, and marker gene detection for cell identities case 1 | case 2
- If we observe factor-specific clustering and want cells of the same cell type to cluster together across one or more confounding factors, we can use canonical correlation analysis or Harmony (suitable for complicated confounding effects) to integrate cells
- We can leverage SingleR or ScType to partially or fully automate cell-type identification
- Other options for automating cell-type identification by mapping to references and then transferring labels: scArches, Symphony
- Use Bioconductor packages to perform single cell RNA-Seq data analysis
- Generate pseudobulk, which aggregates the gene expression levels specific to each cell type within an individual
- Perform pseudobulk-based differential gene expression analysis in edgeR or DESeq2
- Use bulk RNAseq-based pathway analysis tools (e.g., clusterProfiler, GSEA, GSVA) or single cell RNAseq-based Pagoda2 to evaluate if a predefined set of genes shows statistically significant and consistent variations between biological conditions
- Run scGen to model perturbation responses; for heterogeneous perturbed cell population, run CellOT
- Review 2025 single cell genomics day
- Learn frontier single-cell RNA-seq analytical progress
- Identify single-cell eQTL
- Define spatial architecture in single cell data | spacexr
- Capture gene expression and chromatin accessibility together in a single cell
- Experiment with your data analysis process using COVID-19 RNA-seq data resources
- Can we obtain cell-type-specific gene expression information without using single-cell or single-nucleus RNA-seq, as they are costly in clinical research?
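
The pseudobulk step in the list above aggregates expression per cell type within each individual before differential expression testing. As a minimal sketch, assuming a hypothetical input of `(sample_id, cell_type, counts)` records with made-up values, the aggregation is a grouped sum of raw counts:

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts over all cells sharing the same (sample, cell type).

    cells is an iterable of (sample_id, cell_type, {gene: count}) records;
    this record layout is an assumption made for illustration.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(sample_id, cell_type)][gene] += count
    return {group: dict(genes) for group, genes in totals.items()}

# Made-up cells from two donors.
cells = [
    ("donor1", "T cell", {"CD3E": 5, "MS4A1": 0}),
    ("donor1", "T cell", {"CD3E": 3, "MS4A1": 1}),
    ("donor1", "B cell", {"CD3E": 0, "MS4A1": 7}),
    ("donor2", "T cell", {"CD3E": 4}),
]
profiles = pseudobulk(cells)
print(profiles[("donor1", "T cell")])
```

Summing raw counts, rather than averaging normalized values, keeps the data in the form that count-based tools such as edgeR and DESeq2 expect, with each individual acting as one replicate per cell type.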
- How do RNA molecules get prepared and sequenced using Illumina technology?
- What types of library preparation kits are offered to prepare a complementary DNA (cDNA) library for sequencing?
- Prior to cDNA library preparation, has enough RNA been extracted? How degraded is the extracted RNA? Here is a more thorough guide on checking RNA integrity
- What types of sequencers are offered to sequence the stable double-stranded cDNA?
- How accurate are the transcripts measured by the sequencer?
- Which genome regions are transcribed? What are the exact genomic coordinates (of the reference genome) our sequencing reads come from?
- There are many ways to locate reads by aligning them to the reference genome, so choose the tool that best suits your own scientific design. The splice-aware genome aligner STAR is strongly recommended.
- Other splice-aware alignment tool options include Olego, HISAT2, MapSplice, ABMapper, Passion, BLAT, RUM ...
- Other alignment tools disregarding isoforms include BWA, Bowtie2 ...
- Use Rsubread to align the reads
- Use Qualimap to perform quality assurance on the aligned reads
- Use MultiQC to harmonize all QC and alignment metadata from FastQC, STAR, Qualimap, and other tools
- Use GenomicAlignments for aligned reads to obtain the gene-level or exon-level quantification
- Use featureCounts for aligned reads to count the fragments
- [Recommended] Use Salmon for unaligned reads to obtain the transcript-level quantification
- Why skip alignment? To speed up the counting process of reads
- Next step: Use tximport to aggregate transcript-level quantification to the gene level
- Perform differential gene expression analysis
- Perform principal component analysis, heatmap, and clustering
- Perform gene set enrichment analysis
- Experiment with your data analysis process using COVID-19 RNA-seq data resources
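
The step above that aggregates transcript-level quantification to the gene level can be pictured as a grouped sum over a transcript-to-gene map. This is a minimal sketch with hypothetical identifiers and a hypothetical function name. tximport itself does more than this (for example, it can also track average transcript length per gene for use as an offset), so treat the code only as an illustration of the core idea.

```python
def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level counts.

    tx_counts maps transcript ID -> estimated count.
    tx2gene maps transcript ID -> gene ID.
    Transcripts missing from tx2gene are returned separately, not dropped silently.
    """
    gene_counts = {}
    unmapped = []
    for transcript, count in tx_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            unmapped.append(transcript)
            continue
        gene_counts[gene] = gene_counts.get(gene, 0.0) + count
    return gene_counts, unmapped

# Hypothetical identifiers and counts for illustration.
tx_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 2.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_counts, unmapped = summarize_to_gene(tx_counts, tx2gene)
print(gene_counts, unmapped)
```

Reporting unmapped transcripts is a deliberate choice here: a mismatch between the annotation used for quantification and the one used for the map is a common source of missing genes.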
- A quick start from loading an online spectrum, performing peak quality control, annotating peaks, to visualizing the annotated peaks
- Profile the whole transcriptome from Formalin-Fixed Paraffin-Embedded tissue sections with Visium
- Profile subcellular-level RNA targets with Xenium
- Read spatial transcriptomics data with SpatialData
- Analyze spatial DNA/RNA/protein data in subcellular/single cell/multiple cells with Giotto Suite
Mary Piper, Meeta Mistry, Jihe Liu, William Gammerdinger, & Radhika Khetani. (2022, January 6). hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5826256.