The "New" Pipeline

@pvstodghill pvstodghill released this 10 Nov 17:24

I have spent a bit of time over the last several weeks making
substantial changes to our RNA-Seq differential expression
pipeline. Here are some of the significant changes.

  • It's on GitHub: https://github.com/pvstodghill/pipeline-rnaseq/

  • It uses featureCounts
    (http://subread.sourceforge.net/featureCounts.html) instead of my
    crappy, slow, buggy scripts for counting reads/gene.

  • It has been tested using Docker, Singularity, and Conda to provide
    the various required components (e.g., Bowtie2, DESeq2).

  • It uses FALCO instead of FastQC. It uses FASTP instead of
    Trimmomatic.

  • Several of the stages that relied on non-parallel components
    (running FALCO, generating profiles, etc.) are now executed
    concurrently using GNU Parallel.

  • Components that are used by other pipelines ("howto", "stubs", and
    "scripts") have been factored out into Git submodules for easy
    reuse.

  • It no longer converts the SAM/BAM files to GFF before counting reads
    and making profiles. This saves a huge amount of disk space.

  • There is a simple way of testing that the "sense" of the reads is
    correct (i.e., do more reads align to the "sense" or the
    "anti-sense" strand of the tmRNA gene?).

  • The pipeline is geared towards using RefSeq genomes and
    annotations. However, it's now possible to provide "additional" gene
    annotations (say, from DC3000's "old" GenBank annotation) and a
    list of gene name "aliases" (say, from DC3000's "old" GenBank
    annotation), which get added to the DESeq2 results.
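
The strand-sense test in the list above can be illustrated with a minimal sketch. This is not the pipeline's actual code. It assumes the alignments overlapping the region have already been parsed into simple records; the `Alignment` type, the `count_strands` function, and the toy coordinates are all hypothetical. Whether "same strand as the gene" means "sense" depends on the library protocol, so the sketch only reports the two counts.

```python
from typing import Iterable, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical, already-parsed alignment record."""
    start: int        # 1-based leftmost aligned position
    end: int          # 1-based rightmost aligned position (inclusive)
    is_reverse: bool  # True if SAM flag bit 0x10 is set


def count_strands(alignments: Iterable[Alignment],
                  gene_start: int, gene_end: int,
                  gene_strand: str) -> Tuple[int, int]:
    """Count reads overlapping the gene on the same vs. the opposite strand."""
    same = opposite = 0
    for aln in alignments:
        # Skip reads that do not overlap the gene at all.
        if aln.end < gene_start or aln.start > gene_end:
            continue
        read_strand = "-" if aln.is_reverse else "+"
        if read_strand == gene_strand:
            same += 1
        else:
            opposite += 1
    return same, opposite


# Toy data (made up): a "+" strand gene at positions 90..200.
reads = [
    Alignment(100, 150, False),
    Alignment(120, 170, False),
    Alignment(140, 190, True),
    Alignment(500, 550, False),  # outside the gene, ignored
]
same, opposite = count_strands(reads, 90, 200, "+")
print(f"same strand: {same}, opposite strand: {opposite}")
```

If the larger count falls on the strand you did not expect for a highly expressed gene such as tmRNA, the strandedness setting for the library is probably wrong.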

To summarize

  • This version is much easier to share with others. It should be much
    easier to get it running on, e.g., the BioHPC cluster.

  • Because it uses faster components (FALCO, FASTP, featureCounts) and
    GNU Parallel where necessary, it is much faster to run.

  • Counting reads/gene is much more robust.

  • The RefSeq and GenBank annotations for DC3000 are starting to
    diverge. Going forward, we can use the RefSeq annotations, but we
    can carry forward the bits of the GenBank annotation that we don't
    want to lose.
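
As a rough illustration of carrying old names forward, here is a minimal sketch of attaching aliases to a results table. It is not the pipeline's actual code. The `add_aliases` function, the `gene` and `alias` column names, and the locus tags are hypothetical placeholders; `log2FoldChange` and `padj` are the usual DESeq2 result columns.

```python
from typing import Dict, List


def add_aliases(results_rows: List[Dict[str, str]],
                aliases: Dict[str, str]) -> List[Dict[str, str]]:
    """Return copies of the rows with an added 'alias' column.

    Genes without a known alias get an empty string.
    """
    annotated = []
    for row in results_rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row["alias"] = aliases.get(row["gene"], "")
        annotated.append(new_row)
    return annotated


# Made-up example rows and alias table (placeholder locus tags).
results = [
    {"gene": "NEW_0001", "log2FoldChange": "2.1", "padj": "0.001"},
    {"gene": "NEW_0002", "log2FoldChange": "-0.3", "padj": "0.7"},
]
aliases = {"NEW_0001": "OLD_0001"}

annotated = add_aliases(results, aliases)
for row in annotated:
    print(row["gene"], row["alias"] or "(no alias)", row["log2FoldChange"])
```

Keeping the alias in its own column leaves the RefSeq identifier as the primary key while still letting people search the results by the older name.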