I have spent a bit of time over the last several weeks making
substantial changes to our RNA-Seq differential expression
pipeline. Here are some of the significant changes.

- It's on GitHub: https://github.com/pvstodghill/pipeline-rnaseq/
- It uses featureCounts
  (http://subread.sourceforge.net/featureCounts.html) instead of my
  crappy, slow, buggy scripts for counting reads per gene.
- It has been tested using Docker, Singularity, and Conda to provide
  the various required components (e.g., Bowtie2, DESeq2).
- It uses FALCO instead of FastQC, and FASTP instead of Trimmomatic.
- Several of the stages that relied on non-parallel components
  (running FALCO, generating profiles, etc.) are now executed
  concurrently using GNU Parallel.
- Components that are used by other pipelines ("howto", "stubs", and
  "scripts") have been factored out into Git submodules for easy
  reuse.
- It no longer converts the SAM/BAM files to GFF before counting reads
  and making profiles. This is a huge space savings.
- There is a simple way of testing that the "sense" of the reads is
  correct (i.e., are there more reads aligning to the "sense" or the
  "anti-sense" strand of the tmRNA gene?).
- The pipeline is geared towards using RefSeq genomes and
  annotations. However, it's now possible to provide "additional" gene
  annotations (say, from DC3000's "old" GenBank annotation) and a
  list of gene name "aliases" (say, from DC3000's "old" GenBank
  annotation) which get added to the DESeq2 results.
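
To make the "sense" test mentioned above concrete, here is a minimal Python sketch of the idea. It is not the pipeline's actual code: the function names, the tuple layout, and the coordinates are all made up for illustration. It only assumes that each alignment can be reduced to a start, an end, and a strand.

```python
def strand_sense_counts(gene, reads):
    """Count reads overlapping `gene` on the same or the opposite strand.

    gene  : (start, end, strand), 1-based inclusive coordinates (hypothetical layout)
    reads : iterable of (start, end, strand) alignments on the same replicon
    """
    g_start, g_end, g_strand = gene
    counts = {"sense": 0, "antisense": 0}
    for r_start, r_end, r_strand in reads:
        if r_end < g_start or r_start > g_end:
            continue  # read does not overlap the gene at all
        if r_strand == g_strand:
            counts["sense"] += 1
        else:
            counts["antisense"] += 1
    return counts


def library_orientation(counts):
    """Report which strand the majority of overlapping reads support."""
    if counts["sense"] > counts["antisense"]:
        return "sense"
    if counts["antisense"] > counts["sense"]:
        return "antisense"
    return "undetermined"


# Made-up coordinates standing in for the tmRNA gene and a few alignments.
example_gene = (100, 500, "+")
example_reads = [(90, 150, "+"), (200, 250, "+"), (300, 350, "-"), (600, 650, "+")]
example_counts = strand_sense_counts(example_gene, example_reads)
print(example_counts, library_orientation(example_counts))
```

A highly expressed gene such as the one for tmRNA makes a good probe because the majority vote is unlikely to be a tie. If most reads land on the "anti-sense" strand, the strandedness setting used for counting probably needs to be flipped.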

To summarize:

- This version is much easier to share with others. It should be much
  easier to get it running on, e.g., the BioHPC cluster.
- Because it uses faster components (FALCO, FASTP, featureCounts) and
  GNU Parallel where necessary, it is much faster to run.
- Counting reads per gene is much more robust.
- The RefSeq and GenBank annotations for DC3000 are starting to
  diverge. Going forward, we can use the RefSeq annotations, but we
  can carry forward the bits of the GenBank annotation that we don't
  want to lose.
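
As a rough illustration of carrying old gene names forward, the Python sketch below attaches an alias column to a table of results. It is an assumption about how such a merge could look, not the pipeline's implementation. The column names ("gene", "alias") and the identifiers are placeholders, not real DC3000 locus tags.

```python
def add_aliases(results, aliases, column="alias"):
    """Return copies of the result rows with an extra alias column.

    results : list of dicts, one per gene, each with a "gene" key (assumed layout)
    aliases : dict mapping a current gene identifier to its older name
    """
    annotated = []
    for row in results:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row[column] = aliases.get(row["gene"], "")  # blank when no alias is known
        annotated.append(new_row)
    return annotated


# Placeholder identifiers and values, for illustration only.
example_results = [
    {"gene": "NEW_0001", "log2FoldChange": 1.5},
    {"gene": "NEW_0002", "log2FoldChange": -0.7},
]
example_aliases = {"NEW_0001": "OLD_0001"}

for row in add_aliases(example_results, example_aliases):
    print(row)
```

Genes without an entry in the alias list get an empty string, so the results table keeps one row per gene whether or not an older name exists.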