This example workflow demonstrates how to parallelize the SSAHA2 (Sequence Search and Alignment by Hashing Algorithm) tool published by the Sanger Institute.
If you have not done so already, please clone this example repository like so:
git clone https://github.com/cooperative-computing-lab/makeflow-examples.git
cd ./makeflow-examples/ssaha
First, download and install a suitable binary for SSAHA2:
export SSAHA_BINARY=ssaha2_v2.5.5_x86_64
wget ftp://ftp.sanger.ac.uk/pub/resources/software/ssaha2/${SSAHA_BINARY}.tgz
tar xvzf ${SSAHA_BINARY}.tgz
cp ${SSAHA_BINARY}/ssaha2 .
If you do not have any data of your own, you can generate some random data in FASTQ format for testing purposes. The first argument to fastq_generate.pl is the number of sequences, and the second is the length of each sequence. In the second command below, db.fastq is passed as an additional argument so that the generated queries are drawn from the reference and will produce matches.
./fastq_generate.pl 10000 2000 > db.fastq
./fastq_generate.pl 100000 100 db.fastq > query.fastq
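If you would rather see what such test data looks like without the helper script, the following stand-alone awk sketch emits random FASTQ records (the record names, the constant quality string, and the use of awk are illustrative choices, not part of fastq_generate.pl):

```shell
# Hypothetical stand-in for fastq_generate.pl: emit N random FASTQ
# records of length L, each record being the usual four lines
# (@name, sequence, +, quality string).
N=4 L=8
awk -v n="$N" -v l="$L" 'BEGIN {
    srand(42)
    bases = "ACGT"
    for (i = 1; i <= n; i++) {
        seq = ""; qual = ""
        for (j = 1; j <= l; j++) {
            seq  = seq substr(bases, int(rand() * 4) + 1, 1)
            qual = qual "I"          # constant placeholder quality
        }
        printf "@seq%d\n%s\n+\n%s\n", i, seq, qual
    }
}' > toy.fastq
wc -l toy.fastq   # 4 records x 4 lines each = 16 lines
```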
Make sure that the sequential ssaha2 executable works. The full run takes roughly five minutes, so cancel it once you are satisfied that it is working.
./ssaha2 db.fastq query.fastq
Then, generate a workflow to parallelize the job into sub-jobs of 1000 sequences each:
./make_ssaha_workflow db.fastq query.fastq output.fastq 1000 > ssaha.mf
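The generated file consists of Make-style rules: each rule names its output files, its input files, and the command that produces the outputs. The fragment below is a hypothetical sketch of that shape only; the actual rules emitted by make_ssaha_workflow (file names, split tool, number of chunks) will differ:

```make
# Hypothetical sketch of the generated rule structure.
# Split the query, align each chunk independently, merge the results.

query.fastq.0 query.fastq.1: query.fastq
	./split_fastq 1000 query.fastq

output.fastq.0: ssaha2 db.fastq query.fastq.0
	./ssaha2 db.fastq query.fastq.0 > output.fastq.0

output.fastq.1: ssaha2 db.fastq query.fastq.1
	./ssaha2 db.fastq query.fastq.1 > output.fastq.1

output.fastq: output.fastq.0 output.fastq.1
	cat output.fastq.0 output.fastq.1 > output.fastq
```

Because each alignment rule depends only on the reference and its own chunk, Makeflow can dispatch the alignment jobs to different machines in parallel.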
Finally, run the workflow using makeflow locally, or using a batch system:
makeflow ssaha.mf
makeflow -T condor ssaha.mf
makeflow -T sge ssaha.mf
makeflow -T wq ssaha.mf
Alternatively, the workflow can be run using the JX or JSON representation:
makeflow --jx ssaha.jx
makeflow --json ssaha.json
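In the JX/JSON representation, each rule is an object listing its command, inputs, and outputs. The fragment below is a minimal hypothetical example of that structure, not the contents of the generated ssaha.jx:

```json
{
    "rules": [
        {
            "command": "./ssaha2 db.fastq query.fastq.0 > output.fastq.0",
            "inputs":  ["ssaha2", "db.fastq", "query.fastq.0"],
            "outputs": ["output.fastq.0"]
        }
    ]
}
```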
| Workflow Size | Reference Size (Number x Length) | Query Size (Number x Length) | Sequences per Split | Approx. Time : Machines |
|---------------|----------------------------------|------------------------------|---------------------|-------------------------|
| Small         | 10000x2000 (Fixed 20M)           | 100000x100 (237K)            | 100                 | ~5 min : 1 machine      |
| Medium        | 10000x2000 (Fixed 1000M)         | 1000000x100 (237M)           | 1000                | ~3 : 20 machines        |
| Large         | 100000x2000 (Fixed 4.0G)         | 1000000x500 (386M)           | 1000                | ~15 min : 20 machines   |