A task and pipeline execution system for directed acyclic graphs to support scientific, and more specifically genomic, analysis workflows. We are currently in alpha development; please see the Roadmap. The latest API documentation can be found here.
- Goals
- Building
- Command line
- Include Dagr in your project
- Roadmap
- Overview
- List of features
- Authors
- License
There are many toolkits available for creating and executing pipelines of dependent jobs; dagr does not aim to be all things to all people but to make certain types of pipelines easier and more pleasurable to write. It is specifically focused on:
- Writing pipelines that are concise, legible, and type-safe
- Easy composition of pipelines into bigger pipelines
- Providing safe and coherent ways to dynamically change the graph during execution
- Making the full power and expressiveness of Scala available to pipeline authors
- Efficiently executing tasks concurrently within the constraints of a single machine/instance
It is a tool for working data scientists, programmers and bioinformaticians.
The following is a simple example pipeline in dagr, minus import and package statements:
@clp(description="Example FASTQ to BAM pipeline.", group = classOf[Pipelines])
class ExamplePipeline
( @arg(flag="i", doc="Input FASTQ.")         val fastq: PathToFastq,
  @arg(flag="r", doc="Reference FASTA.")     val ref: PathToFasta,
  @arg(flag="t", doc="Target regions.")      val targets: Option[PathToIntervals] = None,
  @arg(flag="o", doc="Output directory.")    val out: DirPath,
  @arg(flag="p", doc="Output file prefix.")  val prefix: String
) extends Pipeline(Some(out)) {

  override def build(): Unit = {
    val bam    = out.resolve(prefix + ".bam")
    val tmpBam = out.resolve(prefix + ".tmp.bam")
    val metricsPrefix: Some[DirPath] = Some(out.resolve(prefix))
    Files.createDirectories(out)

    val bwa   = new BwaMem(fastq=fastq, ref=ref)
    val sort  = new SortSam(in=Io.StdIn, out=tmpBam, sortOrder=SortOrder.coordinate)
    val mark  = new MarkDuplicates(in=tmpBam, out=Some(bam), prefix=metricsPrefix)
    val rmtmp = new DeleteBam(tmpBam)

    root ==> (bwa | sort) ==> mark ==> rmtmp
    targets.foreach(path => root ==> new CollectHsMetrics(in=bam, prefix=metricsPrefix, targets=path, ref=ref))
  }
}
The @clp and @arg annotations are required to expose this pipeline for execution via the command line interface. They can be omitted for pipelines that do not need to be run from the command line, for example pipelines that are only used as building blocks in other pipelines.
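To make the dependency syntax in the example more concrete, here is a minimal, self-contained sketch of how a chaining operator such as `==>` can record edges in a graph, and how an execution order can be derived from those edges. It is a toy model and not dagr's implementation: `ToyDag`, `Node`, and `executionOrder` are hypothetical names, and the sketch assumes that `a ==> b` means "b runs after a".

```scala
import scala.collection.mutable

// Toy model of a dependency DSL. All names here are hypothetical; this is NOT dagr's API.
object ToyDag {
  /** A named node that remembers which nodes must finish before it can run. */
  final class Node(val name: String) {
    val dependsOn: mutable.LinkedHashSet[Node] = mutable.LinkedHashSet.empty[Node]

    /** `a ==> b` records that b depends on a and returns b, so calls can be chained. */
    def ==>(next: Node): Node = { next.dependsOn += this; next }
  }

  /** Depth-first ordering in which every node appears after its dependencies.
    * Assumes the graph is acyclic; a cycle would recurse forever. */
  def executionOrder(targets: Seq[Node]): List[String] = {
    val visited = mutable.LinkedHashSet.empty[Node]
    def visit(node: Node): Unit = if (!visited.contains(node)) {
      node.dependsOn.foreach(visit) // dependencies first
      visited += node               // then the node itself
    }
    targets.foreach(visit)
    visited.toList.map(_.name)
  }

  def main(args: Array[String]): Unit = {
    val root    = new Node("root")
    val align   = new Node("align")
    val mark    = new Node("mark")
    val cleanup = new Node("cleanup")
    val metrics = new Node("metrics")

    root ==> align ==> mark ==> cleanup // a linear chain
    root ==> metrics                    // a second branch off the root

    // prints: root -> align -> mark -> cleanup -> metrics
    println(executionOrder(Seq(cleanup, metrics)).mkString(" -> "))
  }
}
```

Because `==>` does not end in a colon, Scala treats it as left-associative, so `root ==> align ==> mark` is evaluated as `(root ==> align) ==> mark`. Returning the right-hand node from each call is what lets the chain read left to right.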
Use sbt assembly to build an executable jar in target/scala-2.11/.
Tests may be run with sbt test.
Run java -jar target/scala-2.11/dagr-0.1.0-SNAPSHOT.jar to see the full list of options.
You can include the three sub-projects that make up dagr using:
libraryDependencies += "com.fulcrumgenomics" %% "dagr-core" % "0.1.0"
libraryDependencies += "com.fulcrumgenomics" %% "dagr-tasks" % "0.1.0"
libraryDependencies += "com.fulcrumgenomics" %% "dagr-pipelines" % "0.1.0"
Or you can depend on the following which will pull in the three dependencies above:
libraryDependencies += "com.fulcrumgenomics" %% "dagr" % "0.1.0"
We are currently working on the first release of dagr; features are evolving rapidly and are subject to change.
dagr contains three projects:
- dagr-core for specifying, scheduling, and executing tasks with dependencies.
- dagr-tasks for common genomic analysis tasks, such as those in Picard tools, JeanLuc, Bwa, and elsewhere.
- dagr-pipelines for common genomic pipelines, such as mapping, variant calling, and quality control.
dagr endeavors to combine the full features of the Scala programming language with a simplifying DSL for fast and easy authoring of complicated tasks and pipelines.
dagr pipelines execute on a single host or machine. For resource-intensive pipelines, we recommend provisioning large compute instances.
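Since everything runs on one machine, a scheduler has to decide which ready tasks fit into the cores and memory that are still free. The sketch below illustrates one simple way to do that, a greedy first-fit pass over the ready tasks. It is an assumption-laden illustration and not dagr's scheduler: `ToyScheduler`, `Resources`, `Task`, and `selectRunnable` are hypothetical names, and the resource numbers are made up.

```scala
// Illustrative only: a greedy, resource-aware selection step.
// All names and numbers are hypothetical; this is NOT dagr's scheduling algorithm.
object ToyScheduler {
  final case class Resources(cores: Int, memoryMb: Long)
  final case class Task(name: String, needs: Resources)

  /** Walks the ready tasks in order and keeps each one that still fits
    * in the remaining budget, deducting its needs from that budget. */
  def selectRunnable(ready: Seq[Task], available: Resources): Seq[Task] = {
    val (chosen, _) = ready.foldLeft((Vector.empty[Task], available)) {
      case ((picked, free), task) =>
        val fits = task.needs.cores <= free.cores && task.needs.memoryMb <= free.memoryMb
        if (fits) (picked :+ task, Resources(free.cores - task.needs.cores, free.memoryMb - task.needs.memoryMb))
        else (picked, free) // does not fit now; it stays ready for a later pass
    }
    chosen
  }

  def main(args: Array[String]): Unit = {
    val machine = Resources(cores = 8, memoryMb = 16384L)
    val ready = Seq(
      Task("bwa",     Resources(6, 8192L)),
      Task("sort",    Resources(1, 4096L)),
      Task("markdup", Resources(2, 4096L)), // skipped: only 1 core is left at this point
      Task("metrics", Resources(1, 2048L))
    )
    // prints: bwa, sort, metrics
    println(selectRunnable(ready, machine).map(_.name).mkString(", "))
  }
}
```

A real scheduler would call a step like this each time a task finishes and frees resources. First-fit is only one possible policy; it is easy to reason about but can leave capacity unused.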
Please see the example dagr configuration for customizing dagr for your environment.
In no particular order ...
- Manages complex dependencies among tasks and pipelines.
- Operators to pipe input and output between tasks and files without writing to disk.
- Resource-aware scheduling across tasks and pipelines to maximize parallelism.
- A simple GNU-style option parser and sub-command system, making pipelines into first-class command line programs.
- Supports pre-compiled pipelines and pipelines from Scala script files that are compiled on the fly.
- Tasks are not fully realized until all dependencies are met, allowing for conditional logic.
  - See EitherTask for a good example.
- Tasks can execute processes or be pure Scala methods run in the JVM.
- Mechanisms for passing state between tasks without coupling the tasks or having to rely on manual coordination (e.g. storing an intermediate result on disk).
  - See Linker
- Contains a small set of predefined genomic analysis tasks and pipelines.
- Configuration using HOCON and Typesafe Config to fully specify the dagr environment.
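The idea that tasks are not fully realized until their dependencies are met can be illustrated with a small sketch in plain Scala. Here a choice between two branches is built up front, but the condition is a by-name parameter, so it is only evaluated when the step actually runs. This is a hedged illustration of the general technique and does not reproduce dagr's EitherTask: `ToyChoice`, `Step`, `Simple`, and `Choice` are hypothetical names.

```scala
// Illustrative only: deferring a decision until run time with a by-name parameter.
// All names are hypothetical; this is NOT dagr's EitherTask.
object ToyChoice {
  /** A unit of work whose body executes only when run() is called. */
  trait Step { def run(): String }

  final class Simple(label: String) extends Step {
    def run(): String = label
  }

  /** `goLeft` is by-name (`=> Boolean`), so it is re-evaluated on each run()
    * and not when the Choice is constructed. */
  final class Choice(left: Step, right: Step, goLeft: => Boolean) extends Step {
    def run(): String = if (goLeft) left.run() else right.run()
  }

  def main(args: Array[String]): Unit = {
    var readCount = 0L // stands in for a value an upstream step would produce

    // The graph is built while readCount is still 0 ...
    val step = new Choice(new Simple("downsample"), new Simple("pass-through"), readCount > 1000000L)

    // ... and the upstream step "finishes" afterwards.
    readCount = 5000000L

    // prints: downsample
    println(step.run())
  }
}
```

Had the condition been a plain `Boolean`, it would have been fixed at construction time as `false`, and the step would always take the pass-through branch.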
- Tim Fennell (maintainer)
- Nils Homer (maintainer)
dagr is open source software released under the MIT License.