Home
This project is focused on incorporating methods of "collapsing" a bam file across reads that come from the same initial cfDNA molecule, as determined by the Unique Molecular Indices included during library preparation.
We use Toil and CWL to combine various tools into a pipeline for collapsing UMI-tagged DNA molecules.
We draw on the work done by the Platform Informatics team at MSKCC in building Roslin and the IMPACT pipeline.
Our repository consists of the following directories:
- /cwl_tools: .cwl files that specify the parameters to be supplied to the underlying tools
- /python_tools: .py files that can be called from the .cwl files
- /workflows: .cwl files that chain together the .cwl files from /cwl_tools into full workflows
- /test: test inputs.yaml files and data
Here is a brief introduction to the frameworks that we use here at MSK to write our bioinformatics pipelines. Use the linked resources and tutorials to get more familiar with the details of our software and tools.
CWL is a standard intended to simplify and unify bioinformatics workflows across institutions and between users. It is used to describe the tools that should be run on bioinformatics data files (in our case .bam, .fastq, .vcf, etc.).
CWL documents are written in YAML (JSON is also accepted). The keys you will see most often in the root of the file are cwlVersion, class, inputs, and outputs, plus steps for a Workflow, or baseCommand, arguments, and stdout for a CommandLineTool.
When you first write a .cwl file, you will be working with either a CommandLineTool or a Workflow, and you will use the keys mentioned above to fill out the steps that should be applied to that file.
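As a rough illustration of how those keys fit together, here is a minimal sketch of a CommandLineTool built as a Python dictionary and written out as JSON. The tool itself (a wrapper around wc -l), the input and output names, and the write_cwl helper are all hypothetical and are not part of this repository.

```python
import json

# Hypothetical CommandLineTool that counts the lines of a file with `wc -l`.
# The names `input_file` and `line_count` are made up for this example.
line_count_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "wc",
    "arguments": ["-l"],
    "inputs": {
        "input_file": {
            "type": "File",
            # Place the file path after the fixed arguments on the command line.
            "inputBinding": {"position": 1},
        },
    },
    # Capture standard output into this file name.
    "stdout": "line_count.txt",
    "outputs": {
        # The `stdout` type exposes the captured file as the tool's output.
        "line_count": {"type": "stdout"},
    },
}


def write_cwl(tool, path):
    """Write the tool description to `path` as JSON (hypothetical helper)."""
    with open(path, "w") as handle:
        json.dump(tool, handle, indent=2)


# The serialized document, ready to be saved with a .cwl extension.
cwl_text = json.dumps(line_count_tool, indent=2)
```

Most .cwl files in practice are written by hand in YAML; the dictionary form is used here only to keep the example in Python.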
For developing and testing workflows we will use cwltool, which is the reference implementation of a CWL file parser. It automatically parses the .cwl files and turns them into a directed acyclic graph of jobs that get run in the order in which they depend on one another. Toil uses cwltool under the hood (see next section).
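The idea of running jobs in dependency order can be sketched with a small topological sort. This is not how cwltool is implemented; it is a simplified illustration, and the step names (trim_umis, align, collapse, qc) are hypothetical.

```python
from collections import deque


def execution_order(dependencies):
    """Return the steps in an order that respects their dependencies.

    `dependencies` maps each step name to the set of steps it depends on.
    """
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    # Steps that only appear as dependencies have no dependencies of their own.
    for deps in list(remaining.values()):
        for dep in deps:
            remaining.setdefault(dep, set())

    # Start with every step that is ready to run immediately.
    ready = deque(sorted(step for step, deps in remaining.items() if not deps))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        # Finishing `step` may unblock the steps that were waiting on it.
        for other in sorted(remaining):
            if step in remaining[other]:
                remaining[other].discard(step)
                if not remaining[other]:
                    ready.append(other)

    if len(order) != len(remaining):
        raise ValueError("dependencies contain a cycle")
    return order


# Hypothetical workflow: align needs trimmed reads; collapse and qc need alignments.
steps = {
    "align": {"trim_umis"},
    "collapse": {"align"},
    "qc": {"align"},
}
order = execution_order(steps)
```

Steps with no unmet dependencies at the same time (collapse and qc here) could be run in parallel by a workflow engine; the sketch simply lists them one after another.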
Tutorial on inputs / outputs / CWL tools and CWL workflows
Join the Gitter community (great place to ask questions about CWL)
Ask questions on Biostars (look out for questions by me @ionox0)
Documentation (for version 3.15.0)
Toil is the "workflow engine" that submits workflow jobs to a job scheduler such as IBM's Load Sharing Facility (LSF), where it uses bsub under the hood, or Sun Grid Engine (SGE). You can see the actual place in the code where this bsub happens here.
It is basically doing the same thing that you could do with a python or bash script, but in addition it gives you the ability to specify your workflow in CWL. It uses cwltool under the hood to create the in-memory job graph.
Analysis of Circulating cfDNA to Examine Somatic Status
DNA molecules isolated from the plasma of our samples are "tagged" with random 3-base sequences called Unique Molecular Indices (UMIs). These bases are used to determine which reads came from the same initial fragment of DNA from the sample. After sequencing, our pipeline removes the first 4 or 5 bases from the 5' ends of the reads, the first three of which get used (along with the start position) to group all of the reads into UMI "families".
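The trimming and grouping described above can be sketched as follows. This is a simplified illustration rather than the pipeline's actual code: it assumes 4 bases are trimmed, takes the start position as given (in practice it comes from aligning the trimmed read), and the function names and the read tuples are hypothetical.

```python
from collections import defaultdict

UMI_LENGTH = 3   # number of UMI bases, as described in the text
TRIM_LENGTH = 4  # the text says 4 or 5 bases are removed; 4 is assumed here


def trim_umi(sequence):
    """Split a read into its UMI and the sequence left after trimming."""
    return sequence[:UMI_LENGTH], sequence[TRIM_LENGTH:]


def group_into_families(reads):
    """Group reads by (UMI, start position).

    `reads` is an iterable of (read_name, sequence, start_position) tuples.
    """
    families = defaultdict(list)
    for name, sequence, start in reads:
        umi, _trimmed = trim_umi(sequence)
        families[(umi, start)].append(name)
    return dict(families)


# Made-up reads: r1 and r2 share a UMI and start position, r3 has a different UMI.
example_reads = [
    ("r1", "ACTGTTTT", 100),
    ("r2", "ACTGAAAA", 100),
    ("r3", "GCTGTTTT", 100),
]
families = group_into_families(example_reads)
```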
Within each family, there are reads from both original strands (positive or negative / Watson or Crick / top or bottom), and each of these reads will have the same 6 UMI bases, designated as XXX+XXX in the read header, for example ACT+GCT (there is an optional wobble parameter in Marianas which can be used to specify how many mismatches to allow when merging two families into one). This additional information provides a more accurate representation of which reads came from the same original source molecule of double-stranded DNA, as we would expect all of the UMIs to be identical if they came from the same molecule before PCR amplification.
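The mismatch tolerance mentioned above can be illustrated by counting the positions at which two UMI strings differ. This sketch is not Marianas's implementation, and max_mismatches is a hypothetical parameter name standing in for whatever tolerance is configured.

```python
def umi_mismatches(umi_a, umi_b):
    """Count the positions at which two equal-length UMI strings differ."""
    if len(umi_a) != len(umi_b):
        raise ValueError("UMIs must have the same length")
    return sum(base_a != base_b for base_a, base_b in zip(umi_a, umi_b))


def can_merge(umi_a, umi_b, max_mismatches=1):
    """Decide whether two families could be merged under the given tolerance."""
    return umi_mismatches(umi_a, umi_b) <= max_mismatches


# Duplex UMIs in the XXX+XXX form used in the read header.
identical = umi_mismatches("ACT+GCT", "ACT+GCT")
one_off = umi_mismatches("ACT+GCT", "ACT+GCA")
mergeable = can_merge("ACT+GCT", "ACT+GCA", max_mismatches=1)
too_different = can_merge("ACT+GCT", "AGT+GCA", max_mismatches=1)
```

With a tolerance of zero, only families whose UMIs are identical would be merged.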
When you run virtualenv ~/my_virtual_env you get a new folder. Inside ~/my_virtual_env/bin there is a python symlink to the system python, while ~/my_virtual_env/lib is a directory for installing fresh, isolated versions of python packages.
This directory will only be used while you are "inside" your virtual env, that is, after you've run source ~/my_virtual_env/bin/activate.
Under the hood, this really just adds an entry at the beginning of your $PATH variable, which makes this new python symlink the default, along with the pip binary next to it and all of the python packages in the ~/my_virtual_env/lib directory.
Otherwise, when not inside this virtual environment, the $PATH variable will not have this entry for ~/my_virtual_env/bin, so the system python from /Library/.../python-2.7.10/bin will be used.
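The effect of putting the virtual env's bin directory first in $PATH can be sketched with shutil.which, which searches a list of directories in order. The example assumes a POSIX system and uses temporary directories with placeholder python files instead of a real virtualenv; make_fake_python and resolve_python are hypothetical helpers.

```python
import os
import shutil
import stat
import tempfile


def make_fake_python(directory):
    """Create an executable placeholder named `python` inside `directory`."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "python")
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    # Mark the file as executable so that it counts as a command.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve_python(path_entries):
    """Return the `python` that would be found for the given PATH entries."""
    return shutil.which("python", path=os.pathsep.join(path_entries))


root = tempfile.mkdtemp()
system_bin = os.path.join(root, "system", "bin")
venv_bin = os.path.join(root, "my_virtual_env", "bin")
system_python = make_fake_python(system_bin)
venv_python = make_fake_python(venv_bin)

# Before activation only the system directory is on the search path.
before_activate = resolve_python([system_bin])
# Activation puts the virtual env's bin directory in front of it.
after_activate = resolve_python([venv_bin, system_bin])
```

The first matching entry wins, which is why the virtual env's python takes over once its bin directory is at the front of the search path.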