Ian edited this page Jan 27, 2021 · 28 revisions

This project is focused on incorporating methods of "collapsing" a bam file across reads that come from the same initial cfDNA molecule, as determined by the Unique Molecular Indices included during library preparation.

We use Toil and CWL to combine various tools into a pipeline for collapsing UMI-tagged DNA molecules.

We draw on the work done by the Platform Informatics team at MSKCC in building Roslin and the IMPACT pipeline.

Our repository consists of the following directories:

  • /cwl_tools: .cwl files that specify the parameters to be supplied to the underlying tools
  • /python_tools: .py files that can be called from the .cwl files
  • /workflows: .cwl files that chain together the .cwl files from /cwl_tools into full workflows
  • /test: test inputs.yaml files and data

General Information for new users

Here is a brief introduction to the frameworks that we use here at MSK to write our bioinformatics pipelines. Use the linked resources and tutorials to get more familiar with the details of our software and tools.

Common Workflow Language:

CWL is a standard intended to simplify and unify bioinformatics workflows across institutions and between users. It is used to describe the tools that should be run on bioinformatics data files (in our case .bam, .fastq, .vcf, etc.).

CWL documents are written in YAML (JSON also works). The keys you will see most often in the root of the file are cwlVersion, class, inputs, and outputs, plus baseCommand, arguments, and stdout for a CommandLineTool, or steps for a Workflow.

When you first write a .cwl file, you will be working with either a CommandLineTool or a Workflow, and you will use the keys mentioned above to fill out the steps that should be applied to that file.
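To make the root-level keys concrete, here is a minimal sketch of a CommandLineTool built as a Python dictionary and printed as JSON, which a CWL runner should accept in place of YAML. The tool (a wrapper around echo), the input name message, and the output file name are hypothetical and do not come from this repository.

```python
import json

# A hypothetical minimal CommandLineTool that wraps `echo`.
# The input, output, and file names are illustrative only.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {
            "type": "string",
            "inputBinding": {"position": 1},
        },
    },
    "outputs": {
        # Capture whatever the command writes to standard output.
        "echoed": {"type": "stdout"},
    },
    "stdout": "echo_output.txt",
}

# JSON is also valid YAML, so this text could be saved as a .cwl file.
cwl_text = json.dumps(echo_tool, indent=2)
print(cwl_text)
```

A Workflow document has the same overall shape, but replaces baseCommand and stdout with a steps section that wires tools together.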

For developing and testing workflows we will use cwltool, which is the reference implementation of CWL. It parses the .cwl files and turns them into a directed acyclic graph of jobs that run in the order determined by their dependencies on one another. Toil uses cwltool under the hood (see next section).
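The idea of running jobs in dependency order can be sketched with a small topological sort. This is not cwltool's actual implementation; the step names are hypothetical and the function only illustrates how a directed acyclic graph yields a valid execution order.

```python
from collections import deque

# Hypothetical workflow steps mapped to the steps they depend on.
step_dependencies = {
    "trim_umis": [],
    "align": ["trim_umis"],
    "collapse": ["align"],
    "qc": ["align"],
    "report": ["collapse", "qc"],
}


def execution_order(dependencies):
    """Return steps so that each one follows its dependencies (Kahn's algorithm)."""
    remaining = {step: len(deps) for step, deps in dependencies.items()}
    dependents = {step: [] for step in dependencies}
    for step, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(step)

    # Start with the steps that have no unmet dependencies.
    ready = deque(sorted(step for step, count in remaining.items() if count == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for child in sorted(dependents[step]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(dependencies):
        raise ValueError("cycle detected: the steps do not form a DAG")
    return order


order = execution_order(step_dependencies)
print(order)
```

Steps with no path between them (collapse and qc here) could run in parallel, which is what a workflow engine takes advantage of.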

Tutorial on inputs / outputs / CWL tools and CWL workflows

Join the Gitter community (great place to ask questions about CWL)

Ask questions on Biostars (look out for questions by me @ionox0)

Toil:

Documentation (for version 3.15.0)

Join the Gitter community

Toil is the "workflow engine" that submits workflow jobs to a batch scheduler, either IBM's Load Sharing Facility (LSF) or Sun Grid Engine (SGE). For LSF it uses bsub under the hood; you can see the actual place in the code where this bsub happens here.

It is basically doing the same thing that you could do with a python or bash script, but in addition it gives you the ability to specify your workflow in CWL. It uses cwltool under the hood to create the in-memory job graph.
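As a rough illustration of what submitting a job through bsub looks like from a script, the sketch below assembles an LSF command line without running it. It is not Toil's code; the helper build_bsub_command, the job name, and collapse.py are hypothetical, and only the common bsub options -J (job name), -n (slots), and -o (output file) are used.

```python
import shlex


def build_bsub_command(job_name, command, cores=1, log_path="job.%J.log"):
    """Build (but do not run) an LSF submission command for one workflow job."""
    return [
        "bsub",
        "-J", job_name,    # job name shown by the scheduler
        "-n", str(cores),  # number of job slots to request
        "-o", log_path,    # file for the job's standard output (%J is the job ID)
    ] + list(command)


# Hypothetical job: the script name and input file are made up.
bsub_args = build_bsub_command(
    "collapse_sample1",
    ["python", "collapse.py", "sample1.bam"],
    cores=2,
)
print(" ".join(shlex.quote(arg) for arg in bsub_args))
```

A workflow engine adds the parts that a hand-written script would otherwise need: tracking which jobs finished, retrying failures, and submitting dependent jobs once their inputs exist.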

ACCESS:

Analysis of Circulating cfDNA to Examine Somatic Status

DNA molecules isolated from the plasma of our samples are "tagged" with random 3-base sequences called Unique Molecular Indices. These bases are used to determine which reads came from the same initial fragment of DNA in the sample. After sequencing, our pipeline removes the first 4 or 5 bases from the 5' ends of the reads, the first three of which get used (along with the start position) to group all of the reads into UMI "families".
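The trimming and grouping described above can be sketched as follows. This is a simplified illustration, not the pipeline's actual implementation: reads are represented as made-up (sequence, start position) pairs, and the trim length is fixed at 4 for the example.

```python
from collections import defaultdict

UMI_LENGTH = 3   # bases used as the UMI, per the description above
TRIM_LENGTH = 4  # bases removed from the 5' end (the text says 4 or 5)


def split_umi(sequence, umi_length=UMI_LENGTH, trim_length=TRIM_LENGTH):
    """Return (umi, trimmed_sequence) for one read."""
    return sequence[:umi_length], sequence[trim_length:]


def group_into_families(reads):
    """Group (sequence, start_position) reads by (UMI, start position)."""
    families = defaultdict(list)
    for sequence, start in reads:
        umi, trimmed = split_umi(sequence)
        families[(umi, start)].append(trimmed)
    return dict(families)


# Made-up reads: (sequence, alignment start position).
reads = [
    ("ACTTGGCATTAGC", 1000),
    ("ACTTGGCATTAGC", 1000),
    ("GCATGGCATTAGC", 1000),
    ("ACTTCCGTAAGGT", 2500),
]

families = group_into_families(reads)
for key, members in sorted(families.items()):
    print(key, len(members))
```

The first two reads share both UMI and start position, so they land in one family; the third has a different UMI and the fourth a different start position, so each forms its own family.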

Within each family, there are reads from both original strands (positive or negative / Watson or Crick / top or bottom), and each of these reads will have the same 6 UMI bases, designated as XXX+XXX in the read header, for example ACT+GCT. (An optional wobble parameter in Marianas can be used to specify how many mismatches to allow when merging two families into one.) This additional information provides a more accurate representation of which reads came from the same original source molecule of double-stranded DNA, as we would expect all of the UMIs to be identical if they came from the same molecule before PCR amplification.
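To illustrate what allowing mismatches between UMIs means, here is a small sketch that compares two XXX+XXX tags base by base. It is an assumption-laden illustration only: the function names are hypothetical, and Marianas's wobble parameter may be implemented differently from this simple mismatch count.

```python
def parse_duplex_umi(tag):
    """Split a tag such as 'ACT+GCT' into its two 3-base halves."""
    first, second = tag.split("+")
    return first, second


def umi_mismatches(tag_a, tag_b):
    """Count mismatched bases between two duplex UMI tags of equal length."""
    bases_a = "".join(parse_duplex_umi(tag_a))
    bases_b = "".join(parse_duplex_umi(tag_b))
    if len(bases_a) != len(bases_b):
        raise ValueError("UMI tags must have the same length")
    return sum(a != b for a, b in zip(bases_a, bases_b))


def same_family(tag_a, tag_b, max_mismatches=0):
    """Treat two tags as one family if they differ by at most max_mismatches bases."""
    return umi_mismatches(tag_a, tag_b) <= max_mismatches


print(umi_mismatches("ACT+GCT", "ACT+GAT"))
print(same_family("ACT+GCT", "ACT+GAT"))
print(same_family("ACT+GCT", "ACT+GAT", max_mismatches=1))
```

With no tolerance, a single sequencing error in a UMI splits one true family into two; allowing one mismatch merges them again at the cost of occasionally merging unrelated molecules.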

Understanding Pip and Virtualenv and the $PATH variable

When you run virtualenv ~/my_virtual_env you get a new folder. Inside it, ~/my_virtual_env/bin contains a python symlink to the system python, while ~/my_virtual_env/lib is a directory for installing fresh, isolated versions of python packages.

This directory will only be used while you are “inside” your virtual env, that is, after you've run source ~/my_virtual_env/bin/activate.

Under the hood, this really just adds an entry at the beginning of your $PATH variable, which makes this new python symlink the default, along with the pip binary next to it and all of the python packages in the ~/my_virtual_env/lib directory.

Otherwise, when not inside this virtual environment, the $PATH variable will not have this entry for ~/my_virtual_env/bin, so the system python in /Library/.../python-2.7.10/bin will be used.
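The effect of prepending a directory to $PATH can be demonstrated with a short sketch. It assumes a POSIX system and uses two temporary directories holding fake python executables as stand-ins for the system bin directory and ~/my_virtual_env/bin; nothing here touches a real virtualenv.

```python
import os
import shutil
import stat
import tempfile


def make_fake_executable(directory, name):
    """Create a tiny executable shell script so PATH lookup can find it."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\necho fake\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


with tempfile.TemporaryDirectory() as system_bin, \
        tempfile.TemporaryDirectory() as venv_bin:
    system_python = make_fake_executable(system_bin, "python")
    venv_python = make_fake_executable(venv_bin, "python")

    # Before activation: only the "system" directory is on the search path.
    path_before = system_bin
    # After activation: the virtualenv's bin directory is placed first.
    path_after = venv_bin + os.pathsep + path_before

    resolved_before = shutil.which("python", path=path_before)
    resolved_after = shutil.which("python", path=path_after)

print("before activate:", resolved_before == system_python)
print("after activate: ", resolved_after == venv_python)
```

Because the shell searches $PATH entries from left to right, whichever python appears in the first matching directory wins, which is why activating puts the virtualenv's bin directory at the front.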