Bioinformatics File Formats & Tools
This page will become more organized over time, but for now here are some notes that should be helpful for new developers getting started with genomics analysis:
- File formats to familiarize yourself with:
- Here is an example of two paired reads from a mapped BAM (you may want to copy and paste them somewhere with a wider view):
K00217:149:HT3N3BBXX:7:1101:10084:15416:ATT+GGT 1185 2 47982455 51 16M13D13I67M = 47982526 167 CAACTCCTGGGCTCAACGCTTCCACCTGCCTTGGCCTCCTAATGTGCTGGGATTACAGACATGCGCCACCATTCCTGGCCAGGCTAATGTTTTAAA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ YA:Z:2:47982264:207M13D13I354M PG:Z:MarkDuplicates RG:Z:DH-ntrk-012-pl-T18 NM:i:28 YM:i:0 YO:Z:2:47982484:+:29S67M MQ:i:60 AS:i:67 XS:i:45 YX:i:10
K00217:149:HT3N3BBXX:7:1101:10084:15416:ATT+GGT 1105 2 47982526 60 96M = 47982455 -167 TTCCTGGCCAGGCTAATGTTTTAAATGTAAAATAAGAGTATTGATAATCCAGACGTTGTGTTGCATTTTATTCTTCTGTGCCCATGGTCATTTCAG JJJJJJJJ<JFJJJJJJJJJJJAJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJAF YA:Z:2:47982264:574M PG:Z:MarkDuplicates RG:Z:DH-ntrk-012-pl-T18 NM:i:1 MQ:i:51 AS:i:91 XS:i:21
Copy the base sequence from the first read, then copy the bases from the second read onto the next line. Try to find out where the two sequences match up by putting spacing before the second read.
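The manual exercise above can also be checked with a short script. This is only a sketch: the helper name `longest_suffix_prefix_overlap` is made up for illustration, and it assumes the two sequences overlap exactly (no mismatches) at the end of the first read. Comparing the sequences directly works here because the second read has the reverse-strand FLAG bit (0x10) set, so its SEQ is already stored in reference orientation.

```python
def longest_suffix_prefix_overlap(first, second):
    """Length of the longest suffix of `first` that is also a prefix of `second`."""
    for length in range(min(len(first), len(second)), 0, -1):
        if first.endswith(second[:length]):
            return length
    return 0


# FLAG and SEQ columns copied from the two example records above.
flag1, flag2 = 1185, 1105
read1 = "CAACTCCTGGGCTCAACGCTTCCACCTGCCTTGGCCTCCTAATGTGCTGGGATTACAGACATGCGCCACCATTCCTGGCCAGGCTAATGTTTTAAA"
read2 = "TTCCTGGCCAGGCTAATGTTTTAAATGTAAAATAAGAGTATTGATAATCCAGACGTTGTGTTGCATTTTATTCTTCTGTGCCCATGGTCATTTCAG"

REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is stored reverse-complemented
read1_is_reverse = bool(flag1 & REVERSE_STRAND)
read2_is_reverse = bool(flag2 & REVERSE_STRAND)

overlap = longest_suffix_prefix_overlap(read1, read2)
offset = len(read1) - overlap  # number of spaces to put before the second read

print(read1)
print(" " * offset + read2)
print("overlap:", overlap, "bases; offset:", offset)
```

The printed second line is padded so that the shared bases sit directly underneath each other.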
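Also, notice that the mate position 47982526 in the first read matches the start position 47982526 of the second read, and the mate position 47982455 in the second read matches the start position 47982455 of the first read. Each record stores its own start position (column 4) and the start position of its mate (column 8). The 167 in column 9 is the template length (TLEN), not the read length: it is the distance from the leftmost mapped base of the pair to the rightmost mapped base. It is positive for the leftmost read (here the first record) and negative for the rightmost read (the second record).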
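As a sanity check, the value 167 can be reproduced from the start positions and CIGAR strings of the two records. The sketch below assumes the usual SAM convention that only the M, D, N, = and X operations consume reference bases; the helper name `reference_length` is made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_length(cigar):
    """Number of reference bases covered by an alignment with this CIGAR string."""
    return sum(int(count) for count, op in CIGAR_OP.findall(cigar)
               if op in REFERENCE_CONSUMING)


# POS and CIGAR columns copied from the two example records.
read1_start, read1_cigar = 47982455, "16M13D13I67M"
read2_start, read2_cigar = 47982526, "96M"

read1_end = read1_start + reference_length(read1_cigar) - 1
read2_end = read2_start + reference_length(read2_cigar) - 1

# Leftmost mapped base to rightmost mapped base, inclusive.
template_length = max(read1_end, read2_end) - min(read1_start, read2_start) + 1
print(template_length)  # 167
```

Note that the 13D and 13I in the first read cancel out: the deletion consumes 13 reference bases without using any read bases, and the insertion does the opposite.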
- The GATK best practices are a comprehensive set of standards that describe the basic steps needed to generate variant calls from initial fastq data. We have mostly been using tools from version 3 of the GATK, but may upgrade to version 4.
- This repository builds on the steps implemented in the standard GATK pipeline, using two additional libraries, Fulcrum and Marianas. These tools use the UMIs on the ends of our reads to deduplicate reads that came from the same original DNA molecule in the sample, which gives us more accurate final "consensus reads".
- It may be a bit confusing at first to differentiate between several versions of similar pipelines. In addition to the GATK pipeline from the Broad Institute, the pipeline you are currently viewing draws heavily from two other existing pipelines. The first is the IMPACT-Pipeline, which is associated with a number of configuration and run scripts in this repository: Innovation-Meta-Pipeline.
- The second inspiration is our MSKCC pipeline, Roslin. Roslin has five modules, which are grouped together into the full pipeline found in setup/cwl/project-workflow.cwl. Roslin also depends on another set of setup and run scripts for configuring and actually running the pipeline. These scripts can be found in Roslin Core. They have also been adapted and used in this repository, e.g. pipeline-runner.sh and pipeline_submit.py.
- Abbreviations to know. BAM filenames will often have a host of suffixes added to indicate the processing that has been done:
  - MD = Picard's MarkDuplicates step
  - IR = ABRA's Indel Realignment step
  - FX = Picard's FixMateInformation step
  - CL = Usually refers to fastqs that have been CLipped (adapter sequence removed from the end)
  - BR = GATK's Base Quality Score Recalibration step
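A small lookup like the one below can help when reading an unfamiliar filename. It is a sketch only: the example filename and the assumption that suffixes are upper-case and separated by underscores are hypothetical, so adjust the splitting to whatever naming your files actually use.

```python
# Suffix meanings taken from the list above.
SUFFIX_MEANINGS = {
    "MD": "Picard MarkDuplicates",
    "IR": "ABRA indel realignment",
    "FX": "Picard FixMateInformation",
    "CL": "adapter clipping (fastq)",
    "BR": "GATK base quality score recalibration",
}


def describe_processing(filename):
    """List the processing steps implied by the suffixes in a filename.

    Assumes (hypothetically) that suffixes are separated by underscores and
    written in upper case; tokens that are not known suffixes are ignored.
    """
    stem = filename.rsplit(".", 1)[0]
    return [SUFFIX_MEANINGS[token] for token in stem.split("_")
            if token in SUFFIX_MEANINGS]


# Hypothetical filename, for illustration only.
steps = describe_processing("sample1_MD_IR_FX_BR.bam")
for step in steps:
    print(step)
```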
- "Therefore if you look at a SAM/BAM file (for Illumina data at least), it should be the case that in any pair of reads with the 0x02 bit set (i.e. considered a proper pair), exactly one of the two reads will have the 0x10 bit set as well (i.e. it is reverse-complemented; again, see the SAM file spec). For the read with its 0x10 bit set, the “SEQ” listed in the SAM file will be the reverse complement of the original read as seen in the FASTQ. That means that in the SAM file, the SEQs for a pair of reads are now both being presented in forward orientation even though the “FR” orientation information is stored in the FLAG." http://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing/
Marianas:ACT+TTA:2:48033828:4:3:2:48033899:4:3
These fields are:
- Marianas
- UMI1+UMI2
- read 1 contig
- read 1 start
- read 1 Positive Strand Read Count
- read 1 Negative Strand Read Count
- read 2 contig
- read 2 start
- read 2 Positive Strand Read Count
- read 2 Negative Strand Read Count
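The field list above maps directly onto a colon-separated split. The parser below is a minimal sketch, not part of Marianas itself: the class and function names are made up, and it assumes that contig names never contain a colon and that the identifier always has exactly ten fields.

```python
from typing import NamedTuple


class MarianasFields(NamedTuple):
    umi: str  # UMI1+UMI2
    read1_contig: str
    read1_start: int
    read1_plus_count: int   # read 1 positive strand read count
    read1_minus_count: int  # read 1 negative strand read count
    read2_contig: str
    read2_start: int
    read2_plus_count: int
    read2_minus_count: int


def parse_marianas_name(name):
    """Split a Marianas-style identifier into the fields listed above."""
    fields = name.split(":")
    if len(fields) != 10 or fields[0] != "Marianas":
        raise ValueError("unexpected identifier format: " + name)
    _, umi, c1, s1, p1, n1, c2, s2, p2, n2 = fields
    return MarianasFields(umi, c1, int(s1), int(p1), int(n1),
                          c2, int(s2), int(p2), int(n2))


parsed = parse_marianas_name("Marianas:ACT+TTA:2:48033828:4:3:2:48033899:4:3")
print(parsed)
```

For the example identifier, this yields UMI pair ACT+TTA, with both reads on contig 2 and each supported by 4 positive-strand and 3 negative-strand reads.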