Bioinformatics File Formats & Tools

Ian edited this page Jan 16, 2019 · 2 revisions

This page will become more organized over time, but for now here are some notes that should be helpful for new developers getting started with genomics analysis:

  • File formats to familiarize yourself with:

    • Fastq (each record in this file is one "read", a sequence of roughly 120 base pairs)
    • Sam
    • Bam == Binary (compressed) Sam
    • Cigar
    • Pileup
    • Vcf
    • Bed
  • Here is an example of two paired reads from a mapped Bam (you might want to copy and paste them somewhere with a wider view):

K00217:149:HT3N3BBXX:7:1101:10084:15416:ATT+GGT	1185	2	47982455	51	16M13D13I67M	=	47982526	167	CAACTCCTGGGCTCAACGCTTCCACCTGCCTTGGCCTCCTAATGTGCTGGGATTACAGACATGCGCCACCATTCCTGGCCAGGCTAATGTTTTAAA	JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ	YA:Z:2:47982264:207M13D13I354M	PG:Z:MarkDuplicates	RG:Z:DH-ntrk-012-pl-T18	NM:i:28	YM:i:0	YO:Z:2:47982484:+:29S67M	MQ:i:60	AS:i:67	XS:i:45	YX:i:10
K00217:149:HT3N3BBXX:7:1101:10084:15416:ATT+GGT	1105	2	47982526	60	96M	=	47982455	-167	TTCCTGGCCAGGCTAATGTTTTAAATGTAAAATAAGAGTATTGATAATCCAGACGTTGTGTTGCATTTTATTCTTCTGTGCCCATGGTCATTTCAG	JJJJJJJJ<JFJJJJJJJJJJJAJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJAF	YA:Z:2:47982264:574M	PG:Z:MarkDuplicates	RG:Z:DH-ntrk-012-pl-T18	NM:i:1	MQ:i:51	AS:i:91	XS:i:21

Copy the base sequence from the first read, then copy the bases from the second read onto the next line. Try to find out where the two sequences match up by putting spacing before the second read.
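The same exercise can be done programmatically. The following is a minimal sketch, not part of the pipeline: `find_overlap_offset` is a hypothetical helper that looks for the longest end of the first read that exactly matches the start of the second read, and reports how many spaces are needed in front of the second read. It assumes the overlapping bases match exactly, which happens to be true for these two reads but will not hold in general.

```python
def find_overlap_offset(first, second, min_overlap=10):
    """Return the index in `first` where `second` begins to overlap, or None.

    Tries the longest possible overlap first, so the result is the longest
    suffix of `first` that equals a prefix of `second`.
    """
    longest = min(len(first), len(second))
    for overlap in range(longest, min_overlap - 1, -1):
        if first.endswith(second[:overlap]):
            return len(first) - overlap
    return None


# SEQ columns copied from the two example reads above.
read_a = ("CAACTCCTGGGCTCAACGCTTCCACCTGCCTTGGCCTCCTAATGTGCTGG"
          "GATTACAGACATGCGCCACCATTCCTGGCCAGGCTAATGTTTTAAA")
read_b = ("TTCCTGGCCAGGCTAATGTTTTAAATGTAAAATAAGAGTATTGATAATCC"
          "AGACGTTGTGTTGCATTTTATTCTTCTGTGCCCATGGTCATTTCAG")

offset = find_overlap_offset(read_a, read_b)
if offset is not None:
    # Print the reads stacked, with the second one shifted right.
    print(read_a)
    print(" " * offset + read_b)
```

The offset found this way is the same as the difference between the two start positions (47982526 - 47982455), because the deletion and insertion in the first read's Cigar string are the same length and cancel out.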

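Also, notice that the 47982526 in the first read matches the 47982526 in the second read, and the 47982455 in the first read matches the 47982455 in the second read. Each read records its own start position (column 4) and the start position of its mate (column 8). The 167 (column 9) is the template length, not the read length: it is the distance from the leftmost mapped base of the pair to the rightmost mapped base. Each of these reads is 96 bases long. The value is positive for the leftmost read of the pair and negative for the rightmost.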
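To see where the 167 comes from, the arithmetic can be sketched in a few lines. The example below assumes the SAM specification's definition of column 9 (TLEN) as the span from the leftmost to the rightmost mapped base of the pair, and its rule that the Cigar operations M, D, N, = and X consume reference bases. `reference_span` and `template_length` are hypothetical helper names for illustration, not functions from any library.

```python
import re

# Cigar operations that advance along the reference (per the SAM spec).
REFERENCE_CONSUMING_OPS = "MDN=X"


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this Cigar."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations
               if op in REFERENCE_CONSUMING_OPS)


def template_length(pos1, cigar1, pos2, cigar2):
    """Unsigned span from the leftmost to the rightmost mapped base."""
    leftmost = min(pos1, pos2)
    # End coordinates here are exclusive (one past the last mapped base).
    rightmost_end = max(pos1 + reference_span(cigar1),
                        pos2 + reference_span(cigar2))
    return rightmost_end - leftmost


# Positions and Cigar strings from the two example reads above.
print(reference_span("16M13D13I67M"))                               # 96
print(template_length(47982455, "16M13D13I67M", 47982526, "96M"))   # 167
```

In the Bam itself the value is signed: the leftmost read carries +167 and the rightmost read carries -167.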

  • The GATK best practices are a comprehensive set of standards describing the basic steps needed to generate variant calls from initial fastq data. We currently use mostly version 3 of the GATK tools, but may upgrade to version 4.

  • This repository builds on the steps implemented in the standard GATK pipeline, using two new libraries, Fulcrum and Marianas. These tools use the UMIs (unique molecular identifiers) on the ends of our reads to deduplicate reads that came from the same original DNA molecule in the sample, which gives us more accurate final "consensus reads".

  • It may be a bit confusing at first to differentiate between several versions of similar pipelines. In addition to the GATK pipeline from the Broad Institute, the pipeline you are currently viewing draws heavily on two other existing pipelines. The first is the IMPACT-Pipeline, which is associated with a number of configuration and run scripts in this repository: Innovation-Meta-Pipeline.

  • The second inspiration is our MSKCC pipeline, Roslin. Roslin has five modules, which are grouped together into the full pipeline found in setup/cwl/project-workflow.cwl. Roslin also depends on a separate set of setup and run scripts for configuring and running the pipeline. These scripts can be found in Roslin Core. They have also been adapted and used in this repository, e.g. pipeline-runner.sh and pipeline_submit.py.

  • Abbreviations to know. Bam filenames will often have a host of suffixes added to indicate the processing that has been done:

    • MD = Picard's Mark Duplicates step
    • IR = Abra's Indel Realignment step
    • FX = Picard's FixMateInformation step
    • CL = Usually refers to fastqs that have been CLipped (adapter sequence removed from the end)
    • BR = GATK's Base Quality Score Recalibration step
  • "Therefore if you look at a SAM/BAM file (for Illumina data at least), it should be the case that in any pair of reads with the 0x02 bit set (i.e. considered a proper pair), exactly one of the two reads will have the 0x10 bit set as well (i.e. it is reverse-complemented; again, see the SAM file spec). For the read with its 0x10 bit set, the “SEQ” listed in the SAM file will be the reverse complement of the original read as seen in the FASTQ. That means that in the SAM file, the SEQs for a pair of reads are now both being presented in forward orientation even though the “FR” orientation information is stored in the FLAG." http://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing/
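The FLAG values of the two example reads shown earlier (1185 and 1105, column 2) can be decoded bit by bit. This is a small sketch using the bit meanings from the SAM specification; `decode_flag` and `reverse_complement` are hypothetical helper names, and the short sequence passed to `reverse_complement` is an invented example rather than real data.

```python
# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (upper-case A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


print(decode_flag(1185))
# ['paired', 'mate_reverse_strand', 'second_in_pair', 'duplicate']
print(decode_flag(1105))
# ['paired', 'reverse_strand', 'first_in_pair', 'duplicate']

# For a read with 0x10 set, reverse-complementing SEQ gives back the
# sequence as it appeared in the fastq.
print(reverse_complement("AACG"))  # CGTT
```

The duplicate bit (0x400) on both reads is consistent with the PG:Z:MarkDuplicates tag that each of them carries.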

Marianas Read Name information

Marianas:ACT+TTA:2:48033828:4:3:2:48033899:4:3

These fields are:

Marianas
UMI1+UMI2
read 1 contig
read 1 start
read 1 Positive Strand Read Count
read 1 Negative Strand Read Count

read 2 contig
read 2 start
read 2 Positive Strand Read Count
read 2 Negative Strand Read Count
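
A read name in this layout can be split into its fields with a few lines of code. The sketch below is illustrative only and is not taken from Marianas itself: the `MarianasReadName` class and `parse_marianas_read_name` function are hypothetical names, the field order is taken from the list above, and it assumes that contig names never contain a colon.

```python
from dataclasses import dataclass


@dataclass
class MarianasReadName:
    umi1: str
    umi2: str
    read1_contig: str
    read1_start: int
    read1_plus_count: int    # read 1 Positive Strand Read Count
    read1_minus_count: int   # read 1 Negative Strand Read Count
    read2_contig: str
    read2_start: int
    read2_plus_count: int    # read 2 Positive Strand Read Count
    read2_minus_count: int   # read 2 Negative Strand Read Count


def parse_marianas_read_name(name):
    """Split a colon-separated Marianas read name into named fields."""
    fields = name.split(":")
    if len(fields) != 10 or fields[0] != "Marianas":
        raise ValueError(f"Not a Marianas read name: {name!r}")
    umi1, umi2 = fields[1].split("+")
    return MarianasReadName(
        umi1, umi2,
        fields[2], int(fields[3]), int(fields[4]), int(fields[5]),
        fields[6], int(fields[7]), int(fields[8]), int(fields[9]),
    )


parsed = parse_marianas_read_name(
    "Marianas:ACT+TTA:2:48033828:4:3:2:48033899:4:3")
print(parsed.umi1, parsed.umi2)        # ACT TTA
print(parsed.read1_start)              # 48033828
print(parsed.read2_plus_count)         # 4
```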