Ben Vandervalk edited this page Feb 21, 2014 · 17 revisions

Adj

Sequence overlap graph in ABySS adj (adjacency) format.

Example:

23 44 198       ; 3193- 56- [d=-23]     ; 3681-
25 30 1045      ; 3983- 1794- [d=-28] 2808+ [d=-28] 3136- [d=-28]       ; 2699+ 4758+
27 54 175       ; 1255+ 4657-   ;
28 51 3854      ; 875+ 3725-    ; 1314- [d=-21]
29 73 1151      ; 3015+ ; 2199+
30 34 4896      ; 229- 4236+ [d=-26] 4060+ [d=-24]      ; 2091+ 4267+
31 58 2483      ; 1454+ [d=-28] ; 3453+ [d=-28]
32 33 530       ; 2566- ; 3453+ [d=-28]

The .adj files generated by ABySS describe the sequence overlap graph at each stage of an assembly. In the sequence overlap graph, each node represents a sequence (e.g. a contig) and each edge represents a perfect overlap between the ends of two sequences. In most cases, the length of the sequence overlap is k - 1 bases.

Each line of an .adj file consists of 3 fields, separated by semicolons (';').

The first field (e.g. "28 51 3854") provides information about the subject sequence and consists of 3 parts: <SEQ_ID> <SEQ_LEN> <KMERS>, where SEQ_ID is a unique identifier for the sequence assigned by ABySS, SEQ_LEN is the length of the sequence in bases, and KMERS is the number of k-mers that mapped to the sequence during assembly (i.e. the sum of the k-mer multiplicities over all k-mers in the sequence).

The second and third fields (e.g. "3193- 56- [d=-23]", "3681-") list the SEQ_IDs of sequences that overlap the subject sequence. Each field is a whitespace-separated list of SEQ_IDs, each of which has a + or - suffix to indicate the sense of the sequence that produces the overlap. The +/- sense of a given sequence is determined by the form it takes in the FASTA file corresponding to the .adj file. Using the naming conventions of the ABySS output files, this correspondence should usually be clear (e.g. "myassembly-1.adj" corresponds to "myassembly-1.fa"). The sense of the sequence as listed in the FASTA file is considered to be the + sense.

By default, the length of the overlap between two sequences is assumed to be k - 1 bases. If this is not the case, an additional distance specifier (e.g. "[d=-23]") must be inserted following the SEQ_ID to indicate that the overlap is of a different length. A negative distance value gives the overlap size in bases (so "[d=-23]" is an overlap of 23 bases), whereas a positive distance gives a gap size in bases.

It is important to note that the order of the second and third fields is exactly the opposite of what one would expect; the second field lists sequences that overlap the subject sequence on the right side and the third field lists sequences that overlap the subject sequence on the left side.
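The field layout described above can be sketched as a small parser. This is only an illustration, not code from ABySS: the function names `parse_edges` and `parse_adj_line` are hypothetical, the k-mer size `k` must be supplied by the caller, and the default overlap of k - 1 bases is stored as the distance -(k - 1), following the convention that negative distances are overlaps.

```python
def parse_edges(field, k):
    """Parse one neighbour list, e.g. '3193- 56- [d=-23]'.

    Returns (seq_id, sense, distance) tuples. Neighbours without an
    explicit [d=...] specifier get the default distance -(k - 1).
    """
    edges = []
    for token in field.split():
        if token.startswith("[d="):
            # A distance specifier applies to the neighbour just before it.
            seq_id, sense, _ = edges[-1]
            edges[-1] = (seq_id, sense, int(token[3:-1]))
        else:
            # The last character is the +/- sense; the rest is the SEQ_ID.
            edges.append((token[:-1], token[-1], -(k - 1)))
    return edges


def parse_adj_line(line, k):
    """Split one .adj line into its three semicolon-separated fields."""
    head, right, left = (part.strip() for part in line.split(";"))
    seq_id, seq_len, kmers = head.split()
    return {
        "id": seq_id,
        "length": int(seq_len),
        "kmers": int(kmers),
        # The second field holds right-side overlaps, the third left-side.
        "right": parse_edges(right, k),
        "left": parse_edges(left, k),
    }


# k=30 is an assumed value chosen purely for illustration.
record = parse_adj_line("28 51 3854      ; 875+ 3725-    ; 1314- [d=-21]", k=30)
print(record["right"], record["left"])
```

Keeping the right-side and left-side lists under explicit names avoids relying on field position, which is easy to get backwards in this format.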

The .adj format is an ABySS-specific format for describing graphs; where possible, the more standard .dot format should be used in its place.

AGP

NCBI GenBank Accessioned Golden Path

CSV

Tabular data in Comma-separated values format

Dist

Distance estimates in ABySS dist format (similar to adj format)

Dist.dot

Distance estimates in Graphviz dot format

Dot

Sequence overlap graph in Graphviz dot format

FA

Contig and scaffold sequences in FASTA format

Hist

A histogram of the fragment size distribution in tab-separated values format, without a header

Column  Description
1       Fragment size
2       Count
  • Positive fragment sizes are oriented forward-reverse (FR).
  • Negative fragment sizes are oriented reverse-forward (RF).
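As a rough sketch of how such a histogram might be read, the following splits the rows into FR and RF counts by the sign of the fragment size. The function name `read_hist` is hypothetical, and counting a fragment size of exactly 0 as FR is an assumption, since the description above only covers positive and negative sizes.

```python
def read_hist(lines):
    """Read a headerless two-column, tab-separated fragment size histogram.

    Returns two dicts mapping absolute fragment size to count:
    one for forward-reverse (FR) pairs, one for reverse-forward (RF) pairs.
    """
    fr, rf = {}, {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        size_str, count_str = line.split("\t")
        size, count = int(size_str), int(count_str)
        if size < 0:
            rf[-size] = count
        else:
            # Assumption: a size of 0, if present, is counted as FR.
            fr[size] = count
    return fr, rf


# Made-up rows for illustration only.
fr, rf = read_hist(["-150\t3\n", "200\t10\n", "201\t12\n"])
print(sum(fr.values()), sum(rf.values()))
```

Any iterable of lines works as input, including an open file handle.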

MD

Reports in Markdown format

Path

Scaffolds in ABySS PATH format

Each line takes one of two forms. If a line consists of a single identifier, the specified contig is removed from the assembly. Otherwise, the first column of the line is the ID of the new scaffold, and the remainder of the line is a sequence of contig IDs and their orientations.
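The two line forms can be told apart with a short sketch like the one below. The function name `parse_path_line` and the contig IDs in the usage lines are hypothetical. It also assumes that orientation is written as a trailing + or -, as in the adj format; any token without such a suffix is kept verbatim with no orientation.

```python
def parse_path_line(line):
    """Classify one PATH line as a contig removal or a new scaffold."""
    tokens = line.split()
    if len(tokens) == 1:
        # A lone identifier means that contig is removed from the assembly.
        return ("remove", tokens[0], [])
    steps = []
    for token in tokens[1:]:
        if token[-1] in "+-":
            # Assumed form: contig ID followed by its orientation.
            steps.append((token[:-1], token[-1]))
        else:
            # Unrecognised token: keep it as-is, with no orientation.
            steps.append((token, None))
    return ("scaffold", tokens[0], steps)


print(parse_path_line("17"))
print(parse_path_line("100\t5+ 12- 8+"))
```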

SAM

Reads aligned to contig/scaffold sequences in Sequence Alignment/Map format

Example:

@SQ     SN:5105 LN:122
@SQ     SN:5106 LN:92
@SQ     SN:5107 LN:186
*       161     4       1       32      6S32M63S        77      1       77      *       *
*       161     4       1       32      40S32M29S       215     1       134     *       *
*       129     4       1       32      69S32M  358     1       50      *       *
*       161     4       1       31      31M70S  1390    1       34      *       *
*       161     4       1       32      6S32M63S        1390    1       46      *       *
*       161     4       1       32      6S32M63S        77      1       59      *       *
*       161     4       1       32      6S32M63S        77      1       55      *       *
*       177     4       1       32      58S32M11S       147     1       13      *       *
*       161     4       1       30      13S30M58S       77      1       60      *       *

The SAM format is used by ABySS to describe alignments of reads to assembled sequences at different stages of the assembly. As of ABySS version 1.3.8, the reads are aligned to the assembled sequences twice: once during the construction of the contigs, and once during the construction of the scaffolds.

By default, ABySS omits field 1 (QNAME, the ID of the aligned sequence), field 10 (SEQ, the aligned sequence), and field 11 (QUAL, the quality string of the aligned sequence) from any generated SAM data, placing a * in these fields instead. This is done because ABySS does not need these fields to calculate distance estimates, and omitting them greatly reduces the size of the SAM data. (To generate SAM data that contains all of the fields, ABySS may be compiled with the --enable-samseqqual option given at the configure stage.)
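For readers processing this output themselves, here is a minimal sketch of reading one alignment line while allowing for the omitted fields. The field names are the 11 mandatory columns of the SAM format; the function name `parse_sam_line` is hypothetical, and mapping every * value to `None` is a design choice of this sketch rather than ABySS behaviour.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields of a SAM alignment line.

    Fields holding '*' (e.g. QNAME, SEQ and QUAL in default ABySS output)
    are returned as None. Optional fields after column 11 are ignored.
    """
    values = line.rstrip("\n").split("\t")[:11]
    record = {}
    for name, value in zip(SAM_FIELDS, values):
        if value == "*":
            record[name] = None
        elif name in INT_FIELDS:
            record[name] = int(value)
        else:
            record[name] = value
    return record


# First alignment line of the example above, written with explicit tabs.
rec = parse_sam_line("*\t161\t4\t1\t32\t6S32M63S\t77\t1\t77\t*\t*")
print(rec["FLAG"], rec["CIGAR"], rec["SEQ"])
```

Header lines (those starting with @, such as @SQ) are not alignment records and should be filtered out before calling the function.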

The ABySS assembly pipeline does not generate any output SAM files by default, because the files tend to be very large. Instead, ABySS streams the SAM data through a Unix command pipeline to generate the distance estimates that are used to link contigs and scaffolds. Only the distance estimates are saved to disk (.dist and .dist.dot files). However, an ABySS user may force generation of output SAM files by specifying the "pe-sam" and/or "mp-sam" targets on the "abyss-pe" command line, e.g.:

abyss-pe name=myassembly k=40 in='read1.fastq read2.fastq' pe-sam mp-sam scaffolds

The "pe-sam" and "mp-sam" targets will generate the files myassembly-3.sam.gz and myassembly-6.sam.gz, respectively. There are also "pe-bam" and "mp-bam" targets for users who wish to generate the equivalent BAM (compressed SAM) files. See the abyss-pe man page for more information.

Stats

Statistics of contig/scaffold contiguity in TSV, CSV and Markdown formats

Tab

Tabular data in tab-separated values format, including a one-line header
