ABySS File Formats
Sequence overlap graph in ABySS adj (adjacency) format.
Example:
23 44 198 ; 3193- 56- [d=-23] ; 3681-
25 30 1045 ; 3983- 1794- [d=-28] 2808+ [d=-28] 3136- [d=-28] ; 2699+ 4758+
27 54 175 ; 1255+ 4657- ;
28 51 3854 ; 875+ 3725- ; 1314- [d=-21]
29 73 1151 ; 3015+ ; 2199+
30 34 4896 ; 229- 4236+ [d=-26] 4060+ [d=-24] ; 2091+ 4267+
31 58 2483 ; 1454+ [d=-28] ; 3453+ [d=-28]
32 33 530 ; 2566- ; 3453+ [d=-28]
The .adj files generated by ABySS describe the sequence overlap graph at each stage of an assembly. In the sequence overlap graph, each node represents a sequence (e.g. a contig) and each edge represents a perfect overlap between the ends of two sequences. In most cases, the length of the sequence overlap is k - 1 bases.
An .adj file consists of 3 fields per line, separated by semicolons (';').
The first field (e.g. "28 51 3854") provides information about the subject sequence and consists of 3 parts: <SEQ_ID> <SEQ_LEN> <KMERS>, where SEQ_ID is a unique identifier for the sequence assigned by ABySS, SEQ_LEN is the length of the sequence in bases, and KMERS is the number of k-mers that mapped to the sequence during assembly (i.e. the sum of the multiplicities of the k-mers in the sequence).
The second and third fields (e.g. "3193- 56- [d=-23]", "3681-") list the SEQ_IDs of sequences that overlap the subject sequence. Each field consists of a list of whitespace-separated SEQ_IDs, each of which has a + or - suffix to indicate the sense of the sequence that produces the overlap. The +/- sense of a given sequence is determined by the form it takes in the FASTA file corresponding to the .adj file. Given the naming conventions of the ABySS output files, this correspondence should usually be clear (e.g. "myassembly-1.adj" corresponds to "myassembly-1.fa"). The sense of the sequence listed in the FASTA file is considered to be the + sense. By default, the length of the overlap between two sequences is assumed to be k - 1 bases. If this is not the case, an additional distance specifier (e.g. "[d=-23]") must be inserted following the SEQ_ID to indicate that the overlap is of a different length. A negative distance value indicates an overlap, with its absolute value giving the overlap size in bases (e.g. "[d=-23]" is an overlap of 23 bases), whereas a positive distance indicates a gap size in bases.
Note that the order of the second and third fields is the opposite of what one might expect: the second field lists sequences that overlap the subject sequence on the right side, and the third field lists sequences that overlap the subject sequence on the left side.
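The field layout described above can be sketched as a small Python parser. This is an illustrative sketch, not part of ABySS: the names `Edge`, `AdjRecord`, `parse_adj_line` and `overlap_length` are hypothetical, and the code assumes that every line follows the three-field layout exactly and that `[d=...]` is the only bracketed attribute that appears.

```python
from typing import List, NamedTuple, Optional


class Edge(NamedTuple):
    seq_id: str
    sense: str               # "+" or "-"
    distance: Optional[int]  # None means the default overlap of k - 1 bases


class AdjRecord(NamedTuple):
    seq_id: str
    seq_len: int
    kmers: int
    right: List[Edge]  # second field: overlaps on the right side
    left: List[Edge]   # third field: overlaps on the left side


def parse_edges(field: str) -> List[Edge]:
    """Parse one whitespace-separated list of SEQ_IDs with optional [d=N]."""
    edges: List[Edge] = []
    for token in field.split():
        if token.startswith("[d=") and token.endswith("]"):
            if not edges:
                raise ValueError("distance specifier without a preceding SEQ_ID")
            # The specifier applies to the SEQ_ID immediately before it.
            edges[-1] = edges[-1]._replace(distance=int(token[3:-1]))
        elif token[-1] in "+-":
            edges.append(Edge(token[:-1], token[-1], None))
        else:
            raise ValueError(f"SEQ_ID without a +/- suffix: {token!r}")
    return edges


def parse_adj_line(line: str) -> AdjRecord:
    """Split a line into its 3 semicolon-separated fields and parse each."""
    fields = line.split(";")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}")
    seq_id, seq_len, kmers = fields[0].split()
    return AdjRecord(seq_id, int(seq_len), int(kmers),
                     parse_edges(fields[1]), parse_edges(fields[2]))


def overlap_length(edge: Edge, k: int) -> int:
    """Overlap in bases; a non-negative distance is a gap, so the overlap is 0."""
    if edge.distance is None:
        return k - 1
    return -edge.distance if edge.distance < 0 else 0
```

For example, `parse_adj_line("28 51 3854 ; 875+ 3725- ; 1314- [d=-21]")` yields two right-side edges with the default overlap and one left-side edge with an explicit 21-base overlap. The value of k is not stored in the file, so it has to be supplied by the caller.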
The .adj format is an ABySS-specific format for describing graphs and should preferably be replaced by the more standard .dot format.
NCBI GenBank Accessioned Golden Path
Tabular data in Comma-separated values format
Distance estimates in ABySS dist format (similar to adj format)
Distance estimates in Graphviz dot format
Sequence overlap graph in Graphviz dot format
Contig and scaffold sequences in FASTA format
A histogram of the fragment size distribution in tab-separated values format, without a header
Column | Description |
---|---|
1 | Fragment size |
2 | Count |
- Positive fragment sizes are oriented forward-reverse (FR).
- Negative fragment sizes are oriented reverse-forward (RF).
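As a rough illustration of the sign convention above, the following Python sketch splits a fragment size histogram into FR and RF counts. The function name `read_fragment_histogram` is hypothetical, and placing a fragment size of exactly zero in the FR group is an assumption, since the description above only covers positive and negative sizes.

```python
from typing import Dict, Iterable, Tuple


def read_fragment_histogram(
    lines: Iterable[str],
) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Return (fr, rf) dicts mapping absolute fragment size to count.

    Each input line holds two tab-separated columns with no header:
    fragment size, then count.
    """
    fr: Dict[int, int] = {}
    rf: Dict[int, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        size_text, count_text = line.split("\t")
        size, count = int(size_text), int(count_text)
        if size < 0:
            # Negative sizes are reverse-forward (RF) pairs.
            rf[-size] = rf.get(-size, 0) + count
        else:
            # Positive sizes are forward-reverse (FR); zero is assumed FR here.
            fr[size] = fr.get(size, 0) + count
    return fr, rf
```

The function accepts any iterable of lines, so an open file handle can be passed directly, e.g. `with open(path) as f: fr, rf = read_fragment_histogram(f)`.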
Reports in Markdown format
Scaffolds in ABySS PATH format
If the line is composed of a single identifier, the specified contig is removed from the assembly. Otherwise, the first column of the line is the ID of the new scaffold, and the remainder of the line is a sequence of contig IDs and their orientation.
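The two cases described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the ABySS implementation: the function name `parse_path_line` is hypothetical, columns are assumed to be whitespace-separated, and orientation is assumed to be written as a + or - suffix on each contig ID, as in the adj format. Any token without such a suffix is kept verbatim with an orientation of `None`.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]


def parse_path_line(line: str) -> Tuple[str, str, List[Step]]:
    """Classify one PATH line.

    Returns ("remove", contig_id, []) for a line with a single identifier,
    or ("scaffold", scaffold_id, steps) otherwise, where each step is a
    (contig_id, orientation) pair.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    if len(tokens) == 1:
        # A lone identifier means the contig is removed from the assembly.
        return ("remove", tokens[0], [])
    steps: List[Step] = []
    for token in tokens[1:]:
        if token[-1] in "+-":
            steps.append((token[:-1], token[-1]))
        else:
            # Assumption: tokens without a +/- suffix are passed through as-is.
            steps.append((token, None))
    return ("scaffold", tokens[0], steps)
```

For instance, `parse_path_line("42")` is classified as a removal of contig 42, while `parse_path_line("100 7+ 9- 12+")` describes a new scaffold 100 built from three oriented contigs.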
Reads aligned to contig/scaffold sequences in Sequence Alignment/Map format
Statistics of contig/scaffold contiguity in TSV, CSV and Markdown formats
Tabular data in tab-separated values format, including a one-line header