Commit
Complete rewrite for v2. See the changelog for more details.
1 parent
f03bccf
commit f44c498
Showing 37 changed files with 3,624 additions and 2,456 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,29 +1,27 @@ | ||
name = "FASTX" | ||
uuid = "c2308a5c-f048-11e8-3e8a-31650f418d12" | ||
authors = [ | ||
"Sabrina J. Ward <[email protected]>", | ||
"Jakob N. Nissen <[email protected]>" | ||
] | ||
version = "1.3.0" | ||
authors = ["Sabrina J. Ward <[email protected]>", "Jakob N. Nissen <[email protected]>"] | ||
version = "2.0.0" | ||
|
||
[deps] | ||
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b" | ||
BioGenerics = "47718e42-2ac5-11e9-14af-e5595289c2ea" | ||
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59" | ||
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9" | ||
ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb" | ||
StringViews = "354b36f9-a18e-4713-926e-db85100087ba" | ||
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa" | ||
|
||
[compat] | ||
Automa = "0.7, 0.8" | ||
BioGenerics = "0.1" | ||
Automa = "0.8" | ||
BioGenerics = "0.1.2" | ||
BioSequences = "3" | ||
BioSymbols = "5" | ||
ScanByte = "0.3" | ||
StringViews = "1" | ||
TranscodingStreams = "0.9.5" | ||
julia = "1.6" | ||
|
||
[extras] | ||
FormatSpecimens = "3372ea36-2a1a-11e9-3eb7-996970b6ffbd" | ||
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" | ||
|
||
[targets] | ||
test = ["Test", "FormatSpecimens"] | ||
test = ["Random"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,28 +1,29 @@ | ||
using Documenter, FASTX | ||
|
||
# Build documentation. | ||
DocMeta.setdocmeta!(FASTX, :DocTestSetup, :(using FASTX, BioSequences); recursive=true) | ||
|
||
makedocs( | ||
modules = [FASTX], | ||
format = Documenter.HTML(), | ||
modules = [FASTX, FASTX.FASTQ, FASTX.FASTA], | ||
sitename = "FASTX.jl", | ||
doctest = false, | ||
strict = false, | ||
doctest = true, | ||
pages = [ | ||
"Home" => "index.md", | ||
"Manual" => [ | ||
"FASTA formatted files" => "manual/fasta.md", | ||
"FASTQ formatted files" => "manual/fastq.md" | ||
"Overview" => Any[ | ||
"Overview" => "index.md", | ||
"Records" => "records.md", | ||
"File I/O" => "files.md", | ||
], | ||
"Library" => [ | ||
"Public" => "lib/public.md" | ||
] | ||
"FASTA" => "fasta.md", | ||
"FASTQ" => "fastq.md", | ||
"FAI" => "fai.md" | ||
], | ||
authors = "Ben J. Ward, The BioJulia Organisation and other contributors." | ||
authors = "Sabrina J. Ward, Jakob N. Nissen, The BioJulia Organisation and other contributors.", | ||
checkdocs = :all | ||
) | ||
|
||
deploydocs( | ||
repo = "github.com/BioJulia/FASTX.jl.git", | ||
push_preview = true, | ||
deps = nothing, | ||
make = nothing | ||
) | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,148 @@ | ||
```@meta | ||
CurrentModule = FASTX | ||
DocTestSetup = quote | ||
using FASTX | ||
end | ||
``` | ||
|
||
# FASTA index (FAI files) | ||
FASTX.jl supports FASTA index (FAI) files. | ||
When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily. | ||
|
||
See the FAI specification here: http://www.htslib.org/doc/faidx.html | |
|
||
### Making an `Index` | ||
A FASTA index (of type `Index`) can be constructed from an `IO` object representing a FAI file: | ||
|
||
```jldoctest | ||
julia> io = IOBuffer("seqname\t9\t2\t6\t8"); | ||
julia> Index(io) isa Index | ||
true | ||
``` | ||
|
||
Or from a path representing a FAI file: | ||
```julia | ||
julia> Index("/path/to/file.fai") | ||
``` | ||
|
||
Alternatively, a FASTA file can be indexed to produce an `Index` using `faidx`. | ||
|
||
```jldoctest | ||
julia> faidx(IOBuffer(">abc\nTAGA\nTA")) | ||
Index: | ||
abc 6 5 4 5 | ||
``` | ||
|
||
A FASTA file on disk can also be indexed, with the index immediately written to a FAI file, | |
by passing the file's path as an `AbstractString` to `faidx`: | |
|
||
```julia | ||
julia> ispath("/path/to/fasta.fna.fai") | ||
false | ||
|
||
julia> faidx("/path/to/fasta.fna"); | ||
|
||
julia> ispath("/path/to/fasta.fna.fai") | ||
true | ||
``` | ||
|
||
Note that the restrictions on FASTA files for indexing are stricter than Julia's FASTA parser, | ||
so not all FASTA files that can be read can be indexed: | ||
|
||
```jldoctest | ||
julia> str = ">\0\n\0"; | ||
julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord | ||
true | ||
julia> Index(IOBuffer(str)) | ||
ERROR | ||
[...] | ||
``` | ||
|
||
### Attaching an `Index` to a `Reader` | ||
When opening a `FASTA.Reader`, you can attach an `Index` by passing the `index` keyword. | ||
You can either pass an `Index` directly, or else an `IO`, in which case an `Index` will be parsed from the `IO`, | ||
or an `AbstractString` that will be interpreted as a path to a FAI file: | ||
|
||
```jldoctest | ||
julia> str = ">abc\nTAG\nTA"; | ||
julia> idx = faidx(IOBuffer(str)); | ||
julia> rdr = FASTAReader(IOBuffer(str), index=idx); | ||
``` | ||
|
||
You can also add an index to an existing reader using the `index!` function: | |
|
||
```@docs | ||
index! | ||
``` | ||
|
||
### Seeking using an `Index` | ||
With an `Index` attached to a `Reader`, you can perform the following operations in O(1) time. | |
In these examples, we will use the following FASTA file: | ||
|
||
``` | ||
>seq1 sequence | ||
TAGAAAGCAA | ||
TTAAAC | ||
>seq2 sequence | ||
AACGG | ||
UUGC | ||
``` | ||
|
||
```@meta | ||
DocTestSetup = quote | ||
using FASTX | ||
data = """>seq1 sequence | ||
TAGAAAGCAA | ||
TTAAAC | ||
>seq2 sequence | ||
AACGG | ||
UUGC | ||
""" | ||
reader = FASTA.Reader(IOBuffer(data), index=faidx(IOBuffer(data))) | ||
end | ||
``` | ||
|
||
* Seek to a Record using its identifier: | ||
```jldoctest | ||
julia> seekrecord(reader, "seq2"); | ||
julia> record = first(reader); sequence(record) | ||
"AACGGUUGC" | ||
``` | ||
|
||
* Directly extract a record using its identifier: | |
```jldoctest | ||
julia> record = reader["seq1"]; | ||
julia> description(record) | ||
"seq1 sequence" | ||
``` | ||
|
||
* Extract a sequence directly without loading the whole record into memory. | |
This is useful for huge sequences such as chromosomes: | |
```jldoctest | ||
julia> extract(reader, "seq1", 3:5) | ||
"GAA" | ||
``` | ||
|
||
```@meta | ||
DocTestSetup = nothing | ||
``` | ||
|
||
FASTX.jl does not yet support indexing FASTQ files. | ||
|
||
### Reference: | ||
```@docs | ||
faidx | ||
seekrecord | ||
extract | ||
Index | ||
``` |
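
The constant-time seeking described above follows from the arithmetic that a FAI entry allows. The block below is a minimal sketch of that arithmetic, not FASTX.jl's implementation. It assumes the four numeric columns have the meaning given in the faidx specification linked above (sequence length, byte offset of the first base, bases per line, bytes per line). The names `FaiEntry` and `byte_offset` are hypothetical and are not part of the FASTX.jl API.

```julia
# Hypothetical illustration of FAI offset arithmetic; not part of FASTX.jl.
struct FaiEntry
    name::String
    length::Int      # total number of bases in the record
    offset::Int      # zero-based byte offset of the first base
    linebases::Int   # bases on each full sequence line
    linewidth::Int   # bytes on each full sequence line, newline included
end

# Zero-based byte offset in the file of the 1-based sequence position `pos`.
function byte_offset(e::FaiEntry, pos::Integer)
    1 <= pos <= e.length || throw(ArgumentError("position $pos is out of range"))
    # Number of full lines before `pos`, and the column within its line.
    line, col = divrem(pos - 1, e.linebases)
    return e.offset + line * e.linewidth + col
end

# The entry printed by the `faidx` doctest above for this FASTA text.
data = ">abc\nTAGA\nTA"
entry = FaiEntry("abc", 6, 5, 4, 5)

# Fetch bases 3:5 by computing each byte position directly.
bases = [Char(codeunit(data, byte_offset(entry, p) + 1)) for p in 3:5]
println(String(bases))  # prints GAT
```

Because each position maps to a byte offset with one division and a few additions, no scanning of the file is needed, whatever the size of the record.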
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
```@meta | ||
CurrentModule = FASTX | ||
DocTestSetup = quote | ||
using FASTX | ||
end | ||
``` | ||
|
||
# FASTA formatted files | ||
__NB: First read the overview in the sidebar__ | ||
|
||
FASTA is a text-based file format for representing biological sequences. | ||
A FASTA file stores a list of sequence records with name, description, and | ||
sequence. | ||
|
||
The template of a sequence record is: | ||
|
||
``` | ||
>{description} | ||
{sequence} | ||
``` | ||
|
||
Here, the "identifier" is the first part of the description, up to the first whitespace | |
(or the entire description if there is no whitespace). | |
|
||
Here is an example of a chromosomal sequence: | ||
|
||
``` | ||
>chrI chromosome 1 | ||
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC | ||
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA | ||
``` | ||
|
||
## The `FASTARecord` | ||
FASTA records are, by design, very lax in what they can contain. | ||
They can contain almost arbitrary byte sequences, including invalid Unicode and trailing whitespace on their sequence lines, which will be interpreted as part of the sequence. | |
If you want more certainty about the format, you can either check the content of the sequences with a regex or (preferably) convert them to the desired `BioSequence` type. | |
|
||
```@docs | ||
FASTA.Record | ||
``` | ||
|
||
### Reference: | ||
```@docs | ||
identifier | ||
description | ||
sequence | ||
``` | ||
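
The relationship between a description and its identifier, as stated at the top of this page, can be sketched in plain Julia. This is only an illustration: `identifier_of` is a hypothetical helper, and it assumes "whitespace" means any character for which `isspace` returns true, which may be broader than what the parser uses. In practice, call `identifier` and `description` on a record instead.

```julia
# Hypothetical helper; FASTX.jl provides `identifier(record)` for real records.
function identifier_of(description::AbstractString)
    # Index of the first whitespace character, or `nothing` if there is none.
    i = findfirst(isspace, description)
    i === nothing && return String(description)
    # Everything strictly before the first whitespace character.
    return String(description[1:prevind(description, i)])
end

println(identifier_of("chrI chromosome 1"))  # prints chrI
println(identifier_of("seq1"))               # prints seq1
```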
|
||
## `FASTAReader` and `FASTAWriter` | ||
`FASTAWriter` can optionally be passed the keyword `width` to control the line width. | ||
If this is zero or negative, it will write all record sequences on a single line. | ||
Otherwise, it will wrap lines to the given maximum width. | |
|
||
### Reference: | ||
```@docs | ||
FASTA.Reader | ||
FASTA.Writer | ||
validate_fasta | ||
``` |
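
The effect of the `width` keyword described above can be pictured with a small stand-alone sketch. It is not the `FASTA.Writer` implementation: `wrap_sequence` is a hypothetical helper, and it assumes an ASCII sequence so that byte indices and character indices coincide.

```julia
# Hypothetical helper showing the wrapping rule; not part of FASTX.jl.
function wrap_sequence(seq::AbstractString, width::Integer)
    # Zero or negative width: the whole sequence stays on one line.
    width <= 0 && return [String(seq)]
    n = ncodeunits(seq)
    # Cut the sequence into chunks of at most `width` characters (ASCII assumed).
    return [String(seq[i:min(i + width - 1, n)]) for i in 1:width:n]
end

for line in wrap_sequence("TAGAAAGCAATTAAAC", 10)
    println(line)
end
# prints:
# TAGAAAGCAA
# TTAAAC
```

With `width = 0`, the same call returns a single-element vector containing the full sequence.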