Rewrite for v2 #68

Merged
merged 80 commits into from
Aug 15, 2022
Commits
17350e4
Remove autodetection of FASTA sequence type
jakobnissen Nov 29, 2021
0204c8e
Disallow record construction from anything (#60)
jakobnissen Jan 4, 2022
1121700
Update Julia requirement to 1.6 (#64)
jakobnissen Jan 8, 2022
129a7c3
Export all user-facing functions
jakobnissen Oct 22, 2021
3fd61ff
Bump BioSequences/BioSymbols to v3/v5 (#66)
jakobnissen Feb 19, 2022
04921c6
Return header, identifier and desc as string views
jakobnissen Jan 4, 2022
fb35d5f
Update docstrings of functions returning string views
jakobnissen Jan 4, 2022
a85aaa2
Fix tests
jakobnissen Feb 20, 2022
0a74051
Make identifier and header always return a string
jakobnissen Feb 20, 2022
cfa37a8
Make description return Union{Nothing, StringView}
jakobnissen Feb 20, 2022
87906a7
Make sequence mandatory in FASTA
jakobnissen Feb 20, 2022
1428cb1
Make sequence mandatory in FASTQ
jakobnissen Feb 20, 2022
b698270
Deprecate now-removed functionality
jakobnissen Feb 20, 2022
07782be
Apply suggestions from code review
TransGirlCodes Mar 29, 2022
5130e27
WIP: Simplify layout
jakobnissen Jun 8, 2022
8964cd7
Rewrite FASTA part
jakobnissen Jul 15, 2022
2d2f342
FASTA tests: Part 1
jakobnissen Jul 15, 2022
e7e6684
Migrate to Automa v1
jakobnissen Jul 26, 2022
dafb50b
Add seqview function
jakobnissen Jul 26, 2022
4b47ca9
More tests and tweaks to FASTA
jakobnissen Jul 26, 2022
6888127
FASTA writer tests
jakobnissen Jul 26, 2022
3ca0c38
Add default seqview
jakobnissen Jul 26, 2022
713106f
Fix minimal fasta and test flush
jakobnissen Jul 26, 2022
f057edc
Make common sequence logic for FASTA/Q
jakobnissen Jul 26, 2022
23bec6f
Move some copying logic to common FASTX
jakobnissen Jul 26, 2022
6a194d6
Rewrite FASTQ
jakobnissen Jul 26, 2022
835a5b5
Make Readers carry own Record
jakobnissen Jul 26, 2022
13a822d
Add FASTQ tests
jakobnissen Jul 27, 2022
8a4b973
Add equality FASTA test and FASTX common tests
jakobnissen Jul 27, 2022
4d7ee74
More tests
jakobnissen Jul 27, 2022
492c4de
Have FASTQ writer optionally respect record extra header
jakobnissen Jul 27, 2022
2c47c85
Squash: Export QualityEncoding
jakobnissen Jul 27, 2022
03ddf50
More FASTQ tests
jakobnissen Jul 27, 2022
f513e63
More changes
jakobnissen Jul 27, 2022
3a696fd
Handle unparseable format specimens better in tests
jakobnissen Jul 28, 2022
7a11472
Remove apparently unneeded BioGenerics
jakobnissen Jul 28, 2022
9bb171a
Minor tweaks
jakobnissen Jul 28, 2022
833492c
WIP index: Add Automa index
jakobnissen Jul 28, 2022
7885282
Add FASTQ->FASTA conversion
jakobnissen Jul 28, 2022
bf27650
Move common tests out of FASTQ tests
jakobnissen Jul 28, 2022
e97740b
Add more exports
jakobnissen Jul 28, 2022
591089d
Refactor read! vs iterate
jakobnissen Jul 28, 2022
ac68e4e
Remove FASTQRead
jakobnissen Jul 28, 2022
6b7f628
Remove seq_transform
jakobnissen Jul 28, 2022
6a3e1c8
Do not allow > in FASTA sequence
jakobnissen Jul 28, 2022
ad49de9
Add docstrings
jakobnissen Jul 28, 2022
379d146
Add quality(String, record, part)
jakobnissen Jul 28, 2022
75bd2d6
Remove BioSymbols dep
jakobnissen Jul 28, 2022
ab2d626
Make quality return string views like other accessors
jakobnissen Jul 29, 2022
674e7b9
Add indexing tests
jakobnissen Jul 29, 2022
f7c970f
Add test-specific dependencies
jakobnissen Jul 29, 2022
0594c24
Fixup using JET and Aqua
jakobnissen Jul 29, 2022
8ac8f19
Do now allow records to be constructed from String, use parse
jakobnissen Jul 29, 2022
6a0b453
Small tweaks to docstrings and exports
jakobnissen Jul 29, 2022
a527fd3
Rewrite documentation
jakobnissen Jul 30, 2022
21ed306
Add FASTA.Record!
jakobnissen Jul 30, 2022
506a70d
Test making records from AbstractString
jakobnissen Aug 3, 2022
b2d1e88
Allow constructing FASTQ with string quality
jakobnissen Aug 3, 2022
cba936e
Add FASTA indexer
jakobnissen Aug 5, 2022
ea60d67
Add validator functions
jakobnissen Aug 5, 2022
cbe28d2
Improve code coverage and fix doctests
jakobnissen Aug 8, 2022
c2ec9e8
Use Random as test dep
jakobnissen Aug 8, 2022
0bedad9
Update CHANGELOG for v2
jakobnissen Aug 8, 2022
7f87995
Add DOCUMENTER_KEY
jakobnissen Aug 8, 2022
dd0971f
Unexport seqlen
jakobnissen Aug 8, 2022
7f0a64f
Touchup docs
jakobnissen Aug 9, 2022
be6ea71
Fix more doctests
jakobnissen Aug 9, 2022
6e13aa2
Make out-of-order Index files work
jakobnissen Aug 9, 2022
f30555b
More doc fixes
jakobnissen Aug 9, 2022
54765bd
Fix typo in CHANGELOG
jakobnissen Aug 9, 2022
020e384
Improve parsing logic
jakobnissen Aug 9, 2022
e29fdc7
Use one implementation of memcmp
jakobnissen Aug 10, 2022
5c89b58
GC preserve in validate_fastq
jakobnissen Aug 10, 2022
c3b1f3c
Remove Record!, add copy!(::Record, ::Record)
jakobnissen Aug 10, 2022
fb8ba3e
Remove use of BioGenerics.Automa.State
jakobnissen Aug 10, 2022
22fa94a
Use finalizers for Writers
jakobnissen Aug 10, 2022
67f0fdb
Make finalizers async
jakobnissen Aug 10, 2022
145d444
Minor tweaks
jakobnissen Aug 11, 2022
b828fc6
Improve error message on parsing
jakobnissen Aug 11, 2022
c66d035
Add index! function
jakobnissen Aug 13, 2022
3 changes: 1 addition & 2 deletions .github/workflows/Documentation.yml
@@ -21,6 +21,5 @@ jobs:
run: julia --color=yes --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- name: Build and deploy
env:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # For authentication with GitHub Actions token
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
run: julia --color=yes --project=docs/ docs/make.jl
run: julia --color=yes --project=docs/ docs/make.jl
75 changes: 72 additions & 3 deletions CHANGELOG.md
@@ -4,6 +4,77 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [2.0.0]
Version 2 is a near-complete rewrite of FASTX.
It strives to provide an easier and more consistent API, while also being
faster, more memory efficient, and better tested than v1.

The changes are comprehensive, but code should only need a few minor tweaks to
work with v2. I recommend upgrading your packages using a static analysis tool like JET.jl.

### Breaking changes
#### Records
* `description` has changed meaning: In v1, it meant the part of the header after the '>' symbol
and up until first whitespace. Now it extends to the whole header line until the ending newline.
This implies the identifier is a prefix of the description.
* `header` has been removed, and is now replaced by `description`.
* All `Record` objects now have an identifier, a description and a sequence, and all `FASTQRecord`s
have a quality. These may be empty, but will not throw an error when accessing them.
* As a consequence, all "checker" functions like `hassequence`, `isfilled`, `hasdescription` and
so on have been removed, since the answer is now trivially "yes" in all cases.
* `identifier`, `description`, `sequence` and `quality` now return an `AbstractString` by default.
Although it is an implementation detail, they use zero-copy string views for performance.
* You can no longer construct a record using e.g. `Record(::String)`. Instead, use `parse(Record, ::String)`.
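
As a rough illustration of the accessor changes above, the following sketch parses a record from a string and reads its fields. The record content is made up for the example, and `sequence(String, record)` is used to request a concrete `String` instead of relying on the default view type.

```julia
using FASTX

# Records are parsed from strings, not constructed from them.
record = parse(FASTARecord, ">seq1 first sequence\nTAGA\nTA")

id = identifier(record)          # header text up to the first whitespace
desc = description(record)       # the whole header line after '>'
seq = sequence(String, record)   # sequence lines joined, as a String
```

Since the identifier is a prefix of the description, `startswith(desc, id)` should hold for any record.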

#### Readers/writers
* All readers/writers now take every argument other than the main IO as a keyword, for clarity
and consistency.
* FASTQ.Writers will no longer by default modify `FASTQ.Record`s' second header.
An optional keyword forces the writer to always write/skip the second header if set to `true` or `false`,
but it defaults to `nothing`, meaning it leaves it intact.
* FASTQ writers can no longer fill in ambiguous bases in Records transparently,
or otherwise transform records, when writing.
If the user wishes to transform records, they must do so by manually calling a function that transforms the records.
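
The second-header behaviour described above might look like the sketch below. The keyword name `quality_header` is an assumption here (the changelog only says "an optional keyword"), and `written` is a hypothetical helper defined just for this example.

```julia
using FASTX

# A record whose second header (the '+' line) repeats the description.
record = parse(FASTQRecord, "@read1 example\nTAGA\n+read1 example\nIIII")

# Hypothetical helper: write one record to memory and return the text.
function written(record; kwargs...)
    io = IOBuffer()
    writer = FASTQWriter(io; kwargs...)
    write(writer, record)
    flush(writer)              # push buffered output into `io`
    return String(take!(io))
end

kept = written(record)                             # default: header left intact
stripped = written(record; quality_header=false)   # assumed keyword: always skip it
```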

#### Other breaking changes
* `FASTQ.Read` has been removed. To subset a read, extract the sequence and quality, and construct
a new Record object from these.
* `transcribe` has been removed, as it is now trivial to do the same thing.
It may be added in a future release with new functionality.
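
One way to subset a read without `FASTQ.Read`, following the recipe above, is sketched here. The read content and the range `1:4` are invented for the example; the sequence and quality are taken as plain `String`s so that ordinary range indexing applies.

```julia
using FASTX

record = parse(FASTQRecord, "@read1 example\nTAGACCA\n+\nIIIIHHG")

# Extract the first four bases and their quality characters.
part = 1:4
seq = sequence(String, record)[part]
qual = quality(String, record)[part]

# Build a new record from the pieces, reusing the original description.
trimmed = FASTQRecord(String(description(record)), seq, qual)
```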

### New features
* Function `quality_scores` returns the qualities of a FASTQ record as a lazy, validating iterator
of PHRED quality scores.
* New object: `QualityEncoding` can be used to construct custom PHRED/ASCII quality encodings.
By default, accessing quality scores uses an existing default object.
* Readers now have a keyword `copy` that defaults to `true`. If set to `false`, iterating over
a reader will overwrite the same record for performance. Use with care.
This makes the old `while !eof(reader)`-idiom obsolete in favor of iterating over a reader
constructed with `copy=false`.
* Users can now use the following syntax to make processing gzipped readers easier:
```
Reader(GzipDecompressorStream(open(path)); kwargs...) do reader
# stuff
end
```
This is a change in BioGenerics.jl, but is guaranteed to work in FASTX.jl v2.
* FAI (FASTA index) files can now be written as well as read.
* FASTA files can now be indexed with the new function `faidx`.
* Function `extract` can extract parts of a sequence from an indexed FASTA reader
without loading the entire sequence into memory.
You can use this to e.g. extract a small part of a large chromosome. (see #29)
* New functions `validate_fasta` and `validate_fastq` check whether an `IO` is validly
formatted, faster and more memory-efficiently than loading the file.
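
A minimal sketch of two of the features above, validation and iteration with `copy=false`, under the assumption that `validate_fasta` returns `nothing` for well-formed input. The FASTA data is made up for the example.

```julia
using FASTX

data = ">a\nTAG\n>b\nTAGGA\n"

# Scan the IO without materialising records.
# Assumption: `nothing` means the input is valid FASTA.
is_valid = validate_fasta(IOBuffer(data)) === nothing

# With copy=false the reader reuses one record object, so take what is
# needed inside the loop and do not store the record itself.
lengths = Int[]
FASTAReader(IOBuffer(data); copy=false) do reader
    for record in reader
        push!(lengths, length(sequence(record)))
    end
end
```

Storing `record` itself in a collection here would leave every element pointing at the same, repeatedly overwritten object, which is why the changelog says to use this option with care.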

### Other changes
* All practically useful functions and types are now exported directly from FASTX,
so users don't need to prepend identifiers with `FASTA.` or `FASTQ.`.
* FASTA readers are more liberal in what formats they will accept (#73)

### Removed
* The method `FASTA.sequence(::FASTA.Record)` has been removed, since the auto-detection of sequence
type could not be made reliable enough.

## [1.2.0] - 2021-07-13
### Added:
* `header(::Union{FASTA.Record, FASTQ.Record})` returns the full header line.
@@ -18,9 +89,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
* Various small fixes to the internal parsing regex
* Writers are now parametric and buffered for increased writing speed
* Fixed a bug where Windows-style newlines would break the parser

## Unreleased

## [1.1.0] - 2019-08-07
### Added
- `Base.copyto!` methods for copying record data to LongSequences.
22 changes: 10 additions & 12 deletions Project.toml
@@ -1,29 +1,27 @@
name = "FASTX"
uuid = "c2308a5c-f048-11e8-3e8a-31650f418d12"
authors = [
"Sabrina J. Ward <[email protected]>",
"Jakob N. Nissen <[email protected]>"
]
version = "1.3.0"
authors = ["Sabrina J. Ward <[email protected]>", "Jakob N. Nissen <[email protected]>"]
version = "2.0.0"

[deps]
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
BioGenerics = "47718e42-2ac5-11e9-14af-e5595289c2ea"
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb"
StringViews = "354b36f9-a18e-4713-926e-db85100087ba"
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[compat]
Automa = "0.7, 0.8"
BioGenerics = "0.1"
Automa = "0.8"
BioGenerics = "0.1.2"
BioSequences = "3"
BioSymbols = "5"
ScanByte = "0.3"
StringViews = "1"
TranscodingStreams = "0.9.5"
julia = "1.6"

[extras]
FormatSpecimens = "3372ea36-2a1a-11e9-3eb7-996970b6ffbd"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[targets]
test = ["Test", "FormatSpecimens"]
test = ["Random"]
3 changes: 2 additions & 1 deletion docs/Project.toml
@@ -3,4 +3,5 @@ BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

[compat]
Documenter = "~0.22"
BioSequences = "3"
Documenter = "0.27"
27 changes: 14 additions & 13 deletions docs/make.jl
@@ -1,28 +1,29 @@
using Documenter, FASTX

# Build documentation.
DocMeta.setdocmeta!(FASTX, :DocTestSetup, :(using FASTX, BioSequences); recursive=true)

makedocs(
modules = [FASTX],
format = Documenter.HTML(),
modules = [FASTX, FASTX.FASTQ, FASTX.FASTA],
sitename = "FASTX.jl",
doctest = false,
strict = false,
doctest = true,
pages = [
"Home" => "index.md",
"Manual" => [
"FASTA formatted files" => "manual/fasta.md",
"FASTQ formatted files" => "manual/fastq.md"
"Overview" => Any[
"Overview" => "index.md",
"Records" => "records.md",
"File I/O" => "files.md",
],
"Library" => [
"Public" => "lib/public.md"
]
"FASTA" => "fasta.md",
"FASTQ" => "fastq.md",
"FAI" => "fai.md"
],
authors = "Ben J. Ward, The BioJulia Organisation and other contributors."
authors = "Sabrina J. Ward, Jakob N. Nissen, The BioJulia Organisation and other contributors.",
checkdocs = :all
)

deploydocs(
repo = "github.com/BioJulia/FASTX.jl.git",
push_preview = true,
deps = nothing,
make = nothing
)
)
148 changes: 148 additions & 0 deletions docs/src/fai.md
@@ -0,0 +1,148 @@
```@meta
CurrentModule = FASTX
DocTestSetup = quote
using FASTX
end
```

# FASTA index (FAI files)
FASTX.jl supports FASTA index (FAI) files.
When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily.

See the FAI specification here: http://www.htslib.org/doc/faidx.html

### Making an `Index`
A FASTA index (of type `Index`) can be constructed from an `IO` object representing a FAI file:

```jldoctest
julia> io = IOBuffer("seqname\t9\t2\t6\t8");

julia> Index(io) isa Index
true
```

Or from a path representing a FAI file:
```julia
julia> Index("/path/to/file.fai")
```

Alternatively, a FASTA file can be indexed to produce an `Index` using `faidx`.

```jldoctest
julia> faidx(IOBuffer(">abc\nTAGA\nTA"))
Index:
abc 6 5 4 5
```

A FASTA file can also be indexed, with the index immediately written to a FAI file,
by passing an `AbstractString` to `faidx`:

```julia
julia> ispath("/path/to/fasta.fna.fai")
false

julia> faidx("/path/to/fasta.fna");

julia> ispath("/path/to/fasta.fna.fai")
true
```

Note that the restrictions on FASTA files for indexing are stricter than those of FASTX's FASTA parser,
so not all FASTA files that can be read can be indexed:

```jldoctest
julia> str = ">\0\n\0";

julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord
true

julia> Index(IOBuffer(str))
ERROR
[...]
```

### Attaching an `Index` to a `Reader`
When opening a `FASTA.Reader`, you can attach an `Index` by passing the `index` keyword.
You can either pass an `Index` directly, or else an `IO`, in which case an `Index` will be parsed from the `IO`,
or an `AbstractString` that will be interpreted as a path to a FAI file:

```jldoctest
julia> str = ">abc\nTAG\nTA";

julia> idx = faidx(IOBuffer(str));

julia> rdr = FASTAReader(IOBuffer(str), index=idx);
```

You can also add an index to an existing reader using the `index!` function:

```@docs
index!
```

### Seeking using an `Index`
With an `Index` attached to a `Reader`, you can do the following operations in O(1) time.
In these examples, we will use the following FASTA file:

```
>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
```

```@meta
DocTestSetup = quote
using FASTX

data = """>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
"""

reader = FASTA.Reader(IOBuffer(data), index=faidx(IOBuffer(data)))

end
```

* Seek to a Record using its identifier:
```jldoctest
julia> seekrecord(reader, "seq2");

julia> record = first(reader); sequence(record)
"AACGGUUGC"
```

* Directly extract a record using its identifier:
```jldoctest
julia> record = reader["seq1"];

julia> description(record)
"seq1 sequence"
```

* Extract a sequence directly without loading the whole record into memory.
This is useful for huge sequences like chromosomes:
```jldoctest
julia> extract(reader, "seq1", 3:5)
"GAA"
```

```@meta
DocTestSetup = nothing
```

FASTX.jl does not yet support indexing FASTQ files.

### Reference:
```@docs
faidx
seekrecord
extract
Index
```
59 changes: 59 additions & 0 deletions docs/src/fasta.md
@@ -0,0 +1,59 @@
```@meta
CurrentModule = FASTX
DocTestSetup = quote
using FASTX
end
```

# FASTA formatted files
__NB: First read the overview in the sidebar__

FASTA is a text-based file format for representing biological sequences.
A FASTA file stores a list of sequence records with name, description, and
sequence.

The template of a sequence record is:

```
>{description}
{sequence}
```

Here, the "identifier" is the first part of the description, up to the first whitespace
(or the entire description if there is no whitespace).

Here is an example of a chromosomal sequence:

```
>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
```

## The `FASTARecord`
FASTA records are, by design, very lax in what they can contain.
They can contain almost arbitrary byte sequences, including invalid unicode, and trailing whitespace on their sequence lines, which will be interpreted as part of the sequence.
If you want more certainty about the format, you can either check the content of the sequences with a regex, or (preferably) convert them to the desired `BioSequence` type.
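
Both checks might be sketched as follows. The record and the regex alphabet are chosen for illustration only, and the conversion assumes BioSequences.jl is loaded; `LongDNA{4}` is one possible target type among several.

```julia
using FASTX, BioSequences

record = parse(FASTARecord, ">chrI chromosome 1\nCCACACC\nACACCCA")

# Option 1: test the raw text against an alphabet with a regex.
looks_like_dna = occursin(r"^[ACGT]*$", sequence(String, record))

# Option 2: convert to a BioSequence type. This is expected to throw
# if the record contains a symbol that is not valid for the alphabet.
dna = sequence(LongDNA{4}, record)
```

The conversion route has the advantage that the result carries its alphabet in its type, so later code does not need to re-check the content.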

```@docs
FASTA.Record
```

### Reference:
```@docs
identifier
description
sequence
```

## `FASTAReader` and `FASTAWriter`
`FASTAWriter` can optionally be passed the keyword `width` to control the line width.
If this is zero or negative, it will write all record sequences on a single line.
Else, it will wrap lines to the given maximal width.
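
A small sketch of the `width` keyword, writing to an in-memory buffer. The record and the width of 4 are arbitrary choices for the example.

```julia
using FASTX

record = parse(FASTARecord, ">abc\nTAGATA")

io = IOBuffer()
writer = FASTAWriter(io; width=4)   # wrap sequence lines at 4 characters
write(writer, record)
flush(writer)                       # push buffered output into `io`
wrapped = String(take!(io))
```

With this width, the six-base sequence should be split across two lines in `wrapped`; passing `width=0` would instead keep it on one line.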

### Reference:
```@docs
FASTA.Reader
FASTA.Writer
validate_fasta
```