Rewrite for v2 #68
Codecov Report
@@ Coverage Diff @@
## master #68 +/- ##
==========================================
+ Coverage 84.39% 90.28% +5.89%
==========================================
Files 12 11 -1
Lines 660 628 -32
==========================================
+ Hits 557 567 +10
+ Misses 103 61 -42
After a little thinking, I might
I think empty identifiers are technically valid, eg
But I don't know if I would think of this as A FASTA record without a sequence doesn't make any sense to me.
Overall I think this is neater, and I'm liking the StringViews.
I'm not sure about enforcing the presence of identifiers or sequences. But once a decision is made, it makes sense to remove checks for fields always present.
Either way, the end-user would still need to check and decide how to handle the information. I don't think reinterpreting the
So, reading some webpages e.g. I think we have an answer to this question regarding using missing or empty string returned by These pages seem to describe the entire '>' line as a [description|definition] line, i.e. identifier + any additional info.

So, what if we take that stance? We include So, for any record with just an ID and no additional info, will give you Any record with additional content after the identifier, then

Then, because so many platforms have their own way of dealing with extra info (e.g. NCBI has the whole "[tag=value]" thing), we simply take the position "use description() to get the whole description line, parse it how you will, ya on ya own buddy." Thus identifier becomes a subset of the description, and the behaviour of the two is consistent.
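As a rough illustration of the proposal in this comment (the identifier being the leading part of the description), here is a minimal sketch in plain Julia. `header_parts` is a hypothetical helper, not FASTX's implementation, and the rule that the identifier ends at the first whitespace is an assumption.

```julia
# Hypothetical helper for illustration only; not part of FASTX.
# Assumption: the identifier is everything before the first whitespace.
function header_parts(line::AbstractString)
    startswith(line, '>') || throw(ArgumentError("a FASTA header must start with '>'"))
    description = SubString(line, 2)          # the whole line after '>'
    stop = findfirst(isspace, description)    # index of the first whitespace, or nothing
    identifier = stop === nothing ? description :
        SubString(description, 1, prevind(description, stop))
    return (identifier, description)
end

header_parts(">seq1")               # identifier and description are both "seq1"
header_parts(">seq1 Homo sapiens")  # identifier "seq1", description "seq1 Homo sapiens"
```

Under this reading, a record with no extra info yields the same text from both accessors, and any platform-specific tags are left for the caller to parse out of the description.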
@jakobnissen How do you feel about this proposal of doing away with
That's a good idea. I like it. I'll implement the changes this week
@SabrinaJaye @kescobo and other interested parties: This is now ready for review/test. There is too much code to review, but you can play around with it and see if you like how it feels, and if you approve of the changes described in the OP here. I recommend reading the new, updated documentation. What is still missing is just nice-to-haves, which can always be added later. The only things left to do before tagging v2 are code coverage (I will take care of that) and any ideas @SabrinaJaye has for high-level operations. During the next week, I will finish up the last remaining tests, then in 1-2 weeks, I will squash merge this to master unless you have any comments, and then release FASTX v2.
I think if you add
@kescobo I tried to add previews, but apparently it's failing? :/ I can't figure out why. I added a new documenter key, but the build job claims it's not there or it's empty. Maybe it's acceptable that it doesn't work for PRs, I can look at it after pushing this to master.
* Bump BioSequences/BioSymbols to v3/v5 * Bump Julia version
Currently, `header`, `identifier` and `description` return `String`, which forces needless allocations. This PR adds the dependency `StringViews`, which allows the creation of an `AbstractString` from any `AbstractVector{UInt8}`. The aforementioned functions now return these string views backed by a view into the data buffer.
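To make the idea in this commit message concrete, here is a small sketch of wrapping part of a byte buffer as a string without copying. It assumes the `StringViews` package is installed; the buffer contents and variable names are made up for illustration and do not reflect how FASTX lays out its records.

```julia
using StringViews  # the dependency named above; must be installed

# A stand-in for a reader's data buffer (contents are made up).
buffer = Vector{UInt8}(">seq1 example\nACGT\n")

# Wrap bytes 2:5 as an AbstractString without copying them into a new String.
id = StringView(view(buffer, 2:5))
id == "seq1"            # true
id isa AbstractString   # true

# The view aliases the buffer, so later writes to the buffer show through.
buffer[2] = UInt8('S')
id == "Seq1"            # true

# Take an independent copy when the text must outlive the buffer's contents.
owned = String(id)
```

The trade-off is the usual one for views: no allocation on access, but the caller has to copy if the underlying buffer may be overwritten later.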
Rename FASTQ.FASTQRead to FASTQ.Read
04431b4 to 54765bd
Seems fine, I can try too. I'll build docs locally for now
4befa1f to e29fdc7
Why a breaking change?
Essentially, #63 is unsolvable without making a breaking change.
I figured, if we were to break the API anyway, there were several areas where FASTX could be made nicer.
Important changes

External
* `BioGenerics` methods have been removed, except the ones used for the readers/writers.
* [...] `Record` from a string. Instead, use `parse(Record, str)`.
* `quality_scores` returns the qualities as a lazy, validating iterator of scores using a default QualityEncoding object to decode ASCII PHRED scores to quality scores.
* [...] `copy`, which defaults to `true`. If `false`, the reader will overwrite the same record on iteration. This makes the old `while !eof(reader)` idiom obsolete in favor of iterating over `Reader(io; copy=false)`.
* `transcribe` has been removed, as it is now trivial to do the same thing.
* `faidx` function.
* `extract` can now extract parts of sequences from indexed FASTA files without loading a whole record. E.g. if you have a whole chromosome, you can load just a few basepairs without loading the entire chromosome (see "so slowly of extract sequence by coord", #29).
* `validate_fasta` and `validate_fastq` to quickly and memory-efficiently check if a file is FASTX-formatted.

Internal
closes #77
closes #73
closes #37
closes #63
closes #29
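
To illustrate the `quality_scores` item under External, here is a minimal sketch of a lazy, validating PHRED decoder in plain Julia. The names `PHRED_OFFSET`, `MAX_SCORE`, `decode_quality` and `lazy_scores` are hypothetical, and the offset of 33 with a score range of 0 to 93 is an assumption (the common Sanger-style encoding); this is not FASTX's `QualityEncoding` implementation.

```julia
# Hypothetical stand-in for a quality encoding; not FASTX's API.
const PHRED_OFFSET = 33   # assumed ASCII offset (Sanger-style encoding)
const MAX_SCORE = 93      # highest score representable in printable ASCII ('~')

# Decode one ASCII byte to a score, rejecting bytes outside the valid range.
function decode_quality(byte::UInt8)
    score = Int(byte) - PHRED_OFFSET
    0 <= score <= MAX_SCORE || throw(ArgumentError("invalid quality byte: $byte"))
    return score
end

# A generator is lazy: nothing is decoded or validated until it is iterated.
lazy_scores(quality::AbstractString) = (decode_quality(b) for b in codeunits(quality))

collect(lazy_scores("!I~"))   # [0, 40, 93]
```

Because validation happens during iteration, a malformed quality string only raises an error once the offending byte is actually reached.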