[RFC]: Broaden characters allowed in FASTA sequences #75

Closed
wants to merge 2 commits into from

jakobnissen
Member

This PR allows all printable ASCII characters in FASTA sequences, that is, all bytes in the range '!':'~' except '>'. As before, horizontal whitespace (\t, \v and space) is allowed inside sequences but is not considered part of the sequence.

I think this character set is the broadest possible set that is practically parseable. Expanding it further would mean allowing non-printable characters, which would be a complete mess, or Unicode, which would be another complete mess.
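
To make the proposed rule concrete, here is a minimal Python sketch of the byte classification described above. It is only an illustration: FASTX.jl is written in Julia, and the names `SEQUENCE_BYTES`, `IGNORED_BYTES` and `clean_sequence_line` are hypothetical, not part of this PR.

```python
# Illustrative sketch only; names are hypothetical, not FASTX.jl API.

# Bytes that count as sequence data: '!' (0x21) through '~' (0x7E), except '>'.
SEQUENCE_BYTES = frozenset(range(0x21, 0x7F)) - {ord(">")}

# Whitespace tolerated inside a sequence line but not kept in the sequence.
IGNORED_BYTES = frozenset(b" \t\v")


def clean_sequence_line(line: bytes) -> bytes:
    """Drop ignored whitespace and reject any byte outside the allowed set."""
    out = bytearray()
    for byte in line:
        if byte in IGNORED_BYTES:
            continue
        if byte not in SEQUENCE_BYTES:
            raise ValueError(f"byte 0x{byte:02x} is not allowed in a FASTA sequence")
        out.append(byte)
    return bytes(out)
```

Under this rule, `clean_sequence_line(b"AC GT\t*-")` keeps the six non-whitespace bytes, while a line containing `>` or a control byte raises an error.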

This PR is meant just to toss the idea out there, for debate. I have no strong intuition it is actually a good idea.

Sequences can now contain all printable ASCII characters ('!':'~'), optionally
with horizontal whitespace (tab, space, \v and newline).
@codecov

codecov bot commented Feb 27, 2022

Codecov Report

Merging #75 (9529d82) into master (5e7efd6) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #75   +/-   ##
=======================================
  Coverage   84.39%   84.39%           
=======================================
  Files          12       12           
  Lines         660      660           
=======================================
  Hits          557      557           
  Misses        103      103           
| Flag | Coverage Δ |
|---|---|
| unittests | 84.39% <ø> (ø) |

| Impacted Files | Coverage Δ |
|---|---|
| src/fasta/readrecord.jl | 96.42% <ø> (ø) |
| src/fasta/reader.jl | 89.85% <0.00%> (ø) |
| src/fasta/writer.jl | 96.29% <0.00%> (ø) |
| src/fastq/reader.jl | 89.36% <0.00%> (ø) |
| src/fastq/writer.jl | 96.77% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TransGirlCodes
Member

TransGirlCodes commented May 4, 2022

So, I'm actually not all that against this, because correctness can still be enforced downstream, e.g. via LongDNA{2}(record) or the sequence method. Characters not permitted by an alphabet can still be caught there, while the parser itself would accommodate tools or people that use their own screwy files based loosely on FASTA.
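
The split between permissive parsing and strict conversion can be sketched as follows. This is a Python illustration of the idea only; `DNA_2BIT` and `as_two_bit_dna` are hypothetical names, and the function merely mimics the kind of check an alphabet-restricted sequence type performs, not the BioSequences.jl implementation.

```python
# Illustrative sketch only; names are hypothetical, not BioSequences.jl API.

# A four-symbol DNA alphabet, assumed here to be A, C, G and T.
DNA_2BIT = frozenset("ACGT")


def as_two_bit_dna(sequence: str) -> str:
    """Return the upper-cased sequence, or raise if a symbol is outside the alphabet."""
    for position, char in enumerate(sequence):
        if char.upper() not in DNA_2BIT:
            raise ValueError(
                f"symbol {char!r} at position {position} is not in the DNA alphabet"
            )
    return sequence.upper()
```

A record whose sequence text is `"acgt"` converts cleanly, whereas text such as `"30 34 1"` is accepted by a permissive reader but rejected at this conversion step.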

A concrete example of this is the output of a tool called KAT Sect, which writes k-mer counts along a sequence in a FASTA-like format, i.e.

>seqA
30 34 1 38 44

With this PR, one could parse the file and then decide what to do with the sequence section, e.g. parse the string into a vector of ints. Of course, a dedicated KAT Sect parser would be better... and hopefully the sect analysis will be doable from Kmers.jl anyway, without relying on KAT.
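
As a rough sketch of that second step, the Python function below turns FASTA-like count records into lists of integers. `parse_count_records` is a hypothetical helper, not part of FASTX.jl or KAT, and it assumes the spaces between counts are still present in the text; a reader that discards horizontal whitespace inside sequences would need another way to recover the separators.

```python
# Illustrative sketch only; parse_count_records is a hypothetical helper.
# Assumption: the sequence text still contains the spaces between counts.


def parse_count_records(text: str) -> dict:
    """Map each record name to the list of integer counts that follow its header."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("count data found before the first '>' header")
        else:
            # int() raises ValueError on any token that is not an integer.
            records[name].extend(int(token) for token in line.split())
    return records
```

For the example above, `parse_count_records(">seqA\n30 34 1 38 44\n")` yields one entry named `seqA` holding the five counts.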

@jakobnissen
Member Author

Superseded by #68

@jakobnissen jakobnissen closed this Aug 9, 2022