Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include period (.) in valid letters #73

Closed
suyang-nju opened this issue Feb 25, 2022 · 7 comments · Fixed by #68
Closed

Include period (.) in valid letters #73

suyang-nju opened this issue Feb 25, 2022 · 7 comments · Fixed by #68

Comments

@suyang-nju
Copy link

The period (.) is used in some programs (e.g., SAM and HMMER) to represent gaps in addition to -. Technically the format is a2m (http://compbio.soe.ucsc.edu/a2m-desc.html), but it's sufficiently similar to FASTA that it will be very convenient if FASTX.jl supports the a2m format.

@jakobnissen
Copy link
Member

See #75 . In general, I'm very conflicted on this. It's trivially easy to broaden the set of characters allowed in FASTA. However, the more we broaden it, the less our users can rely on what is inside the sequence when it is being parsed.

I'll leave this up and see what other people have to say.

@suyang-nju
Copy link
Author

How about taking an optional alphabet argument that defaults to [A-Za-z\-]? This way a user can explicitly set the alphabet and knows what to expect in the sequences.

@jakobnissen
Copy link
Member

Unfortunately, that is not a feasible solution for two reasons:

  • First, what if the user specifies an alphabet that contains >?
  • More importantly, the function that FASTX uses to parse FASTA files is generated at compile time, so it is not possible to change the parser at runtime like that

@suyang-nju
Copy link
Author

I see. Would it be possible to provide a predefined set of alphabets, ranging from more restrictive to permissive, that are all generated at compile time? Alternatively, use the most permissive alphabet as in #75, but add a function to return the actual letters/symbols used by the parsed FASTA file?

@jakobnissen
Copy link
Member

Would it be possible to provide a predefined set of alphabets, ranging from more restrictive to permissive, that are all generated at compile time

Yes, but that would mean a large increase in compile times for the package, a much harder time testing for correctness, and general confusion when users don't know which alphabet to choose.

Alternatively, use the most permissive alphabet as in #75, but add a function to return the actual letters/symbols used by the parsed FASTA file?

That's a much more reasonable approach. The most "Julian" way would be to have a function that checked if a record was compatible with an Alphabet. Though I'm not quite clear how that would work internally.

@TransGirlCodes
Copy link
Member

Alphabets also exist in BioSequences.jl, which may get confusing if there's two different alphabet types to worry about.

@TransGirlCodes
Copy link
Member

TransGirlCodes commented May 27, 2022

If we want to be as strict as possible with not letting people shoot themselves in the foot, every format ought to probably have its own package built on Automa. So there can be no confusion about what kind of file it is you are trying to pass. So a2m should maybe be A2M.jl. You could argue it's overkill, but if that's so - just awk A2M into a FASTA file?

But this issue is related to #75, which I wasn't negative about so I think the question is this - when do we want to be strict? At parsing, or when it comes to doing something with the record? In #75 I highlight that should the FASTA parser permit more characters, their validity is still enforced when calling a BioSequences.jl constructor on them, which does enforce the Alphabet of that sequence type. So we could just say yes to #75, and then if an a2m file were to be read, its on the constructor that would take records and turn them into alignments.

There are good arguments for and against FASTX's strictness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants