Include period (.) in valid letters #73
Comments
See #75. In general, I'm very conflicted on this. It's trivially easy to broaden the set of characters allowed in FASTA. However, the more we broaden it, the less our users can rely on what is inside the sequence when it is being parsed. I'll leave this up and see what other people have to say.
How about taking an optional alphabet argument that defaults to |
Unfortunately, that is not a feasible solution for two reasons:
I see. Would it be possible to provide a predefined set of alphabets, ranging from more restrictive to permissive, that are all generated at compile time? Alternatively, use the most permissive alphabet as in #75, but add a function to return the actual letters/symbols used by the parsed FASTA file?
Yes, but that would mean a large increase in compile times for the package, a much harder time testing for correctness, and general confusion when users don't know which alphabet to choose.
That's a much more reasonable approach. The most "Julian" way would be to have a function that checked if a record was compatible with an |
Alphabets also exist in BioSequences.jl, which may get confusing if there are two different alphabet types to worry about.
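For illustration, the idea raised above of reporting which symbols a parsed file actually uses could look roughly like the sketch below. It uses only Base Julia; `symbols_used` is a hypothetical helper and not part of FASTX.jl, and the input is assumed to be plain FASTA text with `>` header lines.

```julia
# Hypothetical helper, not part of FASTX.jl: collect the distinct symbols
# that appear in the sequence lines of FASTA-formatted text.
function symbols_used(fasta_text::AbstractString)
    seen = Set{Char}()
    for line in split(fasta_text, '\n')
        # Skip header lines and blank lines.
        (isempty(line) || startswith(line, '>')) && continue
        union!(seen, line)
    end
    return seen
end

# A caller can compare the observed symbols with whatever alphabet it is
# prepared to handle (here an assumed DNA-plus-gap set).
dna_with_gaps = Set("ACGTN-")
observed = symbols_used(">seq1\nACGT-ACGT\n>seq2\nAC.TNACGT\n")
unexpected = setdiff(observed, dna_with_gaps)   # Set(['.'])
```

With something like this, the parser could stay permissive while users who need guarantees check the result before going further.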
If we want to be as strict as possible about not letting people shoot themselves in the foot, every format should probably have its own package built on Automa, so there can be no confusion about what kind of file you are trying to pass. So a2m should maybe be A2M.jl. You could argue that's overkill, but if so, why not just awk the A2M file into a FASTA file? But this issue is related to #75, which I wasn't negative about, so I think the question is this: when do we want to be strict? At parsing, or when it comes to doing something with the record? In #75 I highlight that, should the FASTA parser permit more characters, their validity is still enforced when calling a BioSequences.jl constructor on them, which enforces the Alphabet of that sequence type. So we could just say yes to #75, and then if an a2m file were read, it's on the constructor that takes records and turns them into alignments. There are good arguments for and against FASTX's strictness.
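A minimal sketch of the "permissive at parsing, strict at construction" option, in Base Julia only. `RawRecord` and `StrictDNA` are hypothetical stand-ins invented for this example; they are not FASTX.jl or BioSequences.jl types, and the allowed symbol set is an assumption.

```julia
# Hypothetical types for illustration; these are not FASTX.jl or
# BioSequences.jl definitions.
struct RawRecord
    identifier::String
    sequence::String      # stored as read, with no alphabet check
end

struct StrictDNA
    data::String
    # The inner constructor enforces the alphabet, so an invalid symbol
    # is rejected here rather than at parse time.
    function StrictDNA(s::AbstractString)
        for c in s
            c in "ACGTN-" || throw(ArgumentError("invalid DNA symbol: $(repr(c))"))
        end
        return new(String(s))
    end
end

rec = RawRecord("seq1", "AC.T")   # permissive: the period is kept

# Strictness applies only when the record is turned into a typed sequence.
rejected = try
    StrictDNA(rec.sequence)
    false
catch err
    err isa ArgumentError
end
```

The trade-off is the one described above: the record type makes no promise about its contents, and the error surfaces later, at the point where a typed sequence is requested.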
The period (.) is used in some programs (e.g., SAM and HMMER) to represent gaps in addition to -. Technically the format is a2m (http://compbio.soe.ucsc.edu/a2m-desc.html), but it's sufficiently similar to FASTA that it will be very convenient if FASTX.jl supports the a2m format.
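As a stopgap, the preprocessing workaround mentioned in the discussion (converting the file before handing it to a strict FASTA parser) could be sketched as follows. `periods_to_dashes` is a hypothetical helper, not part of FASTX.jl. It also makes a simplifying assumption: a2m uses `.` and `-` for different kinds of gap, and this conversion discards that distinction.

```julia
# Hypothetical helper, not part of FASTX.jl: rewrite a2m-style text so that
# '.' gap symbols in sequence lines become '-'. Header lines are untouched,
# so periods inside identifiers or descriptions are preserved.
function periods_to_dashes(a2m_text::AbstractString)
    lines = map(split(a2m_text, '\n')) do line
        startswith(line, '>') ? String(line) : replace(line, '.' => '-')
    end
    return join(lines, '\n')
end

converted = periods_to_dashes(">seq.1 example\nAC..GT-a\n")
```

Whether the converted text is then accepted still depends on the parser's rules for the remaining symbols, which this sketch does not address.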