Include period (.) in valid letters #73

suyang-nju · 2022-02-25T14:10:07Z

The period (.) is used in some programs (e.g., SAM and HMMER) to represent gaps in addition to -. Technically the format is a2m (http://compbio.soe.ucsc.edu/a2m-desc.html), but it's sufficiently similar to FASTA that it will be very convenient if FASTX.jl supports the a2m format.

jakobnissen · 2022-02-27T15:27:15Z

See #75 . In general, I'm very conflicted on this. It's trivially easy to broaden the set of characters allowed in FASTA. However, the more we broaden it, the less our users can rely on what is inside the sequence when it is being parsed.

I'll leave this up and see what other people have to say.

suyang-nju · 2022-03-02T17:15:05Z

How about taking an optional alphabet argument that defaults to [A-Za-z\-]? This way a user can explicitly set the alphabet and knows what to expect in the sequences.

jakobnissen · 2022-03-02T17:30:45Z

Unfortunately, that is not a feasible solution for two reasons:

First, what if the user specifies an alphabet that contains >?
More importantly, the function that FASTX uses to parse FASTA files is generated at compile time, so it is not possible to change the parser at runtime like that

suyang-nju · 2022-03-03T15:47:08Z

I see. Would it be possible to provide a predefined set of alphabets, ranging from more restrictive to permissive, that are all generated at compile time? Alternatively, use the most permissive alphabet as in #75, but add a function to return the actual letters/symbols used by the parsed FASTA file?

jakobnissen · 2022-05-05T13:50:11Z

Would it be possible to provide a predefined set of alphabets, ranging from more restrictive to permissive, that are all generated at compile time

Yes, but that would mean a large increase in compile times for the package, a much harder time testing for correctness, and general confusion when users don't know which alphabet to choose.

Alternatively, use the most permissive alphabet as in #75, but add a function to return the actual letters/symbols used by the parsed FASTA file?

That's a much more reasonable approach. The most "Julian" way would be to have a function that checked if a record was compatible with an Alphabet. Though I'm not quite clear how that would work internally.

TransGirlCodes · 2022-05-27T12:48:18Z

Alphabets also exist in BioSequences.jl, which may get confusing if there's two different alphabet types to worry about.

TransGirlCodes · 2022-05-27T12:57:45Z

If we want to be as strict as possible with not letting people shoot themselves in the foot, every format ought to probably have its own package built on Automa. So there can be no confusion about what kind of file it is you are trying to pass. So a2m should maybe be A2M.jl. You could argue it's overkill, but if that's so - just awk A2M into a FASTA file?

But this issue is related to #75, which I wasn't negative about so I think the question is this - when do we want to be strict? At parsing, or when it comes to doing something with the record? In #75 I highlight that should the FASTA parser permit more characters, their validity is still enforced when calling a BioSequences.jl constructor on them, which does enforce the Alphabet of that sequence type. So we could just say yes to #75, and then if an a2m file were to be read, its on the constructor that would take records and turn them into alignments.

There are good arguments for and against FASTX's strictness.

jakobnissen mentioned this issue Jul 28, 2022

Rewrite for v2 #68

Merged

jakobnissen closed this as completed in #68 Aug 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include period (.) in valid letters #73

Include period (.) in valid letters #73

suyang-nju commented Feb 25, 2022

jakobnissen commented Feb 27, 2022

suyang-nju commented Mar 2, 2022

jakobnissen commented Mar 2, 2022

suyang-nju commented Mar 3, 2022

jakobnissen commented May 5, 2022

TransGirlCodes commented May 27, 2022

TransGirlCodes commented May 27, 2022 •

edited

Loading

Include period (.) in valid letters #73

Include period (.) in valid letters #73

Comments

suyang-nju commented Feb 25, 2022

jakobnissen commented Feb 27, 2022

suyang-nju commented Mar 2, 2022

jakobnissen commented Mar 2, 2022

suyang-nju commented Mar 3, 2022

jakobnissen commented May 5, 2022

TransGirlCodes commented May 27, 2022

TransGirlCodes commented May 27, 2022 • edited Loading

TransGirlCodes commented May 27, 2022 •

edited

Loading