Proposal: Allow zero-length kmers #17

Closed
jakobnissen opened this issue Jun 10, 2022 · 5 comments · Fixed by #35

Comments

@jakobnissen
Member

Currently, one can't make 0-length kmers:

julia> DNAKmer(dna"")
ERROR: ArgumentError: Bad kmer parameterisation. K must be greater than 0.
Stacktrace:
 [1] checkmer(#unused#::Type{DNAKmer{0, 0}})
   @ Kmers ~/.julia/packages/Kmers/7SNBQ/src/kmer.jl:414

I'm not sure I get the rationale for that. Sure, length 0 kmers are a little weird, but in general, containers in Julia can be length 0. That is, we have length 0 LongSequence, LongSubSeq, Vector, Set, Tuple, etc.
I think it would be nicer to just allow it.
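
As a quick illustration of the point about empty containers, here is a minimal sketch using only Base types. LongSequence and LongSubSeq are left out so the snippet needs no packages; the variable name `empties` is just for illustration.

```julia
# Empty instances of common Base containers; each one has length 0.
empties = (Int[], Set{Int}(), (), "", Dict{String,Int}())

for c in empties
    println(typeof(c), " => length ", length(c), ", isempty: ", isempty(c))
end
```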

@kescobo
Member

kescobo commented Jun 10, 2022

Hmm - but I think of a Kmer as closer to Char than String, and you can't have empty Char. How would you iterate a sequence with each 0-length kmer, for example?

@TransGirlCodes
Member

I guess it depends on whether you consider Kmers as just LongSequences, i.e. ordered containers of BioSymbols, optimised for a specific purpose, or as almost more of a BioSymbol itself. Indeed, you can consider a LongSequence as a container of kmers as well as of nucleotides. I think of them kinda as both, to be honest. As for iterating over 0-length kmers, I wonder if we can take inspiration from the Julia ecosystem: patterns of iterating over substrings, views of an array, or similar?

@kescobo
Member

kescobo commented Jun 11, 2022

> Or almost more of a BioSymbol itself, indeed you can consider a LongSequence as a container of kmers, as well as nucleotides. I think of them kinda as both,

Actually, maybe the best comparison with the string ecosystem is a regular expression rather than a Char.

julia> st = "hello banana"
"hello banana"

julia> findall(r"ba", st)
1-element Vector{UnitRange{Int64}}:
 7:8

julia> findall(r"", st)
13-element Vector{UnitRange{Int64}}:
 1:0
 2:1
 3:2
 4:3
 5:4
 6:5
 7:6
 8:7
 9:8
 10:9
 11:10
 12:11
 13:12

I suppose this would argue in favor of 0-mers 🤷, but I don't really like it. I would have thought r"" would throw an error...
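
One way to read the `findall(r"", st)` output above: a sequence of length n has n - k + 1 windows of length k, so k = 0 gives n + 1 empty windows. The sketch below only illustrates that counting argument. `window_ranges` is a hypothetical helper, not part of Kmers.jl or BioSequences.jl, and says nothing about how a kmer iterator is or should be implemented.

```julia
# Hypothetical helper: index ranges of every length-k window in a
# sequence of length n. There are n - k + 1 such windows.
window_ranges(n::Integer, k::Integer) = [i:(i + k - 1) for i in 1:(n - k + 1)]

st = "hello banana"          # ASCII, so character and byte indices agree
n = length(st)               # 12

two = window_ranges(n, 2)    # 11 windows: 1:2, 2:3, ..., 11:12
zero = window_ranges(n, 0)   # 13 empty windows: 1:0, 2:1, ..., 13:12

println(length(two), " windows of length 2")
println(length(zero), " windows of length 0")
```

Under this reading, iterating 0-length kmers over a sequence is well defined: it yields one empty kmer per position, plus one past the end.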

@jakobnissen
Member Author

jakobnissen commented Jun 13, 2022

I definitely see Kmer as "just another BioSequence", and I see the characteristics of Kmers, namely their immutability and fixed length, as essentially implementation details. I.e. if it were possible to produce just as efficient code using LongSequence, I don't know why I would ever use Kmer.

My analogy is that BioSequence is like AbstractVector, LongSequence is like Vector, LongSubSeq is like SubArray{T, 1}, and Kmer is like StaticVector. Not literally, of course: as we decided, BioSequence is not actually an AbstractVector.

Or, if you will, it corresponds to AbstractString, String, SubString, and InlineString, respectively.

That is, the different sequence types differ only in computer-sciency implementation details, like whether they are stack-allocated or not. IMO they should not be "biologically" different and, when possible, they should behave identically to each other, so that one can write generic code that takes a BioSequence and then plug in whatever subtype you want.
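
To make the "generic code" point concrete, here is a minimal sketch using the AbstractString analogy from above rather than BioSequence, so that it runs with Base alone. `count_gc` is a hypothetical function made up for this illustration.

```julia
# Hypothetical generic function: written once against the abstract type,
# so any concrete subtype can be plugged in.
count_gc(seq::AbstractString) = count(c -> c in ('G', 'C'), seq)

s = "ACGTGC"
println(count_gc(s))                    # called on a String
println(count_gc(SubString(s, 2, 4)))   # called on a SubString view of "CGT"
println(count_gc(""))                   # the empty case needs no special handling
```

The empty input goes through the same code path as any other, which is the kind of uniformity a 0-length kmer would preserve for generic BioSequence code.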

If kmers were Char-like, in my opinion, that would mean they were "atomic" primitive values, i.e. they did not contain elements (other than themselves, possibly).

@kescobo
Member

kescobo commented Jun 13, 2022

I definitely disagree with myself from 3 days ago about the Char thing.

StaticVector is a good analogy too, I suppose. In any case, there are enough analogies that implement the empty form that I think we should probably allow 0mers for consistency, even if it makes me grumpy 🤷
