Comparing very similar sequences does not provide all results. #7

Open
KasperThystrup opened this issue Oct 14, 2024 · 1 comment

KasperThystrup commented Oct 14, 2024

First off, thanks for a great tool!

While playing around with some comparisons between genes from the same file, I noticed that blastn behaves differently from expected:
blastn -query genes.fasta -subject genes.fasta -outfmt 6

With two identical genes (cps2B and cps8B) in genes.fasta, this results in the following matches:
cps2B:cps8B (100% id and cov)
cps8B:cps2B (100% id and cov)

This comparison misses the following:
cps2B:cps2B
cps8B:cps8B

Appending a third gene (cps7B) to genes.fasta changes the results entirely:
cps2B:cps2B
cps2B:cps7B (99.76% ID and 100% cov)
cps8B:cps8B
cps7B:cps7B
cps7B:cps2B (99.76% ID and 100% cov)

This comparison misses the following:
cps2B:cps8B
cps8B:cps2B
cps8B:cps7B (should also be 99.76% ID, since cps2B and cps8B are identical)
cps7B:cps8B (should also be 99.76% ID, since cps2B and cps8B are identical)

Is there a way to include all top matches?

Owner

JacobLondon commented Oct 27, 2024

Glad you enjoy the tool! It's been a while since I've looked at my undergrad senior project. A quick disclaimer: I have very little experience in bioinformatics beyond this project, although our professor, Mohamed El-Hadedy Aly, Ph.D., at California Polytechnic University, Pomona, could give a well-informed answer to any questions.

With that out of the way, I looked at the matching method in extend.cpp, where the implementation attempts to find the best match by scoring and extending with the Smith-Waterman algorithm. It could be feasible to change the extend_filter member function from its 'track the best at all times' approach to keeping a std::list of ExtendedSequenceMap objects that is then sorted by score and returned, providing the top matches.
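To make the idea concrete, here is a minimal sketch of the "keep a list, sort by score, return the top matches" approach. It is not the code in extend.cpp: `Candidate` is a hypothetical stand-in for ExtendedSequenceMap (whose members aren't shown in this thread), and `top_matches` and its `max_hits` parameter are names made up for illustration.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Hypothetical stand-in for ExtendedSequenceMap; the real class's members
// are not shown in this thread, so these fields are assumptions.
struct Candidate {
    std::string query;
    std::string subject;
    int score;  // assumed to be the Smith-Waterman extension score
};

// Hypothetical helper: rather than tracking only the single best candidate,
// take all candidates, order them by descending score, and keep at most
// max_hits of them.
std::list<Candidate> top_matches(std::list<Candidate> candidates,
                                 std::size_t max_hits) {
    // std::list::sort is stable, so candidates with equal scores
    // (e.g. identical genes) keep their original relative order.
    candidates.sort([](const Candidate& a, const Candidate& b) {
        return a.score > b.score;
    });
    // Drop everything past the requested number of hits.
    if (candidates.size() > max_hits) {
        candidates.resize(max_hits);
    }
    return candidates;
}
```

With something like this, identical genes such as cps2B and cps8B would tie on score and both be returned, rather than one displacing the other. How the candidates get collected inside extend_filter would depend on the existing code.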

Unfortunately, I don't have the time to maintain this project, but if you are so inclined, send me a pull request with an implementation and I might be able to provide that as an alternate approach!
