Problem with duplicates = "closest" in closest #72
Confirmed. I created some unit tests for this. Currently I am very busy and it will take some time to fix it. I am always surprised how you find such bugs 👍
Thanks @sgibb for looking into this - and sure, you have more urgent things to do at present, so take your time. I stumbled across these while analyzing real-world data, which is never perfect and, as in this case, leads to such strange mappings.
We could circumvent the described problem by filtering the duplicated values before running `closest`:

```r
closest2 <- function(x, y) {
  xd <- duplicated(x)
  yd <- duplicated(y)
  i <- closest(x[!xd], y[!yd], dup = "closest", tol = 0)
  out <- rep.int(NA_integer_, length(x))
  out[!xd] <- i + cumsum(yd)[i]
  out
}
```
(Because our input arguments are sorted we could tweak this a little bit in C and just compare the current element to its predecessor instead of the whole vector.)

The function `C_closest_dup_closest` is already very difficult to understand. Removing the duplicates in the loop would only slightly decrease the runtime and would decrease the already bad readability of the function even further. And I am not sure that we really tackle all corner cases with this solution.

Finding the closest pair is relatively easy to do fast and reliably. The problem arises if we have to reassign a mapping because we found a closer match (because that could be recursive). An easy way would be to find the closest match, remove it from the set and repeat until nothing is left. While easy to implement (and it would probably produce correct solutions even in the described cases), it would be very slow because we need to compare every element again and again.

I can't imagine that we are the first people in the world who face the problem of finding the closest pairs in two sorted one-dimensional arrays. IMHO most of the knn algorithms for n-dimensional spaces are too complex for our problem (and they allow many-to-one assignments, but we want to allow only one-to-one assignments). Does anybody have a suggestion for an algorithm that we could implement here?
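The "find the closest match, remove it, repeat" idea above can be sketched in plain R. This is a hypothetical reference implementation, not part of MsCoreUtils, and it is deliberately quadratic, illustrating exactly why this approach would be too slow:

```r
## Hypothetical greedy reference implementation: repeatedly take the
## globally closest (x, y) pair, record the assignment, and remove both
## elements from further consideration.  O(n^2) per iteration.
closest_greedy <- function(x, y, tol = Inf) {
  out <- rep.int(NA_integer_, length(x))
  free_x <- seq_along(x)
  free_y <- seq_along(y)
  while (length(free_x) && length(free_y)) {
    d <- abs(outer(x[free_x], y[free_y], "-"))
    k <- arrayInd(which.min(d), dim(d))   # row/col of the minimum
    if (d[k] > tol) break                 # all remaining pairs too far apart
    out[free_x[k[1]]] <- free_y[k[2]]
    free_x <- free_x[-k[1]]
    free_y <- free_y[-k[2]]
  }
  out
}

closest_greedy(c(1.1, 2.2), c(1.0, 2.0))
```

Because every iteration rebuilds the full distance matrix, this is only useful as a correctness oracle for unit tests, not as a replacement for the C implementation.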
Thanks for your comments @sgibb . Maybe an intermediate solution would be what you suggested, i.e. to remove the duplicated values beforehand, but only if this would not increase the runtime considerably. In the end, having duplicated m/z values in a spectrum would IMHO be wrong anyway, because an MS instrument could never generate such data.
Doing this in R already doubles the runtime on my machine.
In theory it should not produce such data. But you found them, didn't you? 😄
Yes, I found these in some "reference spectra" from the Human Metabolome Database. No idea how they got there, but they are likely rounding errors. Even so, I'm not aware of any MS instrument at present with a precision so high that it could measure peaks whose difference in m/z would be affected by such a rounding (64 to 32 bit). I believe there might be some "human mistake" in there...
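As a hypothetical illustration of how rounding can introduce such duplicates (the values below are made up, not taken from HMDB):

```r
## Two m/z values that differ only beyond the retained precision
## collapse into duplicates once rounded, e.g. when a pipeline stores
## them with fewer significant digits than the instrument produced.
mz <- c(123.4567891, 123.4567892)
any(duplicated(mz))            # FALSE: distinct as doubles
any(duplicated(round(mz, 6)))  # TRUE: rounding made them equal
```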
I would suggest we keep this issue open for the record/reference but don't fix the problem.
Hello, I noticed that there is still a strange problem with …
JFTR: I asked for a better algorithm on StackOverflow: https://stackoverflow.com/questions/70486603/closest-pairs-of-elements-in-two-sorted-arrays

Currently I think a good algorithm would be:
I assume this algorithm will solve all the problems mentioned above. Nevertheless I think the runtime will increase.
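One exact (if more expensive) formulation, independent of whichever algorithm was actually proposed: treat the task as an order-preserving one-to-one assignment and minimise the total distance with dynamic programming. This is a hypothetical sketch, not MsCoreUtils code, and for simplicity it matches every element of `x` and ignores the tolerance:

```r
## Hypothetical DP sketch: order-preserving one-to-one matching of the
## sorted vector x against the sorted vector y, minimising the total
## absolute distance.  Requires length(y) >= length(x).  O(n * m).
closest_dp <- function(x, y) {
  n <- length(x); m <- length(y)
  stopifnot(m >= n)
  cost <- matrix(Inf, n + 1L, m + 1L)
  cost[1L, ] <- 0                         # matching zero x elements is free
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost[i + 1L, j + 1L] <- min(
        cost[i, j] + abs(x[i] - y[j]),    # match x[i] to y[j]
        cost[i + 1L, j]                   # leave y[j] unmatched
      )
    }
  }
  ## backtrack to recover the assignment
  out <- integer(n)
  i <- n; j <- m
  while (i > 0L) {
    if (cost[i + 1L, j + 1L] == cost[i, j] + abs(x[i] - y[j])) {
      out[i] <- j; i <- i - 1L
    }
    j <- j - 1L
  }
  out
}

closest_dp(c(1, 2), c(0.9, 1.0, 2.1))
```

Because it optimises the global sum of distances it never needs the recursive reassignment step, but its quadratic cost matrix makes it impractical for large spectra without further pruning.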
My C-fu is really bad - do you think you might find the time to implement that version @sgibb?
@jorainer I am not sure that the time to do this is well invested. While the above-mentioned algorithm will IMHO find the correct solutions, the current one does as well, except for a few edge cases (fewer than 3 elements in …).
OK, then let's keep the "wontfix" label.
There is a strange problem in the `closest` function when a query m/z matches more than one target m/z.

What works fine: …

But with 3 matches for the first m/z the second will no longer be found: …

So, for whatever odd reasons the match for the second m/z is dropped. Note that without `duplicates = "closest"` it still works.

My session info: …