Problem with duplicates = "closest" in closest #72

Open
jorainer opened this issue Oct 15, 2020 · 12 comments

Labels: bug (Something isn't working) · wontfix (This will not be worked on)

jorainer (Member) commented Oct 15, 2020

There is a strange problem in the closest function when a query m/z matches more than one target m/z:

What works fine:

mzs <- c(175.119091510163, 349.231397186144)
mz_ref <- c(rep(175.119, 2), rep(349.2306, 2))
mz_ref
[1] 175.1190 175.1190 349.2306 349.2306
MsCoreUtils::closest(mzs, mz_ref, duplicates = "closest", tolerance = 0, ppm = 40)
[1] 1 3

But with three duplicates of the first target m/z, the second query m/z is no longer matched:

mz_ref <- c(rep(175.119, 3), rep(349.2306, 2))
mz_ref
[1] 175.1190 175.1190 175.1190 349.2306 349.2306
MsCoreUtils::closest(mzs, mz_ref, duplicates = "closest", tolerance = 0, ppm = 40)
[1]  1 NA

So, for whatever odd reason, the match for the second m/z is dropped. Note that without duplicates = "closest" it still works.
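
For reference, at ppm = 40 the matching window around the first query m/z is about 0.007, which easily covers its difference to the reference value (just illustrating the arithmetic):

175.119091510163 * 40 / 1e6
[1] 0.007004764
abs(175.119091510163 - 175.119)
[1] 9.151016e-05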

My session info:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MsCoreUtils_1.1.8   BiocManager_1.30.10

loaded via a namespace (and not attached):
[1] MASS_7.3-53         compiler_4.0.2      parallel_4.0.2     
[4] S4Vectors_0.27.14   BiocGenerics_0.35.4 stats4_4.0.2       
jorainer added the bug label on Oct 15, 2020
sgibb (Member) commented Oct 21, 2020

Confirmed. I created some unit tests for this. Currently I am very busy and it will take some time to fix it.

I am always surprised by how you find such bugs 👍

jorainer (Member, Author) commented Oct 21, 2020

Thanks @sgibb for looking into this - and sure, you have more urgent things to do at present, so take your time.

I stumble across these while analyzing real-world data, which is never perfect and, as in this case, leads to such strange mappings.

sgibb (Member) commented Mar 19, 2021

We could circumvent the described problem by filtering the duplicated values before running closest:

closest2 <- function(x, y, ...) {
    ## match only the unique values; pass ppm etc. through ...
    xd <- duplicated(x)
    yd <- duplicated(y)
    i <- closest(x[!xd], y[!yd], duplicates = "closest", tolerance = 0, ...)
    out <- rep.int(NA_integer_, length(x))
    ## map the matches back to positions in the full (non-deduplicated) y
    out[!xd] <- which(!yd)[i]
    out
}

(Because our input arguments are sorted, we could tweak this a little in C and just compare each element to its predecessor instead of checking the whole vector.)
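
Applied to the failing example from the report (with ppm passed through ...), this should return the index of the first occurrence of each matching reference value:

mzs <- c(175.119091510163, 349.231397186144)
mz_ref <- c(rep(175.119, 3), rep(349.2306, 2))
closest2(mzs, mz_ref, ppm = 40)
[1] 1 4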

The C function C_closest_dup_closest is already very difficult to understand. Removing the duplicates in the loop would slightly increase the runtime and would degrade the already poor readability of the function even further. And I am not sure that we would really cover all corner cases with this solution.

All in all, the closest function is incredibly hard to implement correctly.

Finding the closest pair is relatively easy to do fast and reliably. The problem arises when we have to reassign a mapping because we found a closer match, and that reassignment can be recursive: e.g. x_n is closer to table_m than x_n-1 (which was the current best match), but x_n-1 is closer to table_m-1 than x_n-2 (which was the current best match there), so we have to reassign x_n-1 to table_m-1, re-evaluate x_n-2, and so on. This becomes even more difficult if we allow duplicates.

An easy way would be to find the closest match, remove it from the set, and repeat until nothing is left. While easy to implement (and it would probably produce correct solutions even in the described cases), it would be very slow because we need to compare every element again and again.
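
A minimal R sketch of that brute-force idea (a hypothetical closest_greedy(), purely illustrative; it recomputes the full distance matrix in every iteration, which is exactly why it would be slow):

closest_greedy <- function(x, table, tolerance = Inf) {
    out <- rep.int(NA_integer_, length(x))
    xs <- seq_along(x)
    ts <- seq_along(table)
    while (length(xs) && length(ts)) {
        ## all pairwise distances of the remaining elements
        d <- abs(outer(x[xs], table[ts], "-"))
        if (min(d) > tolerance)
            break
        ## take the globally closest pair and remove both of its elements
        best <- which(d == min(d), arr.ind = TRUE)[1L, ]
        out[xs[best[1L]]] <- ts[best[2L]]
        xs <- xs[-best[1L]]
        ts <- ts[-best[2L]]
    }
    out
}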

I can't imagine that we are the first people in the world who have the problem of finding the closest pairs in two sorted one-dimensional arrays. IMHO most of the kNN algorithms for n-dimensional spaces are too complex for our problem (and they allow many-to-one assignments, but we want to allow just one-to-one assignments).

Does anybody have a suggestion for an algorithm that we could implement here?
I think closest is a key function that we use a lot in many different places, so it is worth improving it.

jorainer (Member, Author) commented:

Thanks for your comments, @sgibb. Maybe an intermediate solution would be what you suggested, i.e. to remove duplicated values beforehand. But only if this would not increase the runtime considerably. In the end, having duplicated m/z values in a spectrum would IMHO be wrong anyway, because an MS instrument could never generate such data.

sgibb (Member) commented Mar 22, 2021

to remove duplicated values beforehand. But only if this would not increase the runtime considerably.

Doing this in R already doubles the runtime on my machine for n = 100. I did a quick check in C (without adjusting the index by cumsum) and it increases the runtime by around 10-25 % for n = 10^(2:6). IMHO even that is too high for a corner case that should never happen. (BTW: even an anyDuplicated error check makes the function considerably slower.)
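
Such a comparison could be reproduced along these lines (assuming the microbenchmark package and the closest2() sketch above; the exact numbers will of course depend on the machine):

library(MsCoreUtils)
library(microbenchmark)
set.seed(42)
x <- sort(runif(100, 100, 1000))
y <- sort(runif(100, 100, 1000))
microbenchmark(
    closest(x, y, duplicates = "closest", tolerance = 0, ppm = 20),
    closest2(x, y, ppm = 20)
)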

In the end, having duplicated m/z values in a spectrum would IMHO be wrong anyway, because an MS instrument could never generate such data.

In theory it should not produce such data. But you found them, didn't you? 😄
I guess rounding errors and the conversion from 64 to 32 bit could produce duplicates, couldn't they?

jorainer (Member, Author) commented:

In theory it should not produce such data. But you found them, didn't you? 😄
I guess rounding errors and the conversion from 64 to 32 bit could produce duplicates, couldn't they?

Yes, I found them in some "reference spectra" from the Human Metabolome Database. No idea how they got there, but they are likely rounding errors. But even if so, I'm not aware of any current MS instrument with a precision so high that it could measure peaks with a difference in m/z small enough to be affected by such a rounding (64 to 32 bit). I believe that there might be some "human mistake" in there...

jorainer (Member, Author) commented:

I would suggest we keep this issue open for the record/reference but don't fix the problem.

jorainer added the wontfix label on Mar 22, 2021
andreavicini commented:
Hello, I noticed that there is still a strange problem with duplicates = "closest". For example

closest(x = c(1, 2, 3), table = c(0, 2, 3), ppm = 0, tolerance = 0, duplicates = "closest")

returns NA NA 3, but I would expect NA 2 3 (as in the case where duplicates = "closest" is not used, since there are no multiple matches between x and table).

sgibb (Member) commented Dec 28, 2021

JFTR: I asked for a better algorithm on StackOverflow: https://stackoverflow.com/questions/70486603/closest-pairs-of-elements-in-two-sorted-arrays

Currently I think a good algorithm would be:

  1. Merge both arrays (x and table) as in merge sort and keep track of the origin in another array (whether each element belongs to x or table).
  2. Remove every element whose previous and next elements are both not from the other array.
  3. Calculate the distances from each x to its previous and next element in table and store the x:table pairs.
  4. Sort these distances in ascending order and drop all pairs above the tolerance.
  5. Assign the x with the shortest distance to its corresponding table element and remove all pairs in which either of the two elements occurs.
  6. Repeat step 5 until no pair is left.

I assume this algorithm would solve all of the problems mentioned above. Nevertheless, I think the runtime will increase. A rough sketch of the idea follows.
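
A rough R sketch (a hypothetical closest_pairs(), not the package implementation; because both inputs are sorted, steps 1-2, the merge, are replaced here by findInterval()):

closest_pairs <- function(x, table, tolerance = Inf) {
    ## step 3: candidate pairs -- each x with its previous and next table element
    i <- findInterval(x, table)
    cand <- unique(rbind(
        cbind(xi = seq_along(x), ti = pmax(i, 1L)),
        cbind(xi = seq_along(x), ti = pmin(i + 1L, length(table)))
    ))
    d <- abs(x[cand[, "xi"]] - table[cand[, "ti"]])
    ## step 4: drop pairs above the tolerance, sort the rest by distance
    keep <- which(d <= tolerance)
    cand <- cand[keep[order(d[keep])], , drop = FALSE]
    ## steps 5-6: take the shortest pair, skip pairs sharing an assigned element
    out <- rep.int(NA_integer_, length(x))
    used <- logical(length(table))
    for (k in seq_len(nrow(cand))) {
        if (is.na(out[cand[k, "xi"]]) && !used[cand[k, "ti"]]) {
            out[cand[k, "xi"]] <- cand[k, "ti"]
            used[cand[k, "ti"]] <- TRUE
        }
    }
    out
}

For the example reported above, closest_pairs(c(1, 2, 3), c(0, 2, 3), tolerance = 0) should give NA 2 3.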

jorainer (Member, Author) commented:

My C-fu is really bad - do you think you might find the time to implement that version, @sgibb?

sgibb (Member) commented Jan 11, 2022

@jorainer I am not sure that the time to do this would be well invested. While the above-mentioned algorithm would IMHO find the correct solutions, the current one does as well except for a few edge cases (fewer than 3 elements in x and table, more than 2 duplicates, ...).
And the runtime would potentially increase dramatically (the distances have to be calculated multiple times and subsequently sorted multiple times, ...).

jorainer (Member, Author) commented:

OK, then let's keep the "wontfix" label.
