Problem with duplicates = "closest" in closest #72

Open
jorainer opened this issue Oct 15, 2020 · 12 comments

Labels: bug (Something isn't working) · wontfix (This will not be worked on)

jorainer (Member) commented Oct 15, 2020

There is a strange problem in the closest function when a query m/z matches more than one target m/z:

What works fine:

mzs <- c(175.119091510163, 349.231397186144)
mz_ref <- c(rep(175.119, 2), rep(349.2306, 2))
mz_ref
[1] 175.1190 175.1190 349.2306 349.2306
MsCoreUtils::closest(mzs, mz_ref, duplicates = "closest", tolerance = 0, ppm = 40)
[1] 1 3

But with three duplicates of the first target m/z, the second query m/z is no longer matched:

mz_ref <- c(rep(175.119, 3), rep(349.2306, 2))
mz_ref
[1] 175.1190 175.1190 175.1190 349.2306 349.2306
MsCoreUtils::closest(mzs, mz_ref, duplicates = "closest", tolerance = 0, ppm = 40)
[1]  1 NA

So, for whatever odd reason, the match for the second m/z is dropped. Note that without duplicates = "closest" it still works.
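
For reference, at ppm = 40 the matching window around the first query m/z is about 0.007, which easily covers its difference to the reference value (just illustrating the arithmetic):

175.119091510163 * 40 / 1e6
[1] 0.007004764
abs(175.119091510163 - 175.119)
[1] 9.151016e-05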

My session info:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MsCoreUtils_1.1.8   BiocManager_1.30.10

loaded via a namespace (and not attached):
[1] MASS_7.3-53         compiler_4.0.2      parallel_4.0.2     
[4] S4Vectors_0.27.14   BiocGenerics_0.35.4 stats4_4.0.2       
jorainer added the bug label on Oct 15, 2020
sgibb (Member) commented Oct 21, 2020

Confirmed. I created some unit tests for this. Currently I am very busy and it will take some time to fix it.

I am always surprised by how you find such bugs 👍

jorainer (Member, Author) commented Oct 21, 2020

Thanks @sgibb for looking into this - and sure, you have more urgent things to do at present, so take your time.

I stumble across these while analyzing real-world data, which is never perfect and, as in this case, leads to such strange mappings.

sgibb (Member) commented Mar 19, 2021

We could circumvent the described problem by filtering the duplicated values before running closest:

closest2 <- function(x, y, ...) {
    ## match only the unique values; pass ppm etc. through ...
    xd <- duplicated(x)
    yd <- duplicated(y)
    i <- closest(x[!xd], y[!yd], duplicates = "closest", tolerance = 0, ...)
    out <- rep.int(NA_integer_, length(x))
    ## map the matches back to positions in the full (non-deduplicated) y
    out[!xd] <- which(!yd)[i]
    out
}

(Because our input arguments are sorted, we could tweak this a little in C and just compare each element to its predecessor instead of checking the whole vector.)
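
Applied to the failing example from the report (with ppm passed through ...), this should return the index of the first occurrence of each matching reference value:

mzs <- c(175.119091510163, 349.231397186144)
mz_ref <- c(rep(175.119, 3), rep(349.2306, 2))
closest2(mzs, mz_ref, ppm = 40)
[1] 1 4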

The C function C_closest_dup_closest is already very difficult to understand. Removing the duplicates in the loop would slightly increase the runtime and would degrade the already poor readability of the function even further. And I am not sure that we would really cover all corner cases with this solution.

All in all, the closest function is incredibly hard to implement correctly.

Finding the closest pair is relatively easy to do fast and reliably. The problem arises when we have to reassign a mapping because we found a closer match, and that reassignment can be recursive: e.g. x_n is closer to table_m than x_n-1 (which was the current best match), but x_n-1 is closer to table_m-1 than x_n-2 (which was the current best match there), so we have to reassign x_n-1 to table_m-1, re-evaluate x_n-2, and so on. This becomes even more difficult if we allow duplicates.

An easy way would be to find the closest match, remove it from the set, and repeat until nothing is left. While easy to implement (and it would probably produce correct solutions even in the described cases), it would be very slow because we need to compare every element again and again.
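
A minimal R sketch of that brute-force idea (a hypothetical closest_greedy(), purely illustrative; it recomputes the full distance matrix in every iteration, which is exactly why it would be slow):

closest_greedy <- function(x, table, tolerance = Inf) {
    out <- rep.int(NA_integer_, length(x))
    xs <- seq_along(x)
    ts <- seq_along(table)
    while (length(xs) && length(ts)) {
        ## all pairwise distances of the remaining elements
        d <- abs(outer(x[xs], table[ts], "-"))
        if (min(d) > tolerance)
            break
        ## take the globally closest pair and remove both of its elements
        best <- which(d == min(d), arr.ind = TRUE)[1L, ]
        out[xs[best[1L]]] <- ts[best[2L]]
        xs <- xs[-best[1L]]
        ts <- ts[-best[2L]]
    }
    out
}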

I can't imagine that we are the first people in the world who have the problem of finding the closest pairs in two sorted one-dimensional arrays. IMHO most of the kNN algorithms for n-dimensional spaces are too complex for our problem (and they allow many-to-one assignments, but we want to allow just one-to-one assignments).

Does anybody have a suggestion for an algorithm that we could implement here?
I think closest is a key function that we use a lot in many different places, so it is worth improving it.

jorainer (Member, Author) commented:

Thanks for your comments, @sgibb. Maybe an intermediate solution would be what you suggested, i.e. to remove duplicated values beforehand. But only if this would not increase the runtime considerably. In the end, having duplicated m/z values in a spectrum would IMHO be wrong anyway, because an MS instrument could never generate such data.

sgibb (Member) commented Mar 22, 2021

to remove duplicated values beforehand. But only if this would not increase the runtime considerably.

Doing this in R already doubles the runtime on my machine for n = 100. I did a quick check in C (without adjusting the index by cumsum) and it increases the runtime by around 10-25 % for n = 10^(2:6). IMHO even that is too high for a corner case that should never happen. (BTW: even an anyDuplicated error check makes the function considerably slower.)
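
Such a comparison could be reproduced along these lines (assuming the microbenchmark package and the closest2() sketch above; the exact numbers will of course depend on the machine):

library(MsCoreUtils)
library(microbenchmark)
set.seed(42)
x <- sort(runif(100, 100, 1000))
y <- sort(runif(100, 100, 1000))
microbenchmark(
    closest(x, y, duplicates = "closest", tolerance = 0, ppm = 20),
    closest2(x, y, ppm = 20)
)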

In the end, having duplicated m/z values in a spectrum would IMHO be wrong anyway, because an MS instrument could never generate such data.

In theory it should not produce such data. But you found them, didn't you? 😄
I guess rounding errors and the conversion from 64 to 32 bit could produce duplicates, couldn't they?

jorainer (Member, Author) commented:

In theory it should not produce such data. But you found them, didn't you? 😄
I guess rounding errors and the conversion from 64 to 32 bit could produce duplicates, couldn't they?

Yes, I found them in some "reference spectra" from the Human Metabolome Database. No idea how they got there, but they are likely rounding errors. But even if so, I'm not aware of any current MS instrument with a precision so high that it could measure peaks with a difference in m/z small enough to be affected by such a rounding (64 to 32 bit). I believe that there might be some "human mistake" in there...

jorainer (Member, Author) commented:

I would suggest we keep this issue open for the record/reference but don't fix the problem.

jorainer added the wontfix label on Mar 22, 2021
andreavicini commented:
Hello, I noticed that there is still a strange problem with duplicates = "closest". For example

closest(x = c(1, 2, 3), table = c(0, 2, 3), ppm = 0, tolerance = 0, duplicates = "closest")

returns NA NA 3, but I would expect NA 2 3 (as in the case where duplicates = "closest" is not used, since there are no multiple matches between x and table).

sgibb (Member) commented Dec 28, 2021

JFTR: I asked for a better algorithm on StackOverflow: https://stackoverflow.com/questions/70486603/closest-pairs-of-elements-in-two-sorted-arrays

Currently I think a good algorithm would be:

  1. Merge both arrays (x and table) as in merge sort and keep track of the origin in another array (whether each element belongs to x or table).
  2. Remove every element whose previous and next elements are both not from the other array.
  3. Calculate the distances from each x to its previous and next element in table and store the x:table pairs.
  4. Sort these distances in ascending order and drop all pairs above the tolerance.
  5. Assign the x with the shortest distance to its corresponding table element and remove all pairs in which either of the two elements occurs.
  6. Repeat step 5 until no pair is left.

I assume this algorithm would solve all of the problems mentioned above. Nevertheless, I think the runtime will increase. A rough sketch of the idea follows.
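
A rough R sketch (a hypothetical closest_pairs(), not the package implementation; because both inputs are sorted, steps 1-2, the merge, are replaced here by findInterval()):

closest_pairs <- function(x, table, tolerance = Inf) {
    ## step 3: candidate pairs -- each x with its previous and next table element
    i <- findInterval(x, table)
    cand <- unique(rbind(
        cbind(xi = seq_along(x), ti = pmax(i, 1L)),
        cbind(xi = seq_along(x), ti = pmin(i + 1L, length(table)))
    ))
    d <- abs(x[cand[, "xi"]] - table[cand[, "ti"]])
    ## step 4: drop pairs above the tolerance, sort the rest by distance
    keep <- which(d <= tolerance)
    cand <- cand[keep[order(d[keep])], , drop = FALSE]
    ## steps 5-6: take the shortest pair, skip pairs sharing an assigned element
    out <- rep.int(NA_integer_, length(x))
    used <- logical(length(table))
    for (k in seq_len(nrow(cand))) {
        if (is.na(out[cand[k, "xi"]]) && !used[cand[k, "ti"]]) {
            out[cand[k, "xi"]] <- cand[k, "ti"]
            used[cand[k, "ti"]] <- TRUE
        }
    }
    out
}

For the example reported above, closest_pairs(c(1, 2, 3), c(0, 2, 3), tolerance = 0) should give NA 2 3.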

jorainer (Member, Author) commented:

My C-fu is really bad - do you think you might find the time to implement that version, @sgibb?

sgibb (Member) commented Jan 11, 2022

@jorainer I am not sure that the time to do this would be well invested. While the above-mentioned algorithm would IMHO find the correct solutions, the current one does as well except for a few edge cases (fewer than 3 elements in x and table, more than 2 duplicates, ...).
And the runtime would potentially increase dramatically (the distances have to be calculated multiple times and subsequently sorted multiple times, ...).

jorainer (Member, Author) commented:

OK, then let's keep the "wontfix" label.
