TG2 - What about testing for duplicate records in a dataset? #176
While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful, or is it implicit that datasets shouldn't have duplicate records?

Comments
We discussed this in detail previously. Identifying duplicates reliably is difficult (CRIA in Brazil did it and it works reasonably well) but, again, from my view it is not a core test.
Thanks for the feedback @ArthurChapman. My question originates from working with data extracted by various institutions where they haven't understood the results of doing joins in queries over 1:n relationships in their databases, so there can be several copies of a record with different values for a particular field (like collector). Agreed, though, that this should not be a core test. Are the supplementary tests documented online?
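A minimal illustration of the join problem described above, using sqlite3 with hypothetical table and column names: a single specimen linked to two collectors comes back from the join as two rows that look like duplicate records differing only in the collector field.

```python
import sqlite3

# One specimen, two collectors: a 1:n relationship.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE specimen (id INTEGER PRIMARY KEY, catalog_number TEXT);
    CREATE TABLE collector (specimen_id INTEGER, name TEXT);
    INSERT INTO specimen VALUES (1, 'PRE-12345');
    INSERT INTO collector VALUES (1, 'Smith, J.'), (1, 'Jones, K.');
""")

# The join multiplies the specimen row by its collectors.
rows = con.execute("""
    SELECT s.catalog_number, c.name
    FROM specimen s JOIN collector c ON c.specimen_id = s.id
""").fetchall()
print(rows)  # [('PRE-12345', 'Smith, J.'), ('PRE-12345', 'Jones, K.')]
```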
Not really documented, and I am sure that they won't be in this process. Many of the Supplementary tests are "use/user dependent" whereas most of the Core tests are "use/user independent" - i.e., they pick up problems (often true errors) regardless of what the use is likely to be. There is a spreadsheet; I don't have the link with me as I am travelling, but @Tasilee could provide it for you. The tests are partly documented in that spreadsheet, but haven't been put into the formats that the Core tests have evolved to during discussions.
@ianengelbrecht - I've sent you a link to the original master spreadsheet. Note the worksheets; one covers the Supplementary tests.
They definitely would be useful @ianengelbrecht, and increasingly necessary. At GBIF scale, for example, DwC Occurrence records for a DNA sequence, a citation in literature, and specimens from institutions can all relate to the same collecting event in nature, but the idiosyncrasies of data management make it difficult to link them. We are currently exploring clustering algorithms to provide these links.
There is a family of quality tests that are needed (e.g. outlier detection, likelihood of identification, similar records, inferring grids, etc.). Do you envisage a future process accommodating this kind of thing, please?
I agree with @timrobertson100 - the tests were designed to apply at the record level, and therefore assertions from the tests can be accumulated over any set of multiple records. I also agree with @ArthurChapman regarding detecting 'duplicate' records. The ALA also has a duplicate record tester, but the permutations of pathways and alterations/amendments require a more comprehensive review and a more standardized use of identifiers. The latter is happening (e.g., the GBIF-ALA alignment project), but slowly.
@ianengelbrecht the framework within which we are defining these tests allows for the definition of tests that operate on data sets (MultiRecord in terms of the framework), and allows for the definition of tests that would specify how duplicates might be detected and how those duplicates might be dealt with. Running such a test for QualityAssurance might result in assertions that duplicate occurrence records should be deduplicated. Similarly, a test could be phrased to add record relationships linking potential duplicate occurrence records. The potential data quality needs are diverse, and the potential range of what such tests might specify is large, so we haven't tried to define one here. Taking @ArthurChapman's case "Collector + Collector no. + date", we might frame a (non-CORE) test along these lines:
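A minimal sketch, assuming a MultiRecord test that flags potential duplicates on the key "Collector + Collector no. + date"; recordedBy, recordNumber, and eventDate are the corresponding Darwin Core terms, while the function name, the POTENTIAL_ISSUE status string, and the assertion structure are illustrative only:

```python
from collections import defaultdict

def flag_potential_duplicates(records):
    """Group occurrence records sharing recordedBy + recordNumber + eventDate.

    records is a list of dicts keyed by Darwin Core term names. Returns one
    assertion per group of two or more matching records, in the spirit of a
    MultiRecord test producing assertions over a data set.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec.get("recordedBy"), rec.get("recordNumber"), rec.get("eventDate"))
        if all(key):  # skip records where any part of the key is missing
            groups[key].append(rec.get("occurrenceID"))

    assertions = []
    for key, ids in groups.items():
        if len(ids) > 1:
            assertions.append({
                "status": "POTENTIAL_ISSUE",
                "comment": f"{len(ids)} records share recordedBy/recordNumber/eventDate {key}",
                "occurrenceIDs": ids,
            })
    return assertions
```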
Not in any way recommending this test, just indicating how such a test might be phrased in a form consistent with the framework and the CORE tests we have defined, though it is quite different in that it pertains to a MultiRecord rather than a SingleRecord, and we haven't explored that space very much.
I agree with @chicoreus. We have concluded that our tests relate to Darwin Core terms in a single record. Also, de-duplication is non-trivial. In many cases I have seen in the ALA (where duplicate records are flagged), the 'best' outcome, after considerable research, is an 'amalgamated' record.
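For illustration only (not the ALA's actual procedure, and all names here are hypothetical): a naive sketch of amalgamation that keeps each Darwin Core term's value where the duplicates agree and sets aside conflicting values for human review rather than guessing.

```python
def amalgamate(duplicates):
    """Merge a group of duplicate records into one 'amalgamated' record.

    Returns (merged, conflicts): merged holds terms on which all non-empty
    values agree; conflicts holds terms with disagreeing values.
    """
    merged, conflicts = {}, {}
    terms = {term for rec in duplicates for term in rec}
    for term in terms:
        values = {rec[term] for rec in duplicates if rec.get(term)}
        if len(values) == 1:
            merged[term] = values.pop()
        elif values:
            conflicts[term] = sorted(values)
    return merged, conflicts

merged, conflicts = amalgamate([
    {"catalogNumber": "PRE-12345", "recordedBy": "Smith, J."},
    {"catalogNumber": "PRE-12345", "recordedBy": "Smith, J.; Jones, K."},
])
# merged    == {"catalogNumber": "PRE-12345"}
# conflicts == {"recordedBy": ["Smith, J.", "Smith, J.; Jones, K."]}
```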
As mentioned by @chicoreus and @Tasilee, our tests examine data within a single record within one database. Testing for duplicates is a much larger problem.
These types of tests become very user/use dependent.
> These types of tests become very user/use dependent.

Which contributes to our judgement that they don't belong in CORE.