
TG2 - What about testing for duplicate records in a dataset? #176

ianengelbrecht opened this issue Mar 25, 2019 · 10 comments

@ianengelbrecht

While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful or is it implicit that datasets shouldn't have duplicate records?

@ArthurChapman

We discussed this in detail previously. Identifying duplicates is difficult for several reasons:

  1. What people mean by "duplicates" can differ:
    a). Collector + collector number - some collectors start their numbers anew each year, so this is unreliable;
    b). Collector + date - the collector may have collected from more than one plant of the species on the same day;
    c). Collector + collector number + date - but the collector could have taken material from the same plant and split it into two specimens, one with flowers, one with fruits;
    d). Collector + collector number + date + location - but again see c). above;
    e). Different institutions may use different formats for a collector (F.Mueller, F von Mueller, F.M., FvM, F.Muell., etc.) in their databases, adding another layer of complexity.
  2. A duplicate is not an "error". Botanists "duplicate" parts of a plant to many institutions, often before georeferencing, which can mean each of the duplicates is different (and difficult to detect), yet each still provides valuable input. They can also mean different things to different users: to taxonomists they supply valuable additional information; to a modeller they may be just noise and a source of bias.
  3. In modelling, a duplicate may be regarded as more than one record in a grid square (making use of the term confusing, as it is not a duplicate other than by location alone).
  4. We have retained the tests we originally had for duplicates in the Supplementary (non-Core) tests, along with hundreds of other tests we regarded at the time as non-Core.
  5. Once specimen-level GUIDs become accepted practice, true duplicates should be able to be identified, linked and documented.

To have a reliable test of this nature is difficult (CRIA in Brazil did it, and it works reasonably well), but again, from my view, it is not a core test.
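The permutations a) through d) above can be sketched in code. This is a minimal illustration, not any project's actual de-duplication logic; the record values are invented to show how each key definition flags a different set of "duplicates":

```python
# Sketch: different "duplicate" key definitions flag different record groups.
# All field names and values are hypothetical illustrations.
from collections import defaultdict

records = [
    {"id": "occ:1", "recordedBy": "F.Mueller", "fieldNumber": "101",
     "eventDate": "1880-05-01", "locality": "Yarra R."},
    {"id": "occ:2", "recordedBy": "F.Mueller", "fieldNumber": "101",
     "eventDate": "1880-05-01", "locality": "Yarra R."},   # split specimen?
    {"id": "occ:3", "recordedBy": "F.Mueller", "fieldNumber": "101",
     "eventDate": "1881-03-10", "locality": "Murray R."},  # number reused next year
]

# Candidate key definitions corresponding to cases a) through d)
key_defs = {
    "collector+number":      lambda r: (r["recordedBy"], r["fieldNumber"]),
    "collector+date":        lambda r: (r["recordedBy"], r["eventDate"]),
    "collector+number+date": lambda r: (r["recordedBy"], r["fieldNumber"], r["eventDate"]),
    "collector+number+date+locality":
        lambda r: (r["recordedBy"], r["fieldNumber"], r["eventDate"], r["locality"]),
}

def duplicate_groups(records, key):
    """Group record ids sharing the same key; return only groups of 2+."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

for name, key in key_defs.items():
    print(name, duplicate_groups(records, key))
```

Note that "collector+number" wrongly lumps in occ:3 (a number restarted the next year), while the other keys cannot distinguish a split specimen from a true duplicate, which is exactly the ambiguity described above.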

@ianengelbrecht

Thanks for the feedback @ArthurChapman. My question arose from working with data extracts from various institutions where staff haven't understood the results of joins over 1:n relationships in their databases, so there can be several copies of a record with different values for a particular field (like collector). Agreed, though, that this should not be a core test. Are the supplementary tests documented online?
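The 1:n join pitfall described here is easy to demonstrate. The schema and names below are invented for illustration; the point is that a naive join multiplies the one-side rows, which then look like duplicate records in an export:

```python
# Sketch of the 1:n join pitfall: joining a specimen table to a child
# collectors table yields one exported "record" per collector, not per
# specimen. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE specimen (specimen_id TEXT PRIMARY KEY, catalog_number TEXT);
CREATE TABLE specimen_collector (specimen_id TEXT, collector TEXT);
INSERT INTO specimen VALUES ('s1', 'PRE-0001');
INSERT INTO specimen_collector VALUES ('s1', 'I. Engelbrecht'),
                                      ('s1', 'A. Chapman');
""")

# The naive join: one specimen comes out as two rows differing only in collector
rows = con.execute("""
    SELECT s.catalog_number, c.collector
    FROM specimen s JOIN specimen_collector c USING (specimen_id)
""").fetchall()
print(rows)  # two rows for a single specimen

# Aggregating the n-side back to one row per specimen avoids the spurious copies
flat = con.execute("""
    SELECT s.catalog_number, group_concat(c.collector, ' | ')
    FROM specimen s JOIN specimen_collector c USING (specimen_id)
    GROUP BY s.specimen_id
""").fetchone()
print(flat)
```

Flattening the n-side with an aggregate (or exporting it as a separate extension file, as Darwin Core Archives do) keeps one record per specimen.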

@ArthurChapman

Not really documented, and I am sure that they won't be in this process. Many of the Supplementary tests are "use/user dependent", whereas most of the Core tests are "use/user independent" - i.e. they pick up problems (often true errors) regardless of what the use is likely to be. There is a spreadsheet; I don't have the link with me as I am travelling, but @Tasilee could provide it for you. The Supplementary tests are partly documented in that spreadsheet, but haven't been put into the formats that the Core tests have evolved to during discussions.

@Tasilee

Tasilee commented Mar 27, 2019

@ianengelbrecht - I've sent you a link to the original master spreadsheet. Note the worksheets: one covers the Supplementary tests.

@timrobertson100

> While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful or is it implicit that datasets shouldn't have duplicate records?

They definitely would be useful, @ianengelbrecht, and increasingly necessary. At GBIF scale, for example, DwC Occurrence records for a DNA sequence, a citation in literature, and specimens from institutions can all relate to the same collecting event in nature, but the idiosyncrasies of data management make it difficult to link them. We are currently exploring clustering algorithms to provide these links.
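The clustering idea can be sketched minimally: group records from different sources that share a normalised collecting-event key. This is not GBIF's actual algorithm; the normalisation rule and record data below are simplified assumptions for illustration:

```python
# Minimal sketch of cross-source record linkage: cluster occurrence records
# (specimen, sequence, literature citation) on a normalised event key.
# The normalisation rule and data are illustrative assumptions.
import re
from collections import defaultdict

records = [
    {"id": "specimen:1", "recordedBy": "F. von Mueller", "eventDate": "1880-05-01"},
    {"id": "sequence:7", "recordedBy": "F.Mueller",      "eventDate": "1880-05-01"},
    {"id": "citation:3", "recordedBy": "Mueller, F.",    "eventDate": "1881-01-01"},
]

def norm_collector(name):
    # Crude surname extraction: longest alphabetic token, lowercased.
    # Real linkage would need much more robust name matching.
    tokens = re.findall(r"[A-Za-z]+", name)
    return max(tokens, key=len).lower() if tokens else ""

def event_key(r):
    return (norm_collector(r["recordedBy"]), r["eventDate"])

clusters = defaultdict(list)
for r in records:
    clusters[event_key(r)].append(r["id"])

links = [ids for ids in clusters.values() if len(ids) > 1]
print(links)  # the specimen and the sequence link; the citation differs by date
```

Even this toy version shows why collector-name variants (case e) in the list above) dominate the difficulty: the key is only as good as the name normalisation.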

> Not really documented and I am sure that they won't be in this process

Do you envisage a future process accommodating this kind of thing, please? There is a family of quality tests that are needed (e.g. outlier detection, the likelihood of identification, similar records, inferring grids etc).

@Tasilee

Tasilee commented Apr 1, 2020

I agree with @timrobertson100 - the tests were designed to apply at the record level and therefore assertions from the tests can be accumulated on any set of multiple records.

I also agree with @ArthurChapman as regards detecting 'duplicate' records. The ALA also has a duplicate record tester, but the permutations of pathways and alterations/amendments require a more comprehensive review and a more standardized use of identifiers. The latter is happening (e.g., the GBIF-ALA alignment project), but slowly.

@chicoreus

@ianengelbrecht the framework within which we are defining these tests allows for tests that operate on data sets (MultiRecord in terms of the framework), including tests that specify how duplicates might be detected and how those duplicates might be dealt with. Running such a test for QualityAssurance might result in assertions that duplicate occurrence records should be deduplicated. Similarly, a test could be phrased to add record relationships linking potential duplicate occurrence records. The potential data quality needs are diverse, and the range of what such tests might specify is large, so we haven't tried to define one here.

Taking @ArthurChapman's case "Collector + collector number + date", we might frame a (non-CORE) test in the form:

| Field | Value |
| --- | --- |
| Label | AMENDMENT_OCCURRENCE_DUPLICATED |
| Description | Proposes resource relationships linking occurrence records that appear to be duplicates of one another. |
| Output Type | Amendment |
| Resource Type | MultiRecord |
| Information Elements | dwc:recordedBy, dwc:fieldNumber, dwc:eventDate, dwc:occurrenceID, resourceID, relationshipOfResourceID, relatedResourceID |
| Expected Response | INTERNAL_PREREQUISITES_NOT_MET if any of dwc:recordedBy, dwc:fieldNumber, dwc:eventDate, or dwc:occurrenceID are EMPTY; FILLED_IN for each SingleRecord where dwc:recordedBy, dwc:fieldNumber, and dwc:eventDate are identical to another SingleRecord with a different dwc:occurrenceID, asserting resourceID, relationshipOfResourceID="duplicate of", and relatedResourceID relating these records. |
| Data Quality Dimension | Uniqueness |

I am not in any way recommending this test, just indicating how such a test might be phrased in a form consistent with the framework and the CORE tests we have defined, though it is quite different in that it pertains to a MultiRecord rather than a SingleRecord, and we haven't explored that space very much.
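As an illustration only, the Expected Response above could be implemented roughly as follows. The function name, record representation, and the NOT_AMENDED fallback status are assumptions for this sketch, not part of the framework's specification:

```python
# Illustrative sketch of the non-CORE MultiRecord amendment described above:
# records sharing dwc:recordedBy + dwc:fieldNumber + dwc:eventDate but with
# different dwc:occurrenceID get proposed "duplicate of" relationships.
# Function name, record shape, and NOT_AMENDED fallback are assumptions.
from collections import defaultdict

KEY_TERMS = ("recordedBy", "fieldNumber", "eventDate")

def amendment_occurrence_duplicated(multirecord):
    # INTERNAL_PREREQUISITES_NOT_MET if any key term or occurrenceID is EMPTY
    for rec in multirecord:
        if any(not rec.get(t) for t in KEY_TERMS + ("occurrenceID",)):
            return {"status": "INTERNAL_PREREQUISITES_NOT_MET",
                    "relationships": []}

    groups = defaultdict(list)
    for rec in multirecord:
        groups[tuple(rec[t] for t in KEY_TERMS)].append(rec["occurrenceID"])

    relationships = []
    for ids in groups.values():
        if len(ids) > 1:
            # Propose pairwise "duplicate of" resource relationships
            for a in ids:
                for b in ids:
                    if a != b:
                        relationships.append(
                            {"resourceID": a,
                             "relationshipOfResourceID": "duplicate of",
                             "relatedResourceID": b})
    status = "FILLED_IN" if relationships else "NOT_AMENDED"
    return {"status": status, "relationships": relationships}

sample = [
    {"occurrenceID": "occ:1", "recordedBy": "F.Mueller",
     "fieldNumber": "101", "eventDate": "1880-05-01"},
    {"occurrenceID": "occ:2", "recordedBy": "F.Mueller",
     "fieldNumber": "101", "eventDate": "1880-05-01"},
]
print(amendment_occurrence_duplicated(sample)["status"])  # FILLED_IN
```

As the discussion notes, an amendment like this could only propose relationships for human review; it cannot decide whether the match is a true duplicate, a split specimen, or a reused collector number.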

@Tasilee

Tasilee commented Jun 23, 2023

I agree with @chicoreus. We have concluded that our tests relate to Darwin Core terms in a single record. Also, de-duplication is non-trivial. In many cases I have seen in the ALA (where duplicate records are flagged), the 'best' outcome, after considerable research, is an 'amalgamated' record.

@ArthurChapman

As mentioned by @chicoreus and @Tasilee, our tests examine data within a single record within one database. Testing for duplicates is a much larger problem:

  1. Duplicate records of the same record within one database
  2. Duplicates in one database - but some institutions may have different records for specimen, liquid material, DNA, etc., with links
  3. Duplicates between databases which are actually different specimens and may show different things - Syntypes, Isotypes, etc.
  4. Duplicates of a species at a location (a term used in modelling, where multiple species records in one grid may be deleted)
    • also see my comment above.

These types of tests become very user/use dependent.

@chicoreus

chicoreus commented Jun 24, 2023 via email
