
TG2 - What about testing for duplicate records in a dataset? #176

ianengelbrecht opened this issue Mar 25, 2019 · 10 comments

@ianengelbrecht

While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful or is it implicit that datasets shouldn't have duplicate records?

@ArthurChapman

We discussed this in detail previously. Identifying duplicates is difficult for several reasons:

  1. What people mean by "duplicates" can differ:
    a). Collector + collector number - some collectors start their numbers anew each year, so this is unreliable;
    b). Collector + date - the collector may have collected from more than one plant of the species on the same day;
    c). Collector + collector number + date - but the collector could have taken material from the same plant and split it into two specimens, one with flowers, one with fruits;
    d). Collector + collector number + date + location - but again see c). above;
    e). Different institutions may use different formats for a collector (F.Mueller, F von Mueller, F.M., FvM, F.Muell., etc.) in their databases, adding another layer of complexity.
  2. A duplicate is not an "error". Botanists "duplicate" parts of a plant to many institutions, often before georeferencing, which can mean each of the duplicates is different (and difficult to detect), yet each still provides valuable input. They can also mean different things to different users: to taxonomists they supply valuable additional information; to a modeller they may be just noise and a source of bias.
  3. In modelling, a duplicate may be regarded as more than one record in a grid square (making use of the term confusing, as it is not a duplicate other than by location alone).
  4. We have retained the tests we originally had for duplicates in the Supplementary (non-Core) tests, along with hundreds of other tests we regarded at the time as non-Core.
  5. Once specimen-level GUIDs become accepted practice, true duplicates should be able to be identified, linked and documented.

To have a reliable test of this nature is difficult (CRIA in Brazil did it, and it works reasonably well), but again, from my view, it is not a core test.
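The permutations a) through d) above can be sketched in code. This is a minimal illustration, not any project's actual de-duplication logic; the record values are invented to show how each key definition flags a different set of "duplicates":

```python
# Sketch: different "duplicate" key definitions flag different record groups.
# All field names and values are hypothetical illustrations.
from collections import defaultdict

records = [
    {"id": "occ:1", "recordedBy": "F.Mueller", "fieldNumber": "101",
     "eventDate": "1880-05-01", "locality": "Yarra R."},
    {"id": "occ:2", "recordedBy": "F.Mueller", "fieldNumber": "101",
     "eventDate": "1880-05-01", "locality": "Yarra R."},   # split specimen?
    {"id": "occ:3", "recordedBy": "F.Mueller", "fieldNumber": "101",
     "eventDate": "1881-03-10", "locality": "Murray R."},  # number reused next year
]

# Candidate key definitions corresponding to cases a) through d)
key_defs = {
    "collector+number":      lambda r: (r["recordedBy"], r["fieldNumber"]),
    "collector+date":        lambda r: (r["recordedBy"], r["eventDate"]),
    "collector+number+date": lambda r: (r["recordedBy"], r["fieldNumber"], r["eventDate"]),
    "collector+number+date+locality":
        lambda r: (r["recordedBy"], r["fieldNumber"], r["eventDate"], r["locality"]),
}

def duplicate_groups(records, key):
    """Group record ids sharing the same key; return only groups of 2+."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

for name, key in key_defs.items():
    print(name, duplicate_groups(records, key))
```

Note that "collector+number" wrongly lumps in occ:3 (a number restarted the next year), while the other keys cannot distinguish a split specimen from a true duplicate, which is exactly the ambiguity described above.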

@ianengelbrecht

Thanks for the feedback @ArthurChapman. My question arose from working with data extracts from various institutions where staff haven't understood the results of joins over 1:n relationships in their databases, so there can be several copies of a record with different values for a particular field (like collector). Agreed, though, that this should not be a core test. Are the supplementary tests documented online?
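The 1:n join pitfall described here is easy to demonstrate. The schema and names below are invented for illustration; the point is that a naive join multiplies the one-side rows, which then look like duplicate records in an export:

```python
# Sketch of the 1:n join pitfall: joining a specimen table to a child
# collectors table yields one exported "record" per collector, not per
# specimen. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE specimen (specimen_id TEXT PRIMARY KEY, catalog_number TEXT);
CREATE TABLE specimen_collector (specimen_id TEXT, collector TEXT);
INSERT INTO specimen VALUES ('s1', 'PRE-0001');
INSERT INTO specimen_collector VALUES ('s1', 'I. Engelbrecht'),
                                      ('s1', 'A. Chapman');
""")

# The naive join: one specimen comes out as two rows differing only in collector
rows = con.execute("""
    SELECT s.catalog_number, c.collector
    FROM specimen s JOIN specimen_collector c USING (specimen_id)
""").fetchall()
print(rows)  # two rows for a single specimen

# Aggregating the n-side back to one row per specimen avoids the spurious copies
flat = con.execute("""
    SELECT s.catalog_number, group_concat(c.collector, ' | ')
    FROM specimen s JOIN specimen_collector c USING (specimen_id)
    GROUP BY s.specimen_id
""").fetchone()
print(flat)
```

Flattening the n-side with an aggregate (or exporting it as a separate extension file, as Darwin Core Archives do) keeps one record per specimen.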

@ArthurChapman

Not really documented, and I am sure that they won't be in this process. Many of the Supplementary tests are "use/user dependent", whereas most of the Core tests are "use/user independent" - i.e. they pick up problems (often true errors) regardless of what the use is likely to be. There is a spreadsheet; I don't have the link with me as I am travelling, but @Tasilee could provide it for you. The Supplementary tests are partly documented in that spreadsheet, but haven't been put into the formats that the Core tests have evolved to during discussions.

@Tasilee

Tasilee commented Mar 27, 2019

@ianengelbrecht - I've sent you a link to the original master spreadsheet. Note the worksheets: one covers the Supplementary tests.

@timrobertson100

> While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful or is it implicit that datasets shouldn't have duplicate records?

They definitely would be useful, @ianengelbrecht, and increasingly necessary. At GBIF scale, for example, DwC Occurrence records for a DNA sequence, a citation in literature, and specimens from institutions can all relate to the same collecting event in nature, but the idiosyncrasies of data management make it difficult to link them. We are currently exploring clustering algorithms to provide these links.
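The clustering idea can be sketched minimally: group records from different sources that share a normalised collecting-event key. This is not GBIF's actual algorithm; the normalisation rule and record data below are simplified assumptions for illustration:

```python
# Minimal sketch of cross-source record linkage: cluster occurrence records
# (specimen, sequence, literature citation) on a normalised event key.
# The normalisation rule and data are illustrative assumptions.
import re
from collections import defaultdict

records = [
    {"id": "specimen:1", "recordedBy": "F. von Mueller", "eventDate": "1880-05-01"},
    {"id": "sequence:7", "recordedBy": "F.Mueller",      "eventDate": "1880-05-01"},
    {"id": "citation:3", "recordedBy": "Mueller, F.",    "eventDate": "1881-01-01"},
]

def norm_collector(name):
    # Crude surname extraction: longest alphabetic token, lowercased.
    # Real linkage would need much more robust name matching.
    tokens = re.findall(r"[A-Za-z]+", name)
    return max(tokens, key=len).lower() if tokens else ""

def event_key(r):
    return (norm_collector(r["recordedBy"]), r["eventDate"])

clusters = defaultdict(list)
for r in records:
    clusters[event_key(r)].append(r["id"])

links = [ids for ids in clusters.values() if len(ids) > 1]
print(links)  # the specimen and the sequence link; the citation differs by date
```

Even this toy version shows why collector-name variants (case e) in the list above) dominate the difficulty: the key is only as good as the name normalisation.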

> Not really documented and I am sure that they won't be in this process

Do you envisage a future process accommodating this kind of thing, please? There is a family of quality tests that are needed (e.g. outlier detection, the likelihood of identification, similar records, inferring grids etc).

@Tasilee

Tasilee commented Apr 1, 2020

I agree with @timrobertson100 - the tests were designed to apply at the record level and therefore assertions from the tests can be accumulated on any set of multiple records.

I also agree with @ArthurChapman as regards detecting 'duplicate' records. The ALA also has a duplicate record tester, but the permutations of pathways and alterations/amendments require a more comprehensive review and a more standardized use of identifiers. The latter is happening (e.g., the GBIF-ALA alignment project), but slowly.

@chicoreus

@ianengelbrecht the framework within which we are defining these tests allows for tests that operate on data sets (MultiRecord in terms of the framework), including tests that specify how duplicates might be detected and how those duplicates might be dealt with. Running such a test for QualityAssurance might result in assertions that duplicate occurrence records should be deduplicated. Similarly, a test could be phrased to add record relationships linking potential duplicate occurrence records. The potential data quality needs are diverse, and the range of what such tests might specify is large, so we haven't tried to define one here.

Taking @ArthurChapman's case "Collector + collector number + date", we might frame a (non-CORE) test in the form:

| Field | Value |
| --- | --- |
| Label | AMENDMENT_OCCURRENCE_DUPLICATED |
| Description | Proposes resource relationships linking occurrence records that appear to be duplicates of one another. |
| Output Type | Amendment |
| Resource Type | MultiRecord |
| Information Elements | dwc:recordedBy, dwc:fieldNumber, dwc:eventDate, dwc:occurrenceID, resourceID, relationshipOfResourceID, relatedResourceID |
| Expected Response | INTERNAL_PREREQUISITES_NOT_MET if any of dwc:recordedBy, dwc:fieldNumber, dwc:eventDate, or dwc:occurrenceID are EMPTY; FILLED_IN for each SingleRecord where dwc:recordedBy, dwc:fieldNumber, and dwc:eventDate are identical to another SingleRecord with a different dwc:occurrenceID, asserting resourceID, relationshipOfResourceID="duplicate of", and relatedResourceID relating these records. |
| Data Quality Dimension | Uniqueness |

I am not in any way recommending this test, just indicating how such a test might be phrased in a form consistent with the framework and the CORE tests we have defined, though it is quite different in that it pertains to a MultiRecord rather than a SingleRecord, and we haven't explored that space very much.
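As an illustration only, the Expected Response above could be implemented roughly as follows. The function name, record representation, and the NOT_AMENDED fallback status are assumptions for this sketch, not part of the framework's specification:

```python
# Illustrative sketch of the non-CORE MultiRecord amendment described above:
# records sharing dwc:recordedBy + dwc:fieldNumber + dwc:eventDate but with
# different dwc:occurrenceID get proposed "duplicate of" relationships.
# Function name, record shape, and NOT_AMENDED fallback are assumptions.
from collections import defaultdict

KEY_TERMS = ("recordedBy", "fieldNumber", "eventDate")

def amendment_occurrence_duplicated(multirecord):
    # INTERNAL_PREREQUISITES_NOT_MET if any key term or occurrenceID is EMPTY
    for rec in multirecord:
        if any(not rec.get(t) for t in KEY_TERMS + ("occurrenceID",)):
            return {"status": "INTERNAL_PREREQUISITES_NOT_MET",
                    "relationships": []}

    groups = defaultdict(list)
    for rec in multirecord:
        groups[tuple(rec[t] for t in KEY_TERMS)].append(rec["occurrenceID"])

    relationships = []
    for ids in groups.values():
        if len(ids) > 1:
            # Propose pairwise "duplicate of" resource relationships
            for a in ids:
                for b in ids:
                    if a != b:
                        relationships.append(
                            {"resourceID": a,
                             "relationshipOfResourceID": "duplicate of",
                             "relatedResourceID": b})
    status = "FILLED_IN" if relationships else "NOT_AMENDED"
    return {"status": status, "relationships": relationships}

sample = [
    {"occurrenceID": "occ:1", "recordedBy": "F.Mueller",
     "fieldNumber": "101", "eventDate": "1880-05-01"},
    {"occurrenceID": "occ:2", "recordedBy": "F.Mueller",
     "fieldNumber": "101", "eventDate": "1880-05-01"},
]
print(amendment_occurrence_duplicated(sample)["status"])  # FILLED_IN
```

As the discussion notes, an amendment like this could only propose relationships for human review; it cannot decide whether the match is a true duplicate, a split specimen, or a reused collector number.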

@Tasilee

Tasilee commented Jun 23, 2023

I agree with @chicoreus. We have concluded that our tests relate to Darwin Core terms in a single record. Also, de-duplication is non-trivial. In many cases I have seen in the ALA (where duplicate records are flagged), the 'best' outcome, after considerable research, is an 'amalgamated' record.

@ArthurChapman

As mentioned by @chicoreus and @Tasilee, our tests examine data within a single record within one database. Testing for duplicates is a much larger problem:

  1. Duplicate records of the same record within one database
  2. Duplicates in one database - but some institutions may have different records for specimen, liquid material, DNA, etc., with links
  3. Duplicates between databases which are actually different specimens and may show different things - Syntypes, Isotypes, etc.
  4. Duplicates of a species at a location (a term used in modelling, where multiple species records in one grid may be deleted)
    • also see my comment above.

These types of tests become very user/use dependent.

@chicoreus

chicoreus commented Jun 24, 2023 via email
