This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Documentation > GitHub Wiki on SORMAS duplicate detection variables and logic #23

Open
Candice-Louw opened this issue Jan 20, 2021 · 16 comments

Comments

@Candice-Louw
Collaborator

Situation Description

Duplicate detection in SORMAS is a feature that affects many users. There is currently no documentation that describes which variables are taken into account when this process is executed. This makes it difficult for SORMAS end users to decide which fields to make mandatory to capture, during the contact tracing process for example, to ensure more accurate results.

Feature Description

  1. Create a GitHub Wiki entry dedicated to the duplicate detection logic, indicating which variables are taken into consideration during the duplicate check for Person entities, i.e.

    • Case Persons
    • Contact Persons
    • Event Participants (Persons)
  2. Please complete this per server configuration i.e. DE, CH and international, as different variables are visible on these different systems and it is not clear which are/are not relevant for which configuration.

  3. Please also provide link(s) to the file(s) in the source code where this is programmed, so that developers can access it directly.

@bernardsilenou @kwa20 - please include additional entities if needed.

Possible Alternatives

This doesn't necessarily have to be a Wiki entry - it could be documented elsewhere (public) too, please. The request is simply to be able to share this information (URL) when this sort of request comes our way so that it is self-explanatory enough for anyone (technical and non-technical) to be able to understand and access the most up to date version available.

@JonasCir
Collaborator

@Candice-Louw what do you think about using GitHub Pages for that, like we do e.g. in SORMAS-glossary? The good thing is that this would also give us a solution for collecting all the documentation files currently in the repo root.

@Candice-Louw
Collaborator Author

@JonasCir - great idea, yes! Where would this process start?

@JonasCir
Collaborator

I could provide the same PR as for SORMAS-glossary to this repo, and you go ahead and start writing the documentation in Markdown? 😃 I would still request a sign-off for this approach from someone on the team, though :)

@JonasCir
Collaborator

Everything is taken from SORMAS-glossary, where it is already running in production:

  • The approach uses mkdocs
  • and a GitHub Actions workflow.
  • All docs go to a docs folder
  • and will be rendered on GitHub Pages.

I have filed this PR three times already, to SORMAS-Stats, data-generator and glossary. An advantage would be that all the docs currently cluttering the repo root could go to a dedicated folder under docs/

@bernardsilenou
Collaborator

I also support creating a section in the glossary.

@JonasCir
Collaborator

This is the question, do we put such things into glossary or main repo? The tech stack with github pages is the same.

@bernardsilenou
Collaborator

OK, I get it now. I think we should put it in the glossary, please.

@JonasCir transferred this issue from SORMAS-Foundation/SORMAS-Project Jan 21, 2021
@JonasCir
Collaborator

@bernardsilenou @Candice-Louw let's continue our discussion here in the glossary:)

@Candice-Louw
Collaborator Author

@Jan-Boehme - would it be possible to upload/share the document/info that you compiled on the current duplicate check algorithm, please?

@SORMAS-JanBoehme
Collaborator

SORMAS-JanBoehme commented May 28, 2021

@Candice-Louw

Sure, the github wiki page does not exist yet, right? Or I am too dumb to find it :-)

For everyone involved, here is the info from a discussion currently going on with the German health departments about the duplicate detection for persons:

The departments aren't really happy with the way the duplicate detection currently works, because it requires some fields to be exactly the same or else it will not work at all (more on that later).
Typos happen often when entering data, and sometimes the information they get from the persons themselves is unclear (e.g. is the person called Detlef or Detlev, Mohammed or Muhammed?).

This is why I went ahead and did some digging in the source code of the current implementation and created a concept for a more sophisticated person duplicate detection, which makes use of weighted values attached to fields that could indicate a duplicate. The sum of all these weighted values is then used to decide whether a person is presented to the user as a possible duplicate.


Current implementation:

The SELECT statement for reading possible duplicates from the database is built according to these criteria (PersonService.buildSimilarityCriteriaFilter):

(FirstName is equal OR LastName is equal)
AND
(sex is equal OR sex is null OR sex is unknown)
AND
(birthdateDD is equal AND birthdateMM is equal AND birthdateYYYY is equal) //Only if a value is provided
AND
(NationalHealthId is equal OR passportNumber is equal) //Only if a value is provided
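Read as a predicate over two person records, the criteria above could be sketched roughly as follows. This is only an illustration in Python, not the SORMAS code: the `Person` class, the `UNKNOWN` marker and `is_candidate` are made-up names, and how missing values are handled (which side may be null, whether all three birthdate parts must be provided) is an assumption, since the criteria do not spell it out.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "UNKNOWN"  # assumed marker for "sex is unknown"


@dataclass
class Person:
    # Hypothetical, simplified record; not the SORMAS Person entity.
    first_name: str
    last_name: str
    sex: Optional[str] = None
    birthdate_dd: Optional[int] = None
    birthdate_mm: Optional[int] = None
    birthdate_yyyy: Optional[int] = None
    national_health_id: Optional[str] = None
    passport_number: Optional[str] = None


def is_candidate(query: Person, stored: Person) -> bool:
    """Return True if `stored` passes all four criteria for `query`."""
    # (FirstName is equal OR LastName is equal)
    names = (query.first_name == stored.first_name
             or query.last_name == stored.last_name)

    # (sex is equal OR sex is null OR sex is unknown)
    # Assumption: a missing or unknown sex on either side never excludes.
    sex = (query.sex == stored.sex
           or query.sex in (None, UNKNOWN)
           or stored.sex in (None, UNKNOWN))

    # Day, month and year must all match.
    # Assumption: only checked when the query provides all three parts.
    q_birth = (query.birthdate_dd, query.birthdate_mm, query.birthdate_yyyy)
    s_birth = (stored.birthdate_dd, stored.birthdate_mm, stored.birthdate_yyyy)
    birthdate = None in q_birth or q_birth == s_birth

    # (NationalHealthId is equal OR passportNumber is equal)
    # Assumption: only checked when the query provides at least one ID.
    ids_given = (query.national_health_id is not None
                 or query.passport_number is not None)
    ids_match = (
        (query.national_health_id is not None
         and query.national_health_id == stored.national_health_id)
        or (query.passport_number is not None
            and query.passport_number == stored.passport_number)
    )
    ids = not ids_given or ids_match

    return names and sex and birthdate and ids
```

Because the four groups are joined with AND, a single mismatch in any group is enough to drop a record before the name similarity is ever looked at.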

The way of building the statement raises the following problems:

  • Either the first name or the last name of the person needs to be exactly the same. If it isn't, the person will never be detected as a duplicate (e.g. Jens Müller and Hens Nüller will never be considered possible duplicates, even if all other known data is exactly the same, because they are never pulled from the database for further inspection).
  • If the sex of one record is male and the other female, they are never considered duplicates, even if everything else is exactly the same. Some names are unisex, so it is not 100% clear which sex the person has, possibly resulting in one user interpreting the name as male and another as female.
  • Day, month and year are not evaluated separately from one another but are instead checked for equality and connected by a logical AND. This means that a person is only considered a duplicate if the birthdate is exactly the same (e.g. when the user makes a typo and enters the birthdate as March 3rd, 1991 or March 2nd, 1919 instead of the correct date of March 2nd, 1991, it will not be considered a duplicate).

After pulling all matches from the database, firstName and LastName are joined into one string, and the trigram distance between this string and the value in question is calculated.
If it is greater than the server config value "namesimilaritythreshold", the person is considered a possible duplicate and is presented to the user for selection.
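To make the name comparison step more concrete, here is a minimal sketch of a trigram similarity in Python. It is modelled on how PostgreSQL's pg_trgm module describes its trigrams (lowercased words, two blanks prefixed and one suffixed, shared trigrams divided by all distinct trigrams). Whether SORMAS computes the value in exactly this way is an assumption, the function names are made up, and splitting on whitespace only is a simplification, so the numbers need not match what a server returns.

```python
def trigrams(text: str) -> set:
    """Return the set of 3-character substrings of each padded word."""
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "  # two leading blanks, one trailing blank
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def name_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_possible_duplicate(first_name: str, last_name: str,
                          other_full_name: str,
                          threshold: float = 0.65) -> bool:
    """Join the names into one string and compare against the threshold."""
    # `threshold` stands in for "namesimilaritythreshold".
    full_name = f"{first_name} {last_name}"
    return name_similarity(full_name, other_full_name) > threshold
```

A threshold of 1 can never be exceeded by this measure, so a lower value such as the 0.65 default mentioned later in this thread is needed for any fuzzy match to show up.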


I will provide the concept for the weighted person duplicate check once it is mature enough, as it could have severe implications for database load, which need to be tested before even considering going ahead with a new implementation.
For example, the trigram calculation, or maybe even a phonetic algorithm for fuzzy search, would need to be done on the database when executing the query. In the worst case, that means cross-referencing every single entry in the table, which may be fine for a few hundred entries but not for over 1 million, as we have in Nigeria.

@bernardsilenou
Collaborator

@Jan-Boehme

  • I think there are many duplicate detection methods out there, and we can implement multiple if needed.
  • A challenge with all weighted methods is how the weights are defined. This differs from person to person, and a wrong assignment of weights may instead lead to false suggestions/detections. This is the only point I think we need to define clearly.
  • The current implementation should not require any variable to be exact for it to work. If that is the case, then there is surely a bug, or they need to change the value of "namesimilaritythreshold".

A few comments on your last comment:

Either the first name or the last name of the person needs to be exactly the same. If it isn't, the person will never be detected as a duplicate (e.g. Jens Müller and Hens Nüller will never be considered possible duplicates, even if all other known data is exactly the same, because they are never pulled from the database for further inspection).

If this is what they experienced, then clearly they are using a "namesimilaritythreshold" of 1, which implies an exact match. This would lead to over-conservative results. Names need not be exactly the same; even if you swap first and last name, it should not matter.
First and last names are concatenated into a string, white space is deleted, and the strings are compared using a q-gram algorithm; the resulting similarity is compared with "namesimilaritythreshold".

If the sex of one record is male and the other female, they are never considered duplicates, even if everything else is exactly the same. Some names are unisex, so it is not 100% clear which sex the person has, possibly resulting in one user interpreting the name as male and another as female.

If users are not 100% sure of the sex, then they should use "unknown" or NA as the option. If the name and all other person identifiers are the same, but the sex is male for one and female or other for the other, then they would not be suggested as duplicates.

Day, month and year are not evaluated separately from one another but are instead checked for equality and connected by a logical AND. This means that a person is only considered a duplicate if the birthdate is exactly the same (e.g. when the user makes a typo and enters the birthdate as March 3rd, 1991 or March 2nd, 1919 instead of the correct date of March 2nd, 1991, it will not be considered a duplicate).

That is right, the duplicate detection does not correct for wrong data entry. Adjusting for wrong data entry in the duplicate detection may lead to over-suggestion of possible duplicates, which is just as bad as under-suggestion.

@kwa20

kwa20 commented May 28, 2021

@bernardsilenou @Jan-Boehme

Either the first name or the last name of the person needs to be exactly the same. If it isn't, the person will never be detected as a duplicate (e.g. Jens Müller and Hens Nüller will never be considered possible duplicates, even if all other known data is exactly the same, because they are never pulled from the database for further inspection).

If this is what they experienced, then clearly they are using a "namesimilaritythreshold" of 1, which implies an exact match. This would lead to over-conservative results. Names need not be exactly the same; even if you swap first and last name, it should not matter. First and last names are concatenated into a string, white space is deleted, and the strings are compared using a q-gram algorithm; the resulting similarity is compared with "namesimilaritythreshold".

This is, however, what happens, even if the namesimilaritythreshold is the default of 0.65. Here are some examples:

| name 1        | name 2        | detected |
|---------------|---------------|----------|
| Jens Müller   | Hens Nüller   | no       |
| Jens Müller   | Hens Müller   | yes      |
| Jens Müller   | Jens Nüller   | yes      |
| Jens Müller   | Thomas Müller | no       |
| Hens Müller   | Hens Nüller   | yes      |
| Jens Nüller   | Thomas Nüller | no       |
| Thomas Muller | Dhomas Müller | no       |
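These results fit the "first name equal OR last name equal" criterion described earlier in this thread. The short Python check below applies only that name criterion to the seven pairs. The helper name is made up for illustration, and the reading of the remaining "no" rows is an inference, not something confirmed from the source code.

```python
# (name 1, name 2, detected) as reported in the examples above.
pairs = [
    (("Jens", "Müller"), ("Hens", "Nüller"), "no"),
    (("Jens", "Müller"), ("Hens", "Müller"), "yes"),
    (("Jens", "Müller"), ("Jens", "Nüller"), "yes"),
    (("Jens", "Müller"), ("Thomas", "Müller"), "no"),
    (("Hens", "Müller"), ("Hens", "Nüller"), "yes"),
    (("Jens", "Nüller"), ("Thomas", "Nüller"), "no"),
    (("Thomas", "Muller"), ("Dhomas", "Müller"), "no"),
]


def passes_name_prefilter(a, b):
    """True if first names or last names are exactly equal."""
    return a[0] == b[0] or a[1] == b[1]


for a, b, detected in pairs:
    prefilter = "passes" if passes_name_prefilter(a, b) else "fails"
    print(f"{' '.join(a):14} {' '.join(b):14} {prefilter:7} detected={detected}")
```

Every "yes" row passes the exact-name criterion, and the two rows that fail it are both "no". The other two "no" rows (the Jens/Thomas pairs) pass it, so they were presumably rejected later by the name similarity threshold.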

@SORMAS-JanBoehme
Collaborator

@bernardsilenou
The current implementation should not require any variable to be exact for it to work. If that is the case, then there is surely a bug, or they need to change the value of "namesimilaritythreshold".

I checked the source code, though, and it is implemented exactly this way. There is no bug or wrong configuration by the GSA admins.
This is confirmed by the tests @kwa20 ran, which give the same results I am getting, at least for a German test instance.

I get that we should avoid over-suggesting possible duplicates, but at the moment the current implementation deliberately hides possibly relevant information from users out of fear of overwhelming them.
Human errors happen and that's okay. SORMAS should compensate for them and help the user find and fix them instead of basically saying to the user "Well, you should have entered the birthdate correctly, tough luck.".

I can only speak for myself, but I was kind of taken aback when I first used SORMAS and realized that I have to "create" a case and then just trust the system to provide me with the correct person (who I knew for certain existed in the database), without it telling me how it decides whether I would be "allowed" to link the person.

This is what I would like to achieve with a weighted comparison system. Allowing for users to make errors and fix them easily while at the same time getting better and more relevant results from the comparison. Also making every single parameter of the duplicate detection configurable by the local administrators on-the-fly. Enabling them to make informed decisions and tailor the software exactly to their needs while being 100% transparent about what happens.
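As a rough idea of what such a weighted comparison could look like, here is a small Python sketch. Everything in it is hypothetical: the field names, the weights, the threshold and the use of `difflib.SequenceMatcher` for partial credit on names are placeholders chosen for illustration and are not taken from the concept itself, which has not been shared in this thread.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; an administrator would tune these.
WEIGHTS = {"first_name": 0.3, "last_name": 0.3, "birthdate": 0.25, "sex": 0.15}
FUZZY_FIELDS = {"first_name", "last_name"}  # fields that get partial credit
THRESHOLD = 0.6


def duplicate_score(a: dict, b: dict) -> float:
    """Sum the weights of matching fields; names score by string similarity."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # a missing value contributes nothing to the score
        if field in FUZZY_FIELDS:
            ratio = SequenceMatcher(None, va.lower(), vb.lower()).ratio()
            score += weight * ratio
        elif va == vb:
            score += weight
    return score


def is_possible_duplicate(a: dict, b: dict) -> bool:
    """Present `b` to the user if the summed score exceeds the threshold."""
    return duplicate_score(a, b) > THRESHOLD


jens = {"first_name": "Jens", "last_name": "Müller",
        "birthdate": (1991, 3, 2), "sex": "MALE"}
hens = {"first_name": "Hens", "last_name": "Nüller",
        "birthdate": (1991, 3, 2), "sex": "MALE"}
print(is_possible_duplicate(jens, hens))
```

No single field has to match exactly here: a typo in both names only lowers the score, and the matching birthdate and sex can still lift the pair above the threshold. The flip side, as noted above in this thread, is that badly chosen weights produce false suggestions.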

@maxiheyner

@MateStrysewske The local health departments often ask for more information on how exactly the duplicate detection works, but we do not have official documentation for it yet.
Jan once checked the duplicate detection for persons and documented his findings here:
#23 (comment)
But something seems to have changed in the meantime, as it no longer behaves the same way as it did then (it no longer seems necessary for either first or last name to be exactly the same).

What is the exact way the duplicate detection for different entities is implemented at the moment? We need to document that for e.g. the admin manual. Can you please help here?

@MateStrysewske added and then removed the documentation label Jul 26, 2021
@MateStrysewske

Could you please add an issue to the main GitHub repository to create such a guide and prioritise it accordingly, i.e. for the next sprint if it's urgently needed?


7 participants