SemTab matching (joint entity linking) #59
Comments
Hi @VladimirAlexiev, yes, I would love the two communities to work together, and I have started discussions in the OAEI workshop to see how that could happen. If you already have ideas about how the protocol could evolve to get closer to the SemTab challenges, don't hesitate to share them!
@VladimirAlexiev Yes, OpenRefine actually worked that way before: the Freebase Recon Service used extra columns as hints (with special, highly weighted scoring) for disambiguating properties in Freebase. This is somewhat equivalent to Wikidata's P1963 list of properties for a particular type, but it was restricted to the properties most important for telling similarly named topics apart, which we called "disambiguating properties". Here's some archived info from the old wiki about them: https://web.archive.org/web/20151002083332/http://wiki.freebase.com/wiki/Disambiguation
Hi @VladimirAlexiev, I participated in SemTab2020 with bbw-team (3rd place) and our code (https://github.com/UB-Mannheim/bbw) is open source. We used contextual matching (both vertical and horizontal) and meta-lookup for spell checking. The 'tough tables' you mention were used only in round 4 of SemTab2020, and they were only one part of the whole dataset, although the most challenging part. The majority of the tables in SemTab2020 were 'synthetically generated' (https://doi.org/10.5281/zenodo.4282879). Hi @wetneb, Hi @thadguidry,
I am keen to go in this direction! But I think there is a big overlap between CEA and the reconciliation queries we have, so I have been thinking about generalizing reconciliation queries so that CEA tasks could be formulated as reconciliation queries.
Candidate retrieval and scoring are not done in OpenRefine; they are done in the reconciliation services themselves. You can get an overview of how services generally do it here:
@shigapov In addition to Antonin's (@wetneb) excellent paper, I would recommend learning about web search technologies, text analysis strategies, and lexicographic or linguistic research in general. If you want a quick primer on building better reconciliation services for particular domains, you might start with a tool like Elasticsearch, which many people build on to create their custom reconciliation APIs. Anyways,
@thadguidry, thank you for the many links! @wetneb, thank you for the paper!
In the current specification, "query" corresponds to a label or an alias of an entity in Wikidata, right? As additional context we can specify "type" and "properties". What if, in addition to the label of an entity in Wikidata ("query"), we could specify the label of an object in the statements about that entity? Specifying both "query_subject" and "query_object" would be yet another way to include context. Matching could then return entities, properties and types. This would already be very close to what we are doing in SemTab.
We can already specify "a label of an object in the statements corresponding to the entity": that is something you can do in the "properties" section. But at the moment you are required to say which property it is a value of. If I understand correctly, in SemTab you do not specify the linking property (the CPA challenge is about inferring it), right? So one way to relax this would be to make the property id optional there.
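To make the proposal concrete, here is a minimal Python sketch of how a table row could be turned into a reconciliation query batch. The "query", "type", "limit" and "properties" keys (with "pid" and "v") follow the existing reconciliation query format; omitting "pid" is only the relaxation suggested in this thread and is not part of the current specification. The helper names `build_query` and `build_batch` are made up for illustration.

```python
import json
from typing import Dict, List, Optional, Tuple

# One context cell: (property id or None, cell value).
ContextCell = Tuple[Optional[str], str]


def build_query(label: str,
                context: List[ContextCell],
                type_id: Optional[str] = None,
                limit: int = 5) -> Dict:
    """Build one reconciliation query for a table cell.

    ASSUMPTION: a context cell whose property id is None is sent
    without "pid". Current services require "pid"; this only
    illustrates the relaxation proposed in the discussion.
    """
    properties = []
    for pid, value in context:
        prop = {"v": value}
        if pid is not None:
            prop["pid"] = pid
        properties.append(prop)

    query = {"query": label, "limit": limit, "properties": properties}
    if type_id is not None:
        query["type"] = type_id
    return query


def build_batch(rows: List[Tuple[str, List[ContextCell]]]) -> Dict[str, Dict]:
    """Key each row's query as q0, q1, ... as in a query batch."""
    return {f"q{i}": build_query(label, context)
            for i, (label, context) in enumerate(rows)}


rows = [
    ("Paris", [("P17", "France")]),   # linking property known (P17 = country)
    ("Paris", [(None, "Texas")]),     # linking property unknown, as in SemTab
]
batch = build_batch(rows)
print(json.dumps(batch, indent=2))
```

In the second row the service would have to work out by itself which property, if any, connects the candidate entity to "Texas", which is where this overlaps with the CPA task.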
@shigapov thanks for mentioning the other two SemTab tasks! A quick overview and elaboration about what was already mentioned:
@shigapov Do you have contact info for SemTab 2021, and do you know its timing?
Not necessarily - you can try to infer relations even without reconciling first. This is especially useful when some (or all!) of the entities involved do not exist in the target KB, but the ontology does have a property to represent their relations. If I give you a table where the first column looks like names of people, the second looks like city names and the third looks like dates in the second half of the 20th century, you can already suggest some likely relations between the column of names and the column of cities: placeOfBirth is one that comes to mind, for instance. Your example of peaks and altitudes is another great one: looking at the table, one should be able to guess the property even without knowing any of the peaks involved (because their names look like peak names and the numbers look like typical altitudes of peaks).
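As a rough illustration of guessing relations from the shape of the columns alone, here is a small Python sketch. Everything in it is an assumption made for the example: the coarse cell kinds, the regular expressions, and the `HYPOTHETICAL_CANDIDATES` lookup table with its property names. A real system would use much richer signals (for instance, telling names of people from names of cities), which this sketch does not attempt.

```python
import re
from collections import Counter
from typing import Dict, List

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
NUMBER = re.compile(r"^-?\d+(\.\d+)?$")

# HYPOTHETICAL lookup: (subject kind, object kind) -> candidate properties.
# The property names are placeholders, not taken from any real ontology.
HYPOTHETICAL_CANDIDATES = {
    ("text", "text"): ["placeOfBirth"],
    ("text", "date"): ["dateOfBirth"],
    ("text", "number"): ["elevation"],
}


def cell_kind(cell: str) -> str:
    """Classify one cell as 'date', 'number' or 'text'."""
    cell = cell.strip()
    if ISO_DATE.match(cell):
        return "date"
    if NUMBER.match(cell):
        return "number"
    return "text"


def guess_kind(values: List[str]) -> str:
    """Guess a column's kind by majority vote over its cells."""
    counts = Counter(cell_kind(v) for v in values)
    return counts.most_common(1)[0][0]


def suggest_relations(columns: List[List[str]]) -> Dict[int, List[str]]:
    """Suggest candidate properties linking column 0 to every other column.

    No entity is looked up anywhere: only the column shapes are used.
    """
    subject_kind = guess_kind(columns[0])
    suggestions = {}
    for index, column in enumerate(columns[1:], start=1):
        key = (subject_kind, guess_kind(column))
        suggestions[index] = HYPOTHETICAL_CANDIDATES.get(key, [])
    return suggestions


table = [
    ["Ada Lovelace", "Alan Turing"],   # looks like names
    ["London", "London"],              # looks like names too (text)
    ["1815-12-10", "1912-06-23"],      # looks like dates
]
print(suggest_relations(table))
```

The point is only that none of the cells has to exist in the target KB for a candidate property to be proposed.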
@VladimirAlexiev, there is a discussion group for SemTab (https://groups.google.com/g/sem-tab-challenge). You could also contact one of the organizers, Ernesto Jimenez-Ruiz (https://www.city.ac.uk/people/academics/ernesto-jimenez-ruiz). I do not know the timing of SemTab2021, but I expect that it will start in April-May and end in October.
http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
Describes a task called
The key difference is this:
The latter is very useful for non-stratifiable scenarios like "company and CEO", "sportsman and team", etc. So it works for a wider variety of data tasks.
They have a bunch of test cases called "tough tables". See https://tinyurl.com/iswc2020-resources-2T-dataset
You can watch the 15-minute presentation of this year's winner for a quick intro to the concepts.
https://drive.google.com/file/d/1vz-6nkc9t6MQZYzgg-PZNLs-9TT86wRD/view