Long text query may incorrectly return empty results array #116
Comments
I have just added some explanation of why we are using two endpoints here: Given OpenRefine's current behaviour I am not sure about the benefit of returning an error rather than an empty list of results. In fact, the protocol does not really define a way to return an error for a single query in a batch (perhaps that's something worth adding?).
Hi, Antonin! Thank you for your reply. Sorry I couldn't check it before.

Thank you for the example you added to the documentation explaining why both endpoints are used. It is very clear! And thanks for sharing your paper too!

I understand you are not sure about the benefit of returning an error rather than an empty list of results. I'm not sure either. Let me explain the situation that gave rise to this suggestion, in case it helps clear things up.

Cita is a Wikidata addon that provides citation metadata support (i.e., which sources a given source cites) for the reference management software Zotero. It gets this information from Wikidata, where citing and cited items are linked via P2860 "cites work" statements. However, to get citations, the QID of the citing item must be known. This is where the Wikidata reconciliation service enters the scene. Cita sends a reconciliation query including unique identifiers (DOI and ISBN, if available) in the […]

A user is working with old books which have long titles, such as Q106923254 with a 348-character title. Because of Wikidata's 250-character limit on labels and aliases, the item's label is a short version of the title, whereas the full title appears as a P1476 "title" statement.

A Cita user (or any Wikidata reconciliation service consumer) may be tempted to submit a query for the full title.

First, the wbsearchentities endpoint would return an empty array (this is what I posted a Phabricator ticket about, because I think it should return an error instead). After your example about why both endpoints are used, I agree it may be OK that this is just ignored by the reconciliation API (after all, it's somewhat similar to the "Lovelace, Ada" example).

Now, the query&list=search endpoint would return an error, because the maximum query length is 300 characters. The reconciliation API currently seems to ignore this error.

As a result, the reconciliation service would return an empty array, and the user may think that an item for that book does not exist in Wikidata and create a duplicate.

Had the user queried a shorter, 300-character version of the title, wbsearchentities would still have returned an empty array, but the query&list=search endpoint would have found this string in the content of the page and returned Q106923254 (actually, query&list=search doesn't seem to be searching P1476 statements, but I think that's a bug).

Sorry if this was too long, and if this issue is irrelevant to other users of the reconciliation API. Feel free to close it if you think that might be the case. It just occurred to me when I found this in Cita that it might be relevant, but again I'm not sure. To work around this in Cita, I may just refuse to reconcile items with titles longer than 250 characters and ask the user to provide an alternative short title.

Thank you for your useful project and for taking the time to read me!
Yes, you are obviously right: the fact that OpenRefine doesn't do error handling properly is no reason to prevent other clients from doing so… So I think we really need an error handling mechanism in the protocol for that. I have opened an issue for it here: reconciliation-api/specs#69. If you have ideas about what syntax we should use for it, feel free to chime in there :)
Thank you for taking care of this, @wetneb! I'm already following the issue you opened :)
The reconciliation query's `query` field "is searched for with both search APIs provided by the Wikibase instance (the auto-complete API and the search API)".

The auto-complete API (wbsearchentities) "searches for entities using labels and aliases". Wikidata labels and aliases seem to be limited to 250 characters (I'm not sure what the limit is in other Wikibase instances). As a result, any query longer than 250 characters would return an empty results array from the Wikidata API's wbsearchentities (I've just posted a task in Phabricator suggesting that it return an error instead, as the query would be nonsense).
On the other hand, the search API (query&list=search) searches page content (including labels and aliases, I understand). This endpoint has a query-length limit of 300 characters. In this case, the endpoint does return an error (instead of an empty results array) if the limit is exceeded, but openrefine-wikibase seems to ignore this error.
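To make the difference between an error and an empty results array concrete, here is a minimal Python sketch that inspects an already-parsed JSON response from the search API. It assumes the default MediaWiki error format, in which a failed request carries a top-level `error` object with `code` and `info` keys. The helper name `classify_search_response` and the sample payloads (including the error code and message) are made up for illustration; none of this is taken from openrefine-wikibase.

```python
def classify_search_response(response):
    """Return ("error", info), ("empty", None) or ("results", titles)."""
    # A failed MediaWiki API request reports the problem under "error".
    error = response.get("error")
    if error is not None:
        return ("error", error.get("info", error.get("code", "unknown error")))
    # Successful list=search responses put their hits under query.search.
    hits = response.get("query", {}).get("search", [])
    if not hits:
        return ("empty", None)
    return ("results", [hit["title"] for hit in hits])


# Hypothetical payloads, for illustration only.
too_long = {"error": {"code": "example-code", "info": "example message"}}
no_match = {"query": {"search": []}}
match = {"query": {"search": [{"title": "Q106923254"}]}}
```

With a check like this, a client could tell "the query was rejected" apart from "nothing matched", rather than treating both as an empty list.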
As a result, reconciliation queries with a `query` field longer than 300 characters will always return an empty results array (as long as the query doesn't fit one of the exceptions to the reconciliation workflow). This may make a user believe that there is no item matching their query, when in reality the query had an error.
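The two limits can be summarised as a small client-side check. This is only a sketch: the constants are the values reported in this issue for Wikidata, the function name `usable_endpoints` is hypothetical, and it assumes that both limits are counted in characters the way Python's `len` counts them, which may not hold for other Wikibase instances.

```python
LABEL_LIMIT = 250   # label/alias length limit on Wikidata, as reported above
SEARCH_LIMIT = 300  # query-length limit of query&list=search, as reported above


def usable_endpoints(query):
    """List the endpoints that could still return a match for this query."""
    endpoints = []
    if len(query) <= LABEL_LIMIT:
        endpoints.append("wbsearchentities")
    if len(query) <= SEARCH_LIMIT:
        endpoints.append("query&list=search")
    return endpoints
```

For a 348-character title such as the one in the example discussed in this thread, the returned list is empty, which a client could treat as "query too long" rather than "no match".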
Would it make sense to either limit the length of the `query` field, or handle the error returned by the search API?

Why is the wbsearchentities endpoint used if the search API searches page content including labels and aliases? Assuming there is a reason (I'm sure there is), would this reason imply that the length of the `query` field should be further limited to 250 characters instead, or that the wbsearchentities error response proposed in my Phabricator ticket should be handled (if ever implemented)?

Thank you!