Reconciliation fails when specifying type work (Q386724) - sometimes giving 502 response #131

Dominic-DallOsto · 2022-01-07T22:33:15Z

Hi. I'm coming from an issue I found here (diegodlh/zotero-cita#149) using the Cita extension in Zotero to connect the citations of scholarly works with Wikidata.

There is more detail in the above issue, but I'll reproduce the important part here:

If I try to find this item, everything works fine if I specify the actual type (scholarly article):

curl -X POST -F 'queries={"q0":{"query":"Parton distributions of the proton","type":"Q13442814"}}' https://wikidata.reconci.link/en/api

Working up the type hierarchy, it also works if I specify article (Q191067), written work (Q47461344), creative work (Q17537576) or intellectual work (Q15621286), but doesn't work if I specify the type work (Q386724).

In this case I get the following error message.

curl -X POST -F 'queries={"q0":{"query":"Parton distributions of the proton","type":"Q386724"}}' https://wikidata.reconci.link/en/api

{"arguments":{"lang":"en","queries":"{\"q0\":{\"query\":\"Parton distributions of the proton\",\"type\":\"Q386724\"}}"},"details":"Invalid control character at: line 5796829 column 74 (char 139001879)","message":"invalid query","status":"error"}

I'm not sure if this is the right place to be reporting this, or if the issue might lie further up the chain. But I was hoping you might have some insight / be able to help investigate further.

Thanks!

wetneb · 2022-01-08T10:59:04Z

This could be because the type hierarchy is too big for us to fetch all subclasses of work (Q386724).

VladimirAlexiev · 2022-01-09T08:58:09Z

Hmm, there are only 266? See https://w.wiki/4eyu

wetneb · 2022-01-09T12:42:40Z

Those are only the direct subclasses, not the indirect ones. To get the indirect ones you need to replace wdt:P279 by wdt:P279*.

Dominic-DallOsto · 2022-01-09T14:07:49Z

Ok, yeah, that seems to correspond to the behaviour I'm getting:

All subchildren of intellectual work returns 41917 results

?item wdt:P279* wd:Q15621286

While all subchildren of work times out

?item wdt:P279* wd:Q386724

This has been working fine for at least the past 10 months though - could it be some extra types were just added so the query times out now?

Is there a way this query could be formulated the other way around, as mentioned in the query optimisation page, so it doesn't time out? Something like getting all items that match the title, then traversing forwards from their type to see whether it is a subclass of work?

wetneb · 2022-01-09T14:23:13Z

If there is, I have not found it yet! I think this is one of the weakest points of this service and I personally do not see a way out of this issue without changing its architecture quite dramatically.
That being said if there are ways to mitigate the problem even slightly, I will be very keen to merge any PR going in that direction.

Dominic-DallOsto · 2022-01-09T14:49:29Z

Correct me if I'm wrong, I just had a quick look. But at the moment it looks like you're caching type-subclass lists, then checking those for each item. I guess in general this speeds things up a lot, unless the type has too many subclasses for the query to work?

Would it be possible in some cases (maybe if searching for all subclasses of a type results in a timeout) to directly query whether the item is an instance or subclass of a particular type?

For example from the query optimisation page, this search times out:

ASK {
  wd:Q74430325 wdt:P31/wdt:P279* wd:Q386724.
}

But adding the hint to reverse the traverse order returns a result in 200 ms.

ASK {
  wd:Q74430325 wdt:P31/wdt:P279* wd:Q386724.
  hint:Prior hint:gearing "forward".
}

But I'm new to this so please excuse any naivety!
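A minimal sketch of how such a per-item check could be generated, in Python. It assumes the hint in question is Blazegraph's `hint:Prior hint:gearing "forward"` from the query optimisation page; the function name `build_membership_ask` is hypothetical, and the sketch only builds the query text without sending it anywhere.

```python
import re


def build_membership_ask(item_qid: str, type_qid: str, forward_hint: bool = True) -> str:
    """Build an ASK query testing whether an item is an instance of a type
    or of any of its (indirect) subclasses."""
    for qid in (item_qid, type_qid):
        # Only accept plain item ids so nothing else can leak into the query text.
        if not re.fullmatch(r"Q[1-9][0-9]*", qid):
            raise ValueError(f"not a Wikidata item id: {qid!r}")
    lines = [
        "ASK {",
        f"  wd:{item_qid} wdt:P31/wdt:P279* wd:{type_qid}.",
    ]
    if forward_hint:
        # Assumed hint: start from the known item and walk up the hierarchy.
        lines.append('  hint:Prior hint:gearing "forward".')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_membership_ask("Q74430325", "Q386724"))
```

Starting from the item keeps the traversal bounded by the item's own ancestors rather than by the full set of subclasses of work.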

Dominic-DallOsto · 2022-01-09T15:05:39Z

It looks like subclass type checking of work is a known issue

wetneb · 2022-01-09T15:34:58Z

Absolutely, it can make sense to query for membership on a per instance basis, or at least on a per direct type basis (because we already fetch the P31 values outside SPARQL).

This does mean making one SPARQL request per reconciliation query, which is likely to slow down query resolution quite a lot in general (and potentially be a problem for the SPARQL endpoint itself?). So I would be cautious about doing that for any type, but it could be a sensible fallback for types where the initial subclass fetching fails.
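Roughly, the fallback described here could look like the following Python sketch. All names (`matches_type`, `SubclassFetchError`, and the two injected callables) are hypothetical and not taken from the service's code; the callables stand in for the existing bulk subclass fetch and for a per-type SPARQL check.

```python
from typing import Callable, Iterable, Set


class SubclassFetchError(Exception):
    """Raised by the (hypothetical) bulk fetcher when the subclass query fails or times out."""


def matches_type(
    direct_types: Iterable[str],
    target_type: str,
    fetch_subclasses: Callable[[str], Set[str]],
    is_subclass_of: Callable[[str, str], bool],
) -> bool:
    """Return True if any P31 value of the item is the target type or a subclass of it."""
    direct_types = list(direct_types)
    try:
        # Fast path: one bulk query per target type, whose result can be cached.
        subclasses = fetch_subclasses(target_type)
    except SubclassFetchError:
        # Fallback: one small query per direct type, used only when the bulk fetch fails.
        return any(
            t == target_type or is_subclass_of(t, target_type) for t in direct_types
        )
    return any(t == target_type or t in subclasses for t in direct_types)
```

Because the per-type check is only reached after a failed bulk fetch, the extra SPARQL requests are limited to types such as work where the current approach already breaks.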

In general, as the type hierarchy grows and gets messier, there is no chance we can do this on the fly I think. There could be:

1. limits on the depth / breadth / size of type hierarchies, that we would announce clearly in the service's documentation, and people could check if the types they care about fit those limits. If they do not, it could encourage more work on cleaning up the ontology and optimizing it further.
2. a dedicated service to do type filtering in Wikidata. For instance, a service which could reply very quickly and reliably to queries about the type hierarchy (for instance a SPARQL endpoint which would only ingest wdt:P279 links). Or for each Wikidata item, a cached list of all the superclasses it has (recomputed every time the item is edited, for instance).
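To illustrate the last idea, here is a small Python sketch that computes the full superclass list for an item from a table holding only direct P279 links. The function name and the in-memory table are assumptions for illustration; the sample links follow the chain named earlier in the thread and are not verified against Wikidata.

```python
from collections import deque
from typing import Dict, Iterable, Set


def all_superclasses(direct_types: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given types plus everything reachable from them over direct P279 links."""
    seen: Set[str] = set(direct_types)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cycles in the hierarchy
                seen.add(parent)
                queue.append(parent)
    return seen


# Illustrative direct subclass links only (not verified against Wikidata).
PARENTS = {
    "Q13442814": {"Q191067"},   # scholarly article -> article
    "Q191067": {"Q47461344"},   # article -> written work
    "Q47461344": {"Q17537576"}, # written work -> creative work
    "Q17537576": {"Q15621286"}, # creative work -> intellectual work
    "Q15621286": {"Q386724"},   # intellectual work -> work
}

if __name__ == "__main__":
    print("Q386724" in all_superclasses(["Q13442814"], PARENTS))
```

With such a list cached per item, a type filter becomes a set membership test, whatever the number of subclasses the target type has.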

Dominic-DallOsto · 2022-01-10T08:53:43Z

Yeah, that makes sense!

As an immediate remedy in the direction of point 1, would it be possible to catch this error in particular and provide a more descriptive error like "query timed out checking types - please choose a more restrictive type"?
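A rough Python sketch of what that could look like. It assumes the "Invalid control character" detail comes from decoding a truncated or timed-out SPARQL response, which the thread does not confirm; the names `TypeCheckTimeout`, `parse_subclass_response` and `error_payload`, and the `item` variable name in the results, are hypothetical.

```python
import json


class TypeCheckTimeout(Exception):
    """Signals that the subclass list for a type could not be retrieved."""

    def __init__(self, type_qid: str):
        super().__init__(
            f"query timed out checking types for {type_qid} - "
            "please choose a more restrictive type"
        )
        self.type_qid = type_qid


def parse_subclass_response(body: str, type_qid: str) -> set:
    """Extract item ids from a SPARQL JSON result, assuming a variable named ?item."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        # A body that is not valid JSON is treated as a failed type lookup.
        raise TypeCheckTimeout(type_qid) from exc
    bindings = data["results"]["bindings"]
    return {b["item"]["value"].rsplit("/", 1)[-1] for b in bindings}


def error_payload(exc: Exception) -> dict:
    """Shape the error like the service's existing error responses."""
    return {"status": "error", "message": str(exc)}
```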

As something of a middle ground between the current approach and type checking each item, according to this issue checking whether a particular type is a subclass of another is quite fast. Would it make sense to build the cache this way ("is type A a subclass of type B") instead of "what are all the subclasses of type B"?

wetneb · 2022-01-10T14:23:20Z

> As something of a middle ground between the current approach and type checking each item, according to this issue checking whether a particular type is a subclass of another is quite fast. Would it make sense to build the cache this way ("is type A a subclass of type B") instead of "what are all the subclasses of type B"?

Yes, ideally this would be used as a fallback for the cases where there are too many subclasses.
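As a sketch of what such a pairwise cache might look like in Python: the class name `PairwiseSubclassCache` and the injected `ask` callable (standing in for a SPARQL ASK request) are hypothetical, and a real version would also need some expiry, since the type hierarchy changes over time.

```python
from typing import Callable, Dict, Tuple


class PairwiseSubclassCache:
    """Remembers answers to "is type A a subclass of type B?" one pair at a time."""

    def __init__(self, ask: Callable[[str, str], bool]):
        self._ask = ask  # performs the actual lookup for an uncached pair
        self._known: Dict[Tuple[str, str], bool] = {}

    def is_subclass(self, child: str, ancestor: str) -> bool:
        if child == ancestor:
            return True  # a type trivially matches itself, no lookup needed
        key = (child, ancestor)
        if key not in self._known:
            self._known[key] = self._ask(child, ancestor)
        return self._known[key]
```

Each distinct (type, target) pair costs one lookup, so the number of requests grows with the types actually seen in candidates rather than with the size of the target's hierarchy.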
