This repository has been archived by the owner on Nov 10, 2022. It is now read-only.

Reconciliation fails when specifying type work (Q386724) - sometimes giving 502 response #131

Open
Dominic-DallOsto opened this issue Jan 7, 2022 · 10 comments

Comments

@Dominic-DallOsto

Hi. I'm coming from an issue I found here (diegodlh/zotero-cita#149) using the Cita extension in Zotero to connect the citations of scholarly works with Wikidata.

There is more detail in the above issue, but I'll reproduce the important part here:

If I try to find this item, everything works fine if I specify the actual type (scholarly article):

curl -X POST -F 'queries={"q0":{"query":"Parton distributions of the proton","type":"Q13442814"}}' https://wikidata.reconci.link/en/api

Working up the type hierarchy, it also works if I specify article (Q191067), written work (Q47461344), creative work (Q17537576) or intellectual work (Q15621286), but doesn't work if I specify the type work (Q386724).

In this case I get the following error message.

curl -X POST -F 'queries={"q0":{"query":"Parton distributions of the proton","type":"Q386724"}}' https://wikidata.reconci.link/en/api

{"arguments":{"lang":"en","queries":"{\"q0\":{\"query\":\"Parton distributions of the proton\",\"type\":\"Q386724\"}}"},"details":"Invalid control character at: line 5796829 column 74 (char 139001879)","message":"invalid query","status":"error"}

I'm not sure if this is the right place to be reporting this, or if the issue might lie further up the chain. But I was hoping you might have some insight / be able to help investigate further.

Thanks!

@wetneb
Owner

wetneb commented Jan 8, 2022

This could be because the type hierarchy is too big for us to fetch all subclasses of work (Q386724).

@VladimirAlexiev

Hmm, there are only 266? See https://w.wiki/4eyu

@wetneb
Owner

wetneb commented Jan 9, 2022

Those are only the direct subclasses, not the indirect ones. To get the indirect ones you need to replace wdt:P279 by wdt:P279*.
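
The difference between the two property paths can be made concrete with a small query builder. This is only a sketch: the function name `subclass_query` is hypothetical, and the query shape assumes the `wd:` and `wdt:` prefixes that the Wikidata Query Service predefines.

```python
def subclass_query(type_qid: str, transitive: bool = True) -> str:
    """Build a SPARQL query listing subclasses of a Wikidata class.

    wdt:P279 matches direct subclasses only. wdt:P279* also follows
    chains of subclass links, and matches the class itself because
    the * operator allows zero steps.
    """
    path = "wdt:P279*" if transitive else "wdt:P279"
    return f"SELECT ?item WHERE {{ ?item {path} wd:{type_qid} . }}"


# Direct subclasses only, then direct and indirect ones.
print(subclass_query("Q386724", transitive=False))
print(subclass_query("Q386724"))
```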

@Dominic-DallOsto
Author

Dominic-DallOsto commented Jan 9, 2022

Ok, yeah, that seems to correspond to the behaviour I'm getting:

Fetching all subclasses (direct and indirect) of intellectual work returns 41917 results:

?item wdt:P279* wd:Q15621286

while the same query for work times out:

?item wdt:P279* wd:Q386724

This has been working fine for at least the past 10 months though - could it be some extra types were just added so the query times out now?

Is there a way this query could be formulated the other way around, as mentioned in the query optimisation page, so it doesn't time out? Something like getting all items that match the title, then traversing forwards from their type to see whether it is a subclass of work?

@wetneb
Owner

wetneb commented Jan 9, 2022

If there is, I have not found it yet! I think this is one of the weakest points of this service and I personally do not see a way out of this issue without changing its architecture quite dramatically.
That being said if there are ways to mitigate the problem even slightly, I will be very keen to merge any PR going in that direction.

@Dominic-DallOsto
Author

Correct me if I'm wrong, I just had a quick look. But at the moment it looks like you're caching type-subclass lists, then checking those for each item. I guess in general this speeds things up a lot, unless the type has too many subclasses for the query to work?

Would it be possible in some cases (maybe if searching for all subclasses of a type results in a timeout) to directly query whether the item is an instance or subclass of a particular type?

For example from the query optimisation page, this search times out:

ASK {
  wd:Q74430325 wdt:P31/wdt:P279* wd:Q386724.
}

But adding the hint to reverse the traversal order returns a result in 200 ms.

ASK {
  wd:Q74430325 wdt:P31/wdt:P279* wd:Q386724.
  hint:Prior hint:gearing "forward".
}

But I'm new to this so please excuse any naivety!

@Dominic-DallOsto
Author

It looks like subclass type checking of work is a known issue

@wetneb
Owner

wetneb commented Jan 9, 2022

Absolutely, it can make sense to query for membership on a per instance basis, or at least on a per direct type basis (because we already fetch the P31 values outside SPARQL).

This does mean making one SPARQL request per reconciliation query, which is likely to slow down query resolution quite a lot in general (and potentially be a problem for the SPARQL endpoint itself?). So I would be cautious about doing that for any type, but it could be a sensible fallback for types where the initial subclass fetching fails.
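
The fallback described here could look roughly like the following sketch. Everything in it is hypothetical: `fetch_subclasses` and `ask` stand in for whatever functions actually talk to the SPARQL endpoint, and `SubclassFetchError` is an invented exception; none of these names come from the service's code base. The `hint:Prior hint:gearing "forward"` line is the Blazegraph-specific hint from the query optimisation page, and whether it helps for a given pair of types is an assumption.

```python
from typing import Callable, Iterable, Set


class SubclassFetchError(Exception):
    """Hypothetical error: the full subclass list could not be fetched."""


def membership_ask(direct_type: str, target_type: str) -> str:
    """Build an ASK query: is direct_type the target or one of its subclasses?"""
    return (
        "ASK {\n"
        f"  wd:{direct_type} wdt:P279* wd:{target_type}.\n"
        '  hint:Prior hint:gearing "forward".\n'
        "}"
    )


def matches_type(
    direct_types: Iterable[str],
    target_type: str,
    fetch_subclasses: Callable[[str], Set[str]],
    ask: Callable[[str], bool],
) -> bool:
    """Check whether any of an item's P31 values falls under target_type."""
    direct_types = list(direct_types)
    try:
        # Usual path: one cacheable subclass list per target type
        # (assumed to include the target type itself).
        subclasses = fetch_subclasses(target_type)
    except SubclassFetchError:
        # Fallback: one ASK request per direct type, stopping at the first hit.
        return any(ask(membership_ask(t, target_type)) for t in direct_types)
    return any(t in subclasses for t in direct_types)
```

Because the P31 values are already known, the fallback costs at most one ASK request per direct type rather than one per candidate item, and results for a given pair of types could themselves be cached.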

In general, as the type hierarchy grows and gets messier, there is no chance we can do this on the fly I think. There could be:

  • limits on the depth / breadth / size of type hierarchies, that we would announce clearly in the service's documentation, and people could check if the types they care about fit those limits. If they do not, it could encourage more work on cleaning up the ontology and optimizing it further.
  • a dedicated service to do type filtering in Wikidata. For instance, a service which could answer queries about the type hierarchy very quickly and reliably (for instance a SPARQL endpoint which would only ingest wdt:P279 links). Or, for each Wikidata item, a cached list of all the superclasses it has (recomputed every time the item is edited, for instance).
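
Both ideas in this list can be illustrated on a toy graph. The sketch below assumes the `wdt:P279` links are already available locally as a plain dictionary (the data is made up and uses letters rather than real item IDs), and computes the superclasses of a class with a breadth-first walk that gives up once a hypothetical size limit is exceeded.

```python
from collections import deque
from typing import Dict, List, Optional, Set


class HierarchyTooLarge(Exception):
    """Hypothetical error raised when the walk exceeds the size limit."""


def superclasses(
    start: str,
    parents: Dict[str, List[str]],
    max_size: Optional[int] = None,
) -> Set[str]:
    """Return start plus every class reachable through subclass-of links."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent in seen:
                continue  # already visited; this also makes cycles harmless
            seen.add(parent)
            if max_size is not None and len(seen) > max_size:
                raise HierarchyTooLarge(
                    f"more than {max_size} classes at or above {start}"
                )
            queue.append(parent)
    return seen


# Toy subclass-of edges (child -> direct parents), including a cycle D -> A.
TOY_P279 = {"A": ["B", "D"], "B": ["C"], "D": ["A"]}

print(sorted(superclasses("A", TOY_P279)))
```

A per-item cache of superclasses would store the result of such a walk for each class and recompute it when the underlying links change; the size limit corresponds to the documented limits in the first bullet.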

@Dominic-DallOsto
Author

Yeah, that makes sense!

As an immediate remedy in the direction of point 1, would it be possible to catch this error in particular and provide a more descriptive error like "query timed out checking types - please choose a more restrictive type"?
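
One way to surface such a message, sketched under assumptions: the wrapper below treats a timeout or an unparsable response from a hypothetical `fetch_subclasses` callable as a failed type lookup and re-raises it as a dedicated error whose text tells the user what to do. The class and function names are invented for illustration; the service's real error handling may be organised quite differently.

```python
import json
from typing import Callable, Set


class TypeHierarchyError(Exception):
    """Hypothetical error for a type whose subclasses could not be fetched."""


def fetch_subclasses_or_explain(
    type_qid: str, fetch_subclasses: Callable[[str], Set[str]]
) -> Set[str]:
    """Call fetch_subclasses and turn low-level failures into a clear message."""
    try:
        return fetch_subclasses(type_qid)
    except (TimeoutError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError, so a truncated
        # or malformed response body is caught here as well.
        raise TypeHierarchyError(
            f"could not fetch the subclasses of {type_qid} "
            "(the query may have timed out) - "
            "please choose a more restrictive type"
        ) from exc


def broken_fetch(type_qid: str) -> Set[str]:
    # Simulates a response body that contains a raw control character.
    return set(json.loads('["Q1\x01"]'))


try:
    fetch_subclasses_or_explain("Q386724", broken_fetch)
except TypeHierarchyError as err:
    print(err)
```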

As something of a middle ground between the current approach and type checking each item, according to this issue checking whether a particular type is a subclass of another is quite fast. Would it make sense to build the cache this way ("is type A a subclass of type B") instead of "what are all the subclasses of type B"?

@wetneb
Owner

wetneb commented Jan 10, 2022

> As something of a middle ground between the current approach and type checking each item, according to this issue checking whether a particular type is a subclass of another is quite fast. Would it make sense to build the cache this way ("is type A a subclass of type B") instead of "what are all the subclasses of type B"?

Yes, ideally this would be used as a fallback for the cases where there are too many subclasses.
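
A pairwise cache of this kind could be sketched as follows. The `ask_is_subclass` callable is a placeholder for a function that sends an ASK query to the SPARQL endpoint; it and the class name are assumptions, not part of the service.

```python
from typing import Callable, Dict, Tuple


class PairwiseSubclassCache:
    """Remember answers to "is type A a subclass of type B?"."""

    def __init__(self, ask_is_subclass: Callable[[str, str], bool]) -> None:
        self._ask = ask_is_subclass
        self._cache: Dict[Tuple[str, str], bool] = {}

    def is_subclass(self, type_a: str, type_b: str) -> bool:
        key = (type_a, type_b)
        if key not in self._cache:
            # Only the first lookup for a given pair reaches the endpoint.
            self._cache[key] = self._ask(type_a, type_b)
        return self._cache[key]
```

The cache then grows with the number of distinct type pairs actually seen, not with the size of the hierarchy under the target type. This sketch never expires entries, so a real version would need some invalidation policy as the ontology changes.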
