Augment Wikidata graph with ranking information better than the sitelinks count #676

Open
tuukka opened this issue May 29, 2022 · 5 comments

Comments

@tuukka

tuukka commented May 29, 2022

Would it be easy to add some ranking information (triples) to the Wikidata endpoint? This has been discussed for years elsewhere (T143424, T174981), but I'm not aware of a query endpoint that provides it yet. Here are two open-source rankings that I could find:

QRank (pageviews): https://qrank.wmcloud.org/
Danker (PageRank): https://danker.s3.amazonaws.com/index.html
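
A minimal sketch of how such a ranking could be turned into triples, assuming the ranking is available as a CSV with `Entity` and `QRank` columns (an assumption about the QRank download, not confirmed in this thread). The predicate IRI is hypothetical and the sample values are made up for illustration:

```python
import csv
import io

# Hypothetical predicate: neither Wikidata nor QLever defines this IRI.
QRANK_PREDICATE = "<http://example.org/qrank>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"


def qrank_rows_to_triples(csv_text: str):
    """Yield one N-Triples line per (entity, rank) row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        qid = row["Entity"]        # assumed column name
        rank = int(row["QRank"])   # assumed column name
        yield (
            f"<http://www.wikidata.org/entity/{qid}> {QRANK_PREDICATE} "
            f'"{rank}"^^{XSD_INTEGER} .'
        )


# Made-up sample rows, not real ranking values.
sample = "Entity,QRank\nQ42,12345\nQ64,678\n"
triples = list(qrank_rows_to_triples(sample))
for line in triples:
    print(line)
```

The resulting lines could be appended to the input of an index build, so that the ranking becomes queryable next to the regular Wikidata triples.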

@hannahbast
Member

hannahbast commented May 29, 2022

Adding triples for ranking would be rather easy, but I have a question:

We always use ^schema:about/wikibase:sitelinks for ranking. This counts the number of Wikimedia pages of an entity and is a very good proxy for popularity (and a much better proxy than, for example, the number of triples an entity is involved in). For example, here is a list of all people in Wikidata ranked by the number of sitelinks: https://qlever.cs.uni-freiburg.de/wikidata/kfJfrG

Have you tried ^schema:about/wikibase:sitelinks or is there anything that you don't like about it?
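
For illustration, a ranking query built around this property path might look like the sketch below. It is not the query behind the link above; the function name, the default class (`Q5`, human) and the limit are assumptions made for the example:

```python
import re

PREFIXES = """\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
"""


def sitelinks_ranking_query(class_qid: str = "Q5", limit: int = 100) -> str:
    """Build a SPARQL query that ranks instances of a class by sitelinks count."""
    # Reject anything that is not a plain item ID, to keep the query well-formed.
    if not re.fullmatch(r"Q[1-9][0-9]*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return PREFIXES + f"""\
SELECT ?item ?sitelinks WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  ?item ^schema:about/wikibase:sitelinks ?sitelinks .
}}
ORDER BY DESC(?sitelinks)
LIMIT {int(limit)}
"""


query = sitelinks_ranking_query()
print(query)
```

The path `^schema:about/wikibase:sitelinks` first steps backwards along `schema:about` to the node that describes the item, then reads the count stored on that node.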

@tuukka
Author

tuukka commented May 29, 2022

I am using the sitelinks count but I see it as just one metric:

  • sitelinks measures how "global" the notability and interest towards a topic is among Wikimedia contributors
  • pageviews measures how many readers a topic has among the general public
  • PageRank measures the "centrality" and connectedness of the topic in the Wikimedia graph

My current use case is reimplementing wikitrivia-generator, which is currently heavy and slow:

  • First it needs a full Wikidata dump (more or less solved with a large Wikidata query in QLever).
  • Then it makes pageviews API calls one-by-one, which takes days.

See more on the pain here: tom-james-watson/wikitrivia#26 (comment)

@hannahbast
Member

@tuukka Do you have a demo of what the wikitrivia-generator does? Without yet fully understanding what you want, a viable approach might be:

  1. Get the appropriate subset from Wikidata via a CONSTRUCT query
  2. Build a QLever instance for that subset
  3. Ask queries to that instance

Don't be afraid of building and running a QLever instance. Using the qlever script, it's as simple as this in a directory with a TTL file (which could be obtained via a CONSTRUCT query):

. qlever      # Configure
qlever index  # Build index
qlever start  # Start the server
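
Step 1 above could be sketched roughly as follows. This is only an illustration under assumptions: the choice of what to keep (class membership, labels, sitelinks counts) and the function name are hypothetical, and a real subset for the generator would likely need more properties:

```python
import re


def construct_subset_query(class_qid: str) -> str:
    """Build a CONSTRUCT query that extracts a small subgraph for one class."""
    if not re.fullmatch(r"Q[1-9][0-9]*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    # The same triple patterns serve as both the template and the WHERE clause,
    # so the output contains exactly the triples that matched.
    patterns = f"""\
  ?item wdt:P31 wd:{class_qid} .
  ?item rdfs:label ?label .
  ?node schema:about ?item .
  ?node wikibase:sitelinks ?sitelinks .
"""
    return f"""\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
CONSTRUCT {{
{patterns}}} WHERE {{
{patterns}}}
"""


subset_query = construct_subset_query("Q5")
print(subset_query)
```

The result of such a query, saved as a TTL file, would be the input for the index build in step 2.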

@tuukka
Author

tuukka commented May 29, 2022

Here's the original game: https://wikitrivia.tomjwatson.com/

Here's the game data file as produced by wikitrivia-generator (in English, with items that were once generated and never updated, as it's too much hassle): https://wikitrivia-data.tomjwatson.com/items.json

So far, some people seem to have been able to fork the script and run it in their own language with more or less success: Basque, Romanian.

Ideally, the player could pick any language supported by Wikidata, and the game would make a suitable SPARQL query to get a fresh set of up-to-date items for that language, with no other backend infrastructure needed.
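
As a rough sketch of that idea, a client could build the query from the chosen language code and send it with a plain HTTP GET, as the SPARQL protocol allows. Everything named here is an assumption for illustration: the function names, the selected variables, the limit, and the endpoint URL passed in the usage lines are not taken from the game or from this thread:

```python
import re
from urllib.parse import urlencode


def trivia_items_query(lang: str, limit: int = 1000) -> str:
    """Build a query for labelled items in one language, most sitelinks first."""
    # Accept only simple language codes such as "en", "eu" or "zh-hant".
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?item ?label ?sitelinks WHERE {{
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "{lang}")
  ?item ^schema:about/wikibase:sitelinks ?sitelinks .
}}
ORDER BY DESC(?sitelinks)
LIMIT {int(limit)}
"""


def request_url(endpoint: str, query: str) -> str:
    """Encode the query as the `query` parameter of a SPARQL protocol GET."""
    return endpoint + "?" + urlencode({"query": query})


# The endpoint URL below is an assumption; substitute the one you actually use.
url = request_url("https://qlever.cs.uni-freiburg.de/api/wikidata",
                  trivia_items_query("eu"))
print(url)
```

Validating the language code before interpolating it keeps user input from altering the structure of the query.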

You are right, it is also possible to implement this query without using the official QLever instance for now, and this issue could be tagged wishlist :-)

@hannahbast
Member

Thanks for the explanation, now I understand. For this kind of application, querying a Wikidata SPARQL endpoint from time to time seems to be the method of choice.

But then isn't a query like https://qlever.cs.uni-freiburg.de/wikidata/m76Lrg exactly what you need? It works for any language and takes 20 to 30 seconds.
