Investigate using SPARQL to source cards instead of scraping dumps #6
By using SPARQL, we can more easily create data sets for specific subsets of things like "Books", "Battles" or "TV Shows". We would still want an "All" collection though, which would work similarly to the current version. To be more specific on what would be needed here, I'd need to explain a rough outline of how the current processing works.
In order to use SPARQL, we would need not only to get enough items to populate the game, but also to filter on some heuristic that ensures we only generate interesting cards that can reasonably be answered. A great vector for that is Wikipedia page views, but there may be better ways of doing this. The current game has a list of around 10,000 cards, to give an idea of how many would be needed.

It's also worth noting that the kinds of cards that get generated could be improved a lot. For example, you may have a card for Woodrow Wilson, who is well known enough to be included, but the card would ask when he was born, which is much more difficult. A better card would ask when he became president. Being able to detect what is interesting about an entry and programmatically generating a card from that would be great, though I imagine difficult. What I suspect may make more sense would be to have many different SPARQL queries that each compose data around things like the example above (in that case, a list of when famous world leaders came to power) and then to join all those datasets together. It would also be possible to pass the results of a SPARQL query through another processing step that checks Wikipedia page views, should we feel that the Wikidata sitelinks alone don't provide a good enough signal for how well known an item is.

Some initial queries I was playing around with:
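The original queries aren't preserved above, but to illustrate the kind of query being discussed, here is a hypothetical Python sketch combining the two ideas from this comment: the "when did they take office" framing from the Woodrow Wilson example, and sitelink counts as a rough popularity filter. The sitelink threshold, the position ID, and the function name are all illustrative assumptions, not taken from the project.

```python
# Illustrative sketch only: builds a Wikidata SPARQL query for people who
# held the position "President of the United States" (wd:Q11696), with the
# P580 (start time) qualifier as the card's date and a minimum sitelink
# count as a crude "well known enough" filter. MIN_SITELINKS is a guess.
MIN_SITELINKS = 25

def leaders_query(min_sitelinks=MIN_SITELINKS, limit=500):
    """Build a query for US presidents and the year they took office."""
    return f"""
    SELECT ?person ?personLabel ?start ?sitelinks WHERE {{
      ?person p:P39 ?stmt .              # P39 = position held
      ?stmt ps:P39 wd:Q11696 ;           # President of the United States
            pq:P580 ?start .             # P580 = start time qualifier
      ?person wikibase:sitelinks ?sitelinks .
      FILTER(?sitelinks >= {min_sitelinks})
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

query = leaders_query()
```

Each "famous world leaders", "famous battles", etc. dataset would come from its own query shaped like this, with the results joined afterwards.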
Your query examples work fine :) So what was it that actually did not work?
My problem is basically getting enough queries that return enough interesting results to actually fill the game with at least ten thousand cards. Maybe what I should ask is for people to contribute SPARQL queries that return interesting results? Here is what I would need in the query results:
Maybe somebody can come up with a reusable snippet for providing those things and then people can concentrate on providing interesting queries? I've made a discussion here where we can start collating queries: #8.
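The exact list of required result fields isn't preserved above, but based on the surrounding discussion (a label, a date, and a popularity signal), a reusable post-processing snippet might look like the sketch below. The variable names (`personLabel`, `start`, `sitelinks`) are assumptions matching the Wikidata Query Service JSON results format, not a spec from the project.

```python
def bindings_to_cards(results, label_var="personLabel", date_var="start"):
    """Turn Wikidata SPARQL JSON results into simple card dicts.

    Hypothetical helper: the field names and card shape are assumptions
    based on the discussion, not the project's actual schema.
    """
    cards = []
    for row in results["results"]["bindings"]:
        cards.append({
            "label": row[label_var]["value"],
            "date": row[date_var]["value"][:4],   # keep just the year
            "sitelinks": int(row["sitelinks"]["value"]),
        })
    return cards
```

With a shared helper like this, contributed queries would only need to agree on the variable names they SELECT.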
You also have the option of querying Wikidata as deployed by other SPARQL Query Service providers, e.g. the instance we host using our Virtuoso Platform.
Potentially useful resource: https://www.wikidata.org/wiki/Wikidata:Request_a_query |
From tom-james-watson/wikitrivia#26:
That's great, nice one! I think with that as a base it would be possible to formulate more complex, specific queries that we could then add together to build up an interesting set. What would be good to avoid is too many "boring" things. I think more interesting things to see are things like:
I think that's why it may be easier to instead stitch together multiple more "niche" queries, and therefore avoid having too many results that are just boring.
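The stitching step itself is straightforward: run each niche query separately, then merge the results while dropping items that appear in more than one query. A minimal sketch, assuming cards carry their Wikidata entity URI as an `id` field (an assumption, since the card schema isn't shown here):

```python
def stitch(card_sets):
    """Merge cards from several niche queries into one deck.

    Deduplicates by the card's "id" field (assumed to be the Wikidata
    entity URI), keeping the first occurrence and preserving order.
    """
    seen, merged = set(), []
    for cards in card_sets:
        for card in cards:
            if card["id"] not in seen:
                seen.add(card["id"])
                merged.append(card)
    return merged
```

Keeping the first occurrence means earlier, more specific queries can "claim" an item before a broader catch-all query does.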
Hi Tom, huge fan of your game! I wrote some Python code to generate cards from the Wikidata Query Service API, focusing solely on events. I've attached the code and the output it currently generates to this message. It creates about 10,000 cards (but I've set a low threshold for the minimum acceptable number of site links per item). Currently, the code works by sending the API separate queries for individual event categories and combining the outputs. I can modify the code to get different sets of cards, such as discoveries and inventions or creative works as you suggest above. I generated the list of categories using this QLever query to find the most common categories of Wikidata items with the start date (P580) property. Here is the query I used. Do you have any suggestions for ways I could improve my code?
I had trouble when I tried using SPARQL, which was actually my first approach to tackling this. Any queries that would be broad enough and return enough data would just time out. Maybe people in the community with more experience with SPARQL can help though!
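One common workaround for those timeouts is to never send one broad query at all, and instead page a query body with LIMIT/OFFSET so each request stays small. A minimal sketch (function name and page size are illustrative):

```python
def paged(query_body, page_size=1000, pages=3):
    """Yield LIMIT/OFFSET variants of a broad query so that each
    individual request is small enough to finish before the endpoint's
    timeout. The caller runs each page and concatenates the results."""
    for i in range(pages):
        yield f"{query_body}\nLIMIT {page_size}\nOFFSET {i * page_size}"
```

One caveat: on the public Wikidata endpoint, OFFSET paging without an ORDER BY is not guaranteed to be stable, and large offsets can themselves get slow, which is why splitting into many per-category queries (as in the comment above) is often the more reliable approach.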