Translations #26

Open
airon90 opened this issue Jan 20, 2022 · 10 comments

Comments

airon90 commented Jan 20, 2022

Can you support other languages? You can get the correct labels from Wikidata.

Mte90 commented Feb 9, 2022

Yes, it would help a lot. Since you are already running queries, I guess a language picker that changes something in the endpoint the game uses would be enough.

Contributor

tuukka commented Apr 5, 2022

As all the game data is currently loaded from a single file at start, I think the best approach might be to provide language-specific versions of this file.

Approach 0: Instead of having a language-specific file, fetch the data of the Wikidata item each time a card is shown to see if Wikidata (at the moment) contains the desired translations. I'm not sure which endpoints can be accessed directly by the game in the browser, but e.g. these would seem to work: https://www.wikidata.org/wiki/Special:EntityData/Q42.json and https://query.wikidata.org/bigdata/ldf?subject=wd:Q42
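
A minimal sketch of the lookup in Approach 0, assuming the response follows the standard Wikidata entity JSON layout (`entities`, then the item ID, then `labels` / `descriptions` / `sitelinks`). The function names are hypothetical, and the sketch is in Python for brevity even though the game would do this in the browser. Whether the endpoint can be called cross-origin from the game is not checked here.

```python
from typing import Optional


def entity_data_url(item_id: str) -> str:
    """Build the Special:EntityData URL for a Wikidata item such as 'Q42'."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{item_id}.json"


def extract_translation(entity_data: dict, item_id: str, lang: str) -> Optional[dict]:
    """Return label, description and Wikipedia title in `lang`.

    Returns None if any of the three is missing, so the caller can fall
    back to the original (English) card text.
    """
    entity = entity_data.get("entities", {}).get(item_id, {})
    label = entity.get("labels", {}).get(lang, {}).get("value")
    description = entity.get("descriptions", {}).get(lang, {}).get("value")
    # Assumption: simple language codes map to "<lang>wiki" (e.g. "rowiki").
    title = entity.get("sitelinks", {}).get(f"{lang}wiki", {}).get("title")
    if not (label and description and title):
        return None
    return {"label": label, "description": description, "wikipedia_title": title}
```

The parsing is kept separate from the network call so it can be tried against a saved copy of the JSON.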

Approach 1: For each card (Wikidata item) in the original data file, replace the original label, description and Wikipedia article title (in English) by ones in the desired language from the same Wikidata item. However, they might not be available or they might be unsuitable (contain the answer or have a mistake).
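
Approach 1 could look roughly like the sketch below. The card field names (`label`, `description`, `wikipedia_title`, `year`) are assumptions about the data file, and `localize_card` is a hypothetical helper. It returns `None` when a translation is missing or when the translated text gives away the year, which covers only one of the "unsuitable" cases; mistakes in the translation would still need a human check.

```python
from typing import Optional

TRANSLATED_FIELDS = ("label", "description", "wikipedia_title")


def localize_card(card: dict, translation: Optional[dict]) -> Optional[dict]:
    """Return a copy of `card` with translated text, or None if unusable."""
    if translation is None:
        return None
    if any(not translation.get(key) for key in TRANSLATED_FIELDS):
        return None
    # Rough spoiler check: reject text that contains the card's year.
    # abs() handles negative (BCE) years; short years will over-match.
    year = str(abs(card["year"]))
    if year in translation["label"] or year in translation["description"]:
        return None
    localized = dict(card)  # shallow copy, the original card is not modified
    for key in TRANSLATED_FIELDS:
        localized[key] = translation[key]
    return localized
```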

Approach 2: Generate a new set of cards appropriate for the desired language, e.g. by tweaking https://github.com/tom-james-watson/wikitrivia-generator.

EDIT: Approach 3: Generate a new set of cards dynamically from the frontend by calling a suitable SPARQL endpoint such as QLever: https://qlever.cs.uni-freiburg.de/wikidata/

nicolaes commented May 24, 2022

I like Approach 2 the most. As I see them, Approaches 0 and 1 are:

  • Pro: long-term and low-maintenance
  • Con: may hinder quick-fix tweaks in the database

I'll try Approach 2 in Romanian to see how it goes.

Edit: I take back liking Approach 2 after seeing the 73 GB data source. I will still give it a try, but I don't have high hopes.

Contributor

tuukka commented May 24, 2022

@nicolaes 👍 Perhaps together we can find the people needed to make this happen. To make Approach 2 easier, I found some initial discussion on reimplementing it based on queries against a SPARQL endpoint. In my experience, the official SPARQL endpoint does not have the performance needed, but QLever (and/or Virtuoso) might be able to answer all the queries we need. Here's a quick test that finds about 9000 results that might be suitable for Romanian cards: https://qlever.cs.uni-freiburg.de/wikidata/30kMrq?exec=true

See also: tom-james-watson/wikitrivia-generator#6 and tom-james-watson/wikitrivia-generator#8

nicolaes commented May 25, 2022

@tuukka Thanks for the idea. I appreciate the effort to put together the Romanian version.
The quick test of 9000 entries is very relevant; the current English database has 10k entries.

I don't know SPARQL, so I am playing around with the link you provided.
My plan is to find a reasonably fast query that provides at least 5000 results, then put it together with the wikitrivia app.

@nicolaes

I gave QLever a few tries, then I dropped it.
I ran a query with all year types (created, discovered, invented, born etc.) and lost the backend connection, probably because the query was not optimized. Here is the code: https://qlever.cs.uni-freiburg.de/wikidata/aFFkcp

I made progress on processing the raw data source, and now have ~1000 usable entries for Romanian.
I'm not yet sure if Approaches 0 and 1 are viable, but it might be worth trying them out.
My steps to get the Romanian entities were:

  • downloading the wiki data (73 GB)
  • parsing it with wikibase-dump-filter: 150k entries in 9 h (should be faster for more popular languages)
  • adapting the wikitrivia-generator parser (translate filter words, change en to ro, adjust viewcounts): 250 entries / hour
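
The language filtering in the steps above can be sketched in plain Python, assuming the download is the usual Wikidata JSON dump: one large array with a single entity per line. This is not how wikibase-dump-filter is implemented, and `keep_entities_for_language` is a hypothetical name; it only illustrates what changing `en` to `ro` means for the label and sitelink checks.

```python
import json
from typing import Iterable, Iterator


def keep_entities_for_language(lines: Iterable[str], lang: str) -> Iterator[dict]:
    """Yield entities that have both a label and a Wikipedia article in `lang`."""
    site = f"{lang}wiki"  # assumption: simple codes such as "ro" -> "rowiki"
    for line in lines:
        # Each entity line ends with a comma; the array brackets sit on
        # their own lines at the start and end of the dump.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        if lang in entity.get("labels", {}) and site in entity.get("sitelinks", {}):
            yield entity
```

Because it is a generator, it can be fed an open file object line by line without loading the whole dump into memory.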

Since I don't have many cards, I will account for the scenario where there are no relevant cards left to show.
Then I will put this live and see whether Romanians actually use it.

Contributor

tuukka commented May 26, 2022

@nicolaes I hadn't thought of the possibility of creating a set of cards dynamically based on a SPARQL query. I've added it as "Approach 3" in my original list. At a glance, an advantage would be that the data would update automatically, but a disadvantage would be that two games couldn't be guaranteed to be played with the same set of cards.

I have reported the QLever crash to its developers - I hope it's something they can easily fix as QLever is very performant in general.

Do you know why you got just 10% of the number of cards compared to English? For example, is it because the Romanian labels are missing, the filter words match more often, or the viewcounts are lower?

Contributor

tuukka commented Jun 4, 2022

Update: here's a query for QLever that returns all suitable Wikidata items and their required attributes, sorted by sitelinks count (pageviews are not available for queries). You can change "en" to any other language code: https://qlever.cs.uni-freiburg.de/wikidata/OycBUK
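
For readers who cannot open the short link, here is a rough idea of what a language-parameterised query of this kind could look like, written as a Python string builder. It is not the query behind the link above: the choice of date properties (P569 date of birth, P571 inception, P575 time of discovery or invention), the selected columns and the function name are all assumptions for illustration.

```python
import re

# Assumed date properties: date of birth, inception, time of discovery or invention.
DATE_PROPERTIES = ("P569", "P571", "P575")


def build_card_query(lang: str, limit: int = 10000) -> str:
    """Return a SPARQL query for dated items with a label and description in `lang`."""
    # Validate the code before interpolating it into the query text.
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    props = " ".join(f"wdt:{p}" for p in DATE_PROPERTIES)
    return f"""PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
SELECT ?item ?label ?description ?date ?sitelinks WHERE {{
  VALUES ?date_prop {{ {props} }}
  ?item ?date_prop ?date .
  ?item wikibase:sitelinks ?sitelinks .
  ?item rdfs:label ?label .
  ?item schema:description ?description .
  FILTER(LANG(?label) = "{lang}")
  FILTER(LANG(?description) = "{lang}")
}}
ORDER BY DESC(?sitelinks)
LIMIT {int(limit)}"""
```

Calling `build_card_query("ro")` gives a query string that can be pasted into the QLever web interface; only the two `FILTER` lines change between languages.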

Owner

tom-james-watson commented Jun 4, 2022

Some really interesting discussion here!

@nicolaes - yeah, unfortunately the wikitrivia-generator process as it stands is slow. I think SPARQL is definitely the future. The example @tuukka has worked on also shows how easy the SPARQL approach would make it to internationalize.

The discussion of how to work out the details of the SPARQL approach should be kept to tom-james-watson/wikitrivia-generator#6.

nicolaes commented Jun 4, 2022

@tuukka sorry for the late reply, I messed up my notifications.
I appreciate the time you invested in the SPARQL query. I was able to download the 10k sample you prepared without any QLever issues.

About the low count of Romanian entities: it's because not all pages are translated and I didn't adjust the view count thresholds correctly (e.g. I reduced them by 40x compared to English, while there are 60x fewer Romanian speakers).
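
The threshold adjustment described above is plain proportional scaling. A small worked example, where the English threshold of 12,000 views is a made-up number used only for the arithmetic, and where scaling linearly with the number of speakers is itself an assumption:

```python
def scaled_view_threshold(english_threshold: float, speaker_ratio: float) -> float:
    """Scale the English view-count threshold down by a speaker ratio."""
    return english_threshold / speaker_ratio


ENGLISH_THRESHOLD = 12_000  # hypothetical value, not taken from the generator

used = scaled_view_threshold(ENGLISH_THRESHOLD, 40)          # 300.0
proportional = scaled_view_threshold(ENGLISH_THRESHOLD, 60)  # 200.0

# The threshold that was used is 1.5x stricter than the proportional one,
# so fewer Romanian pages pass the filter.
strictness = used / proportional  # 1.5
```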

PS: top hit from SPARQL query in Romanian is the wiki of Russia 🤔
