
Fix time out of queries #83

Open
Nudin opened this issue Aug 27, 2024 · 2 comments

Nudin commented Aug 27, 2024

The background component of machtsinn runs several SPARQL queries; sadly, several of them nowadays time out. Therefore, many languages no longer receive any updates. (See the bottom of the statistics page for when each query last succeeded.)

Possible ways to fix/improve this:

  • Optimize the queries – maybe ask some experts?
  • Split the queries up even further. (I tried splitting de into a query covering only the nouns and one covering everything else. But most lexemes are nouns, so the noun query also fails most of the time. Splitting the nouns by genus might work for de.)
  • Is there some way to get a higher timeout? There once were plans for a higher timeout for tools like this – but I'm not up to date.
  • Add a limit to the queries. That would mean we would at least get some new matches, but not all of them. We need to check whether this breaks any of the logic (for example pruning).
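
The limit idea could also be extended into paging, so no lexemes are permanently excluded. A minimal sketch (the query text and helper are hypothetical, not machtsinn's actual code; real OFFSET paging would also need a stable ORDER BY):

```python
# Hypothetical sketch: split one slow SPARQL query into LIMIT/OFFSET
# pages so each request stays under the endpoint timeout.

def paged_queries(base_query: str, page_size: int, pages: int) -> list[str]:
    """Return `pages` copies of the query, each with its own LIMIT/OFFSET."""
    return [
        f"{base_query}\nLIMIT {page_size} OFFSET {i * page_size}"
        for i in range(pages)
    ]

demo = "SELECT ?lexeme ?lemma WHERE { ?lexeme wikibase:lemma ?lemma }"
for q in paged_queries(demo, 1000, 3):
    print(q.splitlines()[-1])  # LIMIT 1000 OFFSET 0 / 1000 / 2000
```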
@Jerkiller
Contributor

I agree that this is the most impactful issue to date.

Let me add some thoughts:

  • Default query timing out: I think that asking for "all the other languages" in one query is now practically impossible.
    • The lexicographical namespace has moved forward over the years, and the number of lexemes has grown a lot (not only for German); new languages have grown a lot too.
    • We could divide it into many single-language queries, at least for the languages with the most senseless lexemes (ru, et, ml, es, la, el, an, eu, id, ja, fa, uk, sk, cs, nn). What do you think?
  • Single-language queries (de) timing out: that's a big issue... I agree with the strategies you proposed, even if they may not be optimal.
    • I tried optimizing the query without much success... I will probably make a few more attempts before asking for help!
    • Partitioning: good idea. I also tried partitioning by the initial lemma letter, which may work well for German: it yields a fixed number of partitions that are small and balanced in size. But in general it cannot be applied to all languages.
    • Limiting: not the best solution, because some lexemes are excluded. But in any case, as users match lexemes with senses, the excluded lexemes will eventually surface in the query results, because the senseless lexemes will become fewer and fewer.
    • Increasing timeouts: I have heard about orbopengraph, and I have used QLever, but in that case queries need some tweaking.
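
The lemma-initial partitioning could be sketched like this (the query template, and Q188 as the German language item, are my assumptions for illustration, not the tool's actual query):

```python
# Sketch: generate one SPARQL query per initial lemma letter using a
# STRSTARTS filter, so each partition stays small enough to finish.
import string

TEMPLATE = """SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:Q188 ;
          wikibase:lemma ?lemma .
  FILTER(STRSTARTS(LCASE(?lemma), "{letter}"))
}}"""

def partitioned_queries(letters: str = string.ascii_lowercase) -> list[str]:
    """One partition per initial letter; roughly balanced for German."""
    return [TEMPLATE.format(letter=letter) for letter in letters]

queries = partitioned_queries()
print(len(queries))  # 26 partitions
```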

In my script I used a different approach, which is surely slower but may be of interest.

  • Process 1 searches for all the senseless lexemes in a language and writes them to a file/table.
  • Process 2 reads the senseless-lexeme file row by row, searches for each senseless lexeme among item labels (and aliases too), and writes the possible matches to another data structure.
  • Process 3 is a dialog that asks the user whether a match is valid or not.
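
The three processes above could be sketched like this (the data shapes, names, and matching logic are my assumptions, not the actual script):

```python
# Minimal sketch of the three-step pipeline: collect senseless lexemes,
# match their lemmas against item labels, then let the user confirm.

def step1_collect_senseless(lexemes):
    """Process 1: keep only lexemes that have no senses yet."""
    return [lx for lx in lexemes if not lx.get("senses")]

def step2_find_matches(senseless, items_by_label):
    """Process 2: look up each senseless lemma among item labels/aliases."""
    matches = []
    for lx in senseless:
        for qid in items_by_label.get(lx["lemma"].lower(), []):
            matches.append((lx["id"], qid))
    return matches

def step3_review(matches, decide=lambda m: True):
    """Process 3: keep only the matches the user confirms."""
    return [m for m in matches if decide(m)]

lexemes = [{"id": "L1", "lemma": "Haus", "senses": []},
           {"id": "L2", "lemma": "Baum", "senses": ["S1"]}]
labels = {"haus": ["Q3947"]}
confirmed = step3_review(step2_find_matches(step1_collect_senseless(lexemes), labels))
print(confirmed)  # [('L1', 'Q3947')]
```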


Nudin commented Sep 5, 2024

The new query-main.wikidata.org endpoint seems to be faster than the previous default, so I switched to it. I then found that the default.sparql query didn't work due to encoding issues in the database. I fixed those, and now the default query runs again. 🥳

Only the da, de and sv queries still fail. We can partition them, strip down the filter, or replace them with the apparently more efficient queries used for en/fr/etc. We should look into where those differ.
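
If the two endpoints keep diverging in speed, a simple fallback wrapper could try query-main first and fall back to the old default on timeout. A hedged sketch (the endpoint list and injected `fetch` callable are assumptions so the logic stays testable, not machtsinn's actual code):

```python
# Sketch: build a SPARQL GET URL and try endpoints in order of preference.
import urllib.parse

ENDPOINTS = [
    "https://query-main.wikidata.org/sparql",  # newer, apparently faster
    "https://query.wikidata.org/sparql",       # previous default
]

def build_url(query: str, endpoint: str) -> str:
    """Compose a GET URL asking for JSON results."""
    return endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})

def query_with_fallback(query: str, fetch):
    """Try each endpoint in order; `fetch` should raise TimeoutError on timeout."""
    last_error = None
    for endpoint in ENDPOINTS:
        try:
            return fetch(build_url(query, endpoint))
        except TimeoutError as err:
            last_error = err
    raise last_error
```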
