
Explore using multithreading in dump parsing #532

Closed · 2 tasks done
andrewtavis opened this issue Dec 19, 2024 · 3 comments
Labels: feature (New feature or request), help wanted (Extra attention is needed)

@andrewtavis (Member)

Description

Currently the total time to parse a Wikidata lexeme dump in Google Colab is ~250 seconds. It would be great to explore multithreading this process to bring that time down further. The number of workers should be based on the total number of CPUs the user has available, and likely not the maximum, so that we don't overload their system.
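As a rough sketch of picking a worker count from the available CPUs (leaving one free), something like the following could be a starting point. Note this uses process-based parallelism for illustration, and `parse_batch`, `parse_dump_parallel`, and the pre-batched input are hypothetical placeholders, not Scribe-Data's actual parser:

```python
import os
from multiprocessing import Pool


def parse_batch(lines):
    # Hypothetical stand-in for the real per-batch parsing logic.
    return [line for line in lines if line.strip()]


def parse_dump_parallel(batches):
    # Use most, but not all, of the available CPUs so the user's system
    # isn't overloaded; fall back to 1 worker if the count is unknown.
    num_workers = max(1, (os.cpu_count() or 2) - 1)
    with Pool(processes=num_workers) as pool:
        return pool.map(parse_batch, batches)
```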

Contribution

@axif0 will be working on this as a part of Outreachy! 📶✈️

@andrewtavis andrewtavis added feature New feature or request help wanted Extra attention is needed labels Dec 19, 2024
@andrewtavis (Member, Author)

Checking/documenting the suggestions from PR #536, @axif0:

I tried multithreading as we discussed, but it took more time. So I increased batch_size=50000, which brings it to approximately <250 seconds, as it speeds up file parsing by reading and processing lines in batches (e.g., 50,000 lines at a time). This way, fewer I/O operations occur, and the parser's internal state updates more efficiently with each chunk rather than for every single line.

So the suggestion is to keep the batch size where it is and not use multithreading as the process is more efficient without it?
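For reference, a minimal sketch of the batched reading described above, assuming the dump is a bz2-compressed JSON array with one entity per line (the function name and details here are illustrative, not the code from PR #536):

```python
import bz2
import json


def iter_batches(dump_path, batch_size=50_000):
    """Yield lists of parsed lexeme entries, batch_size lines at a time."""
    batch = []
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the JSON array brackets and blank lines
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch
```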

@axif0 (Collaborator) commented Jan 4, 2025

So the suggestion is to keep the batch size where it is and not use multithreading as the process is more efficient without it?

Since we use a compressed BZ2 lexeme JSON Wikidump, I followed these docs:

I used multithreading implemented with concurrent.futures.ThreadPoolExecutor to process batches in parallel, but I got almost the same result. I think bz2 itself is not designed for parallelism (not totally sure), so I avoided implementing multithreading.
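For documentation purposes, a sketch of the kind of ThreadPoolExecutor approach described above (the names and details are illustrative, not the exact code that was tried). The single bz2 stream can only be decompressed sequentially and the JSON parsing is CPU-bound under the GIL, which is consistent with seeing little to no speedup from threads:

```python
import bz2
import json
from concurrent.futures import ThreadPoolExecutor


def process_batch(lines):
    # CPU-bound JSON parsing: threads won't speed this up because of the GIL,
    # and the bz2 stream itself is decompressed sequentially on the main thread.
    return [
        json.loads(line.rstrip(",\n"))
        for line in lines
        if line.strip() not in ("[", "]", "")
    ]


def parse_with_threads(dump_path, batch_size=50_000, max_workers=4):
    futures = []
    with bz2.open(dump_path, "rt", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=max_workers) as executor:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) >= batch_size:
                futures.append(executor.submit(process_batch, batch))
                batch = []
        if batch:
            futures.append(executor.submit(process_batch, batch))
    return [entry for fut in futures for entry in fut.result()]
```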

@andrewtavis (Member, Author)

All sounds good to me, and thanks for the further explanation to document the results, @axif0! :)

@github-project-automation bot moved this from Todo to Done in Scribe Board Jan 5, 2025