
Explore using multithreading in dump parsing #532

Closed · 2 tasks done
andrewtavis opened this issue Dec 19, 2024 · 3 comments
Labels: feature (New feature or request), help wanted (Extra attention is needed)

@andrewtavis (Member)

Description

Currently the total time to parse a Wikidata lexeme dump in Google Colab is ~250 seconds. It would be great to explore multithreading this process to bring that time down further. The number of workers should be based on the total number of CPUs the user has available, and likely not the maximum, so that we don't overload their system.
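As a rough sketch of picking a worker count from the available CPUs (leaving one free), something like the following could be a starting point. Note this uses process-based parallelism for illustration, and `parse_batch`, `parse_dump_parallel`, and the pre-batched input are hypothetical placeholders, not Scribe-Data's actual parser:

```python
import os
from multiprocessing import Pool


def parse_batch(lines):
    # Hypothetical stand-in for the real per-batch parsing logic.
    return [line for line in lines if line.strip()]


def parse_dump_parallel(batches):
    # Use most, but not all, of the available CPUs so the user's system
    # isn't overloaded; fall back to 1 worker if the count is unknown.
    num_workers = max(1, (os.cpu_count() or 2) - 1)
    with Pool(processes=num_workers) as pool:
        return pool.map(parse_batch, batches)
```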

Contribution

@axif0 will be working on this as a part of Outreachy! 📶✈️

@andrewtavis andrewtavis added feature New feature or request help wanted Extra attention is needed labels Dec 19, 2024
@andrewtavis (Member, Author)

Checking/documenting the suggestions from PR #536, @axif0:

I tried multithreading as we discussed, but it took more time. So I increased batch_size=50000, which brings it to approximately <250 seconds, as it speeds up file parsing by reading and processing lines in batches (e.g., 50,000 lines at a time). This way, fewer I/O operations occur, and the parser's internal state updates more efficiently with each chunk rather than for every single line.

So the suggestion is to keep the batch size where it is and not use multithreading as the process is more efficient without it?
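For reference, a minimal sketch of the batched reading described above, assuming the dump is a bz2-compressed JSON array with one entity per line (the function name and details here are illustrative, not the code from PR #536):

```python
import bz2
import json


def iter_batches(dump_path, batch_size=50_000):
    """Yield lists of parsed lexeme entries, batch_size lines at a time."""
    batch = []
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the JSON array brackets and blank lines
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch
```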

@axif0 (Collaborator) commented Jan 4, 2025

So the suggestion is to keep the batch size where it is and not use multithreading as the process is more efficient without it?

Since we use a compressed BZ2 lexeme JSON Wikidump, I followed these docs:

I used multithreading implemented with concurrent.futures.ThreadPoolExecutor to process batches in parallel, but I got almost the same result. I think bz2 itself is not designed for parallelism (not totally sure), so I avoided implementing multithreading.
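For documentation purposes, a sketch of the kind of ThreadPoolExecutor approach described above (the names and details are illustrative, not the exact code that was tried). The single bz2 stream can only be decompressed sequentially and the JSON parsing is CPU-bound under the GIL, which is consistent with seeing little to no speedup from threads:

```python
import bz2
import json
from concurrent.futures import ThreadPoolExecutor


def process_batch(lines):
    # CPU-bound JSON parsing: threads won't speed this up because of the GIL,
    # and the bz2 stream itself is decompressed sequentially on the main thread.
    return [
        json.loads(line.rstrip(",\n"))
        for line in lines
        if line.strip() not in ("[", "]", "")
    ]


def parse_with_threads(dump_path, batch_size=50_000, max_workers=4):
    futures = []
    with bz2.open(dump_path, "rt", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=max_workers) as executor:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) >= batch_size:
                futures.append(executor.submit(process_batch, batch))
                batch = []
        if batch:
            futures.append(executor.submit(process_batch, batch))
    return [entry for fut in futures for entry in fut.result()]
```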

@andrewtavis (Member, Author)

All sounds good to me, and thanks for the further explanation to document the results, @axif0! :)

@github-project-automation bot moved this from Todo to Done in Scribe Board Jan 5, 2025