Currently the total time to parse a Wikidata lexeme dump in Google Colab is ~250 seconds. It would be great if we could explore multithreading this process to bring that time down further. The worker count should be based on the total number of available CPUs: we should run on an appropriate number of the CPUs the user has available, which is likely not the maximum, so as not to overload their system.
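As a rough sketch of how the worker count could be derived from the available CPUs (the headroom policy of leaving one core free is an assumption here, not a decided value):

```python
import os

def suggested_worker_count() -> int:
    """Pick a worker count that leaves headroom on the user's system."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - 1)  # leave one core free (illustrative policy)
```

A pool could then be capped with something like `ThreadPoolExecutor(max_workers=suggested_worker_count())`.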
Contribution
@axif0 will be working on this as a part of Outreachy! 📶✈️
Checking/documenting the suggestions from PR #536, @axif0:
I tried multithreading as we discussed, but it still took a lot of time. So I increased batch_size=50000 instead, which brings the total to approximately <250 seconds. Batching speeds up file parsing by reading and processing lines in chunks (e.g., 50,000 lines at a time): fewer I/O operations occur, and the parser's internal state is updated once per chunk rather than for every single line.
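For reference, a minimal sketch of the batched reading described above; `process_batch` and the default batch size are illustrative stand-ins, not the actual parser code from the PR:

```python
import bz2
from itertools import islice

def process_batch(lines: list[str]) -> None:
    """Hypothetical stand-in for the real per-batch parsing logic."""
    ...

def parse_dump(path: str, batch_size: int = 50_000) -> None:
    """Read a bz2-compressed dump and hand lines to the parser in batches."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        while True:
            batch = list(islice(f, batch_size))  # up to batch_size lines per read
            if not batch:  # empty batch means end of file
                break
            process_batch(batch)  # parser state updates once per chunk
```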
So the suggestion is to keep the batch size where it is and not use multithreading, as the process is more efficient without it?
I implemented multithreading with concurrent.futures.ThreadPoolExecutor to process batches in parallel, but got almost the same result. I think bz2 itself is not designed for parallelism, since a single compressed stream has to be decompressed sequentially (not totally sure though).
Therefore I avoided implementing multithreading.
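For documentation purposes, here is a sketch of the kind of ThreadPoolExecutor attempt described above, reusing the hypothetical `process_batch` from the earlier sketch. Because the single bz2 stream still has to be read and decompressed serially, the pool mostly adds scheduling overhead rather than speedup:

```python
import bz2
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def parse_dump_threaded(path: str, batch_size: int = 50_000, workers: int = 4) -> None:
    """Submit each batch to a thread pool; decompression itself stays serial."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = []
            while True:
                batch = list(islice(f, batch_size))  # single bz2 stream: read serially
                if not batch:
                    break
                futures.append(pool.submit(process_batch, batch))
            for future in futures:
                future.result()  # re-raise any exception from worker threads
```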