Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Translation cmd for scribe-data #536

Merged
merged 16 commits into from
Jan 4, 2025
Merged

Conversation

axif0
Copy link
Collaborator

@axif0 axif0 commented Dec 28, 2024

Contributor checklist


Description

Total

If user doesn't give -wdp then it will use query.

scribe-data total -a  -wdp

Additional: scribe-data total -a -wdp [wiki dump directory] and if we want to get specific language then we can add --language English
image

Translation

scribe-data get -dt translations

Additional:
scribe-data get -dt translations -wdp [wiki dump directory] -od [output dir for exported JSON]
we can add -l [language], -dt translations, -wdp [wiki dump directory]. -od [output dir for exported JSON]
e.g:

scribe-data get -l bengali -dt translations -wdp dump_path -od exported_json

image

I tried Multi-threading as we are discussed. But it takes much time, So I increased batch_size=50000 so approximately it takes <250 second. As it speeds up file parsing by reading and processing lines in batches (e.g., 50,000 lines at a time). This way, fewer I/O operations occur, and the parser’s internal state updates more efficiently with each chunk before moving on, rather than for every single line.

Related issue

Copy link

github-actions bot commented Dec 28, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@axif0 axif0 changed the title Feat Feat: Translation cmd for scribe-data Dec 28, 2024
@axif0
Copy link
Collaborator Author

axif0 commented Dec 28, 2024

I used orJson As it is faster then Json Ref: geeksforgeeks

Not sure, it worth it actually.

@andrewtavis
Copy link
Member

Looks like orjson needs to be added into the requirements? Ah just seeing that you removed it :)

Do you want to do a time benchmark to check what we're talking about for speed. I don't see a problem with adding a well maintained library to the requirements if we get enough of a performance boost from it. It's less time for us to test it as well, but it would be nice to know the difference 😊

@axif0
Copy link
Collaborator Author

axif0 commented Dec 28, 2024

Ya, I can do a time benchmark with and without orjson.

Also for get all cmd, I need the feedback for forms. Also in so the function all_bool might changes..I have added questionary library for message input in get.py.

Therefore recent changes might fail the tests.. in test_get.py.

@andrewtavis
Copy link
Member

All good! Take your time on this and we'll figure it out 😊 Looking forward to the benchmark!

@axif0
Copy link
Collaborator Author

axif0 commented Dec 28, 2024

for total

With orjson:
image

With json:
image

for translations

With orjson:
image

With json:
image

I think we should go for orjson.. 😅

@andrewtavis
Copy link
Member

Sounds good to me, @axif0 :) The difference will only grow over time 😊

Can you make the switch over to orjson, add it into the requirements and also fix the tests? From there let me know and I'll start a final review 😊

@axif0
Copy link
Collaborator Author

axif0 commented Dec 29, 2024

Added new command,

Except total, this cmd Gets all translations & all data-types:

scribe-data get --all -wdp scribe -od data

Centralized interactive cmd:

scribe-data i

image

@axif0
Copy link
Collaborator Author

axif0 commented Dec 29, 2024

@andrewtavis Really sorry for extra commits. Tried to make the PR optimize as possible..

Will add & update the test cases later if marge. :)

The local Wikidata dump that can be used to process data.
wikidata_dump : Union[str, bool]
The local Wikidata dump path that can be used to process data.
If True, indicates the flag was used without a path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the care you're putting into the doc strings, @axif0! :)

@andrewtavis
Copy link
Member

I have some very minor local changes @axif0 :) Will send those along later and then merge it in 😊

@axif0
Copy link
Collaborator Author

axif0 commented Jan 3, 2025

Thanks for the checking and looking for the feedback's and changes..

@axif0
Copy link
Collaborator Author

axif0 commented Jan 3, 2025

Added Functions to parse the translations of a word from MediaWiki API.

scribe-data get -t book

image

Hope it doesn't make conflict with you.. also i'm not so sure about the cmd output .. ~ @andrewtavis

@andrewtavis
Copy link
Member

Nice, so this closes #526 now too, @axif0? :) No stress! I'll figure out the conflicts if there are some and bring it in :)

@andrewtavis
Copy link
Member

Seeing in the description that it does close it. Really thanks so much for the amazing work, @axif0! 🚀👏👏

@axif0
Copy link
Collaborator Author

axif0 commented Jan 3, 2025

Yaa.. thank you..

@andrewtavis
Copy link
Member

Lots of changes in the commit above, @axif0, but really don't stress on a lot of it. One thing that I realized is that we technically weren't following the numpy doc formatting as all of our parameters and return statements were indented. I saw in some of your code that you didn't indent, and then investigated further and saw that that's actually correct 😊

I'll go through my commit and leave some comments for you for learning purposes! :)

CONTRIBUTING.md Outdated
@@ -300,13 +300,18 @@ def example_function(argument: argument_type) -> return_type:

Parameters
----------
argument: argument_type
Description of your argument.
argument: argument_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I'm talking about for the parameter indentation, so just disregard all of this :) Glad we're using the correct formatting 😊

@@ -27,8 +27,8 @@
from pathlib import Path

import pkg_resources
import questionary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to import questionary always as select and confirm are methods that are super general sounding and really could be from anything, so I figured making them explicit would be good for readability.

-------
> is_valid_language(Path("path/to/query.sparql"), "Q123456")
True
Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The section for numpy docs is technically Examples (reference). Just marking out the changes! Nothing you did wrong :)

@@ -159,7 +159,7 @@ def prompt_user_download_all():
else:
print("Updating all languages and data types...")
rprint(
"[bold red]Note that the download all functionality must use Wikidata dumps to observe responsible Wikidata Query Service usage practices.[/bold red]"
"[bold red]Note that the download all functionality must use Wikidata lexeme dumps to observe responsible Wikidata Query Service usage practices.[/bold red]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to be explicit here with the dumps as there are Wikidata dumps that are all of Wikidata, but we're using the lexeme subset :) Again just a note on the change! 😊

"qid": sub_data.get("qid", ""),
}
)
current_languages.extend(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use extend here to avoid the for/append loop and clean up the code a bit :)

@@ -661,17 +660,17 @@ def check_lexeme_dump_prompt_download(output_dir: str):
],
).ask()

if user_input.startswith("Delete"):
if user_input == "Delete existing dumps":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured being a bit more explicit made sense here :)

@@ -727,6 +730,7 @@ def check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool:
default="Skip process",
).ask()

# If user selects "Skip process", return True meaning "don't proceed"
# If user selects "Skip process", return True meaning "don't proceed".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remember to put periods at the end of comments that are their own line :)

iso_code = data.get("iso")
if iso_code:

if iso_code := data.get("iso"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were multiple cases where we can use this "Walrus Operator"/Assignment Expression to simplify the code :)

@@ -203,29 +207,29 @@ def _process_lexeme_forms(self, lexeme: dict) -> None:

for rep_lang, rep_data in representations.items():
if rep_lang == lang_code:
form_value = rep_data.get("value")
if form_value:
if form_value := rep_data.get("value"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another assignment expression :) You can install the sourcery VS Code extension to get in line warnings to fix these 🪄

@@ -286,7 +288,7 @@ def process_lines(self, line: str) -> None:
)
self.forms_counts[lang_code][category_name] += len(forms_data)

break # Only process first valid lemma
break # only process first valid lemma
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline comments shouldn't be capitalized :)


filtered = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly simplifying the for/if loop to just be an assignment. I took this also from a sourcery suggestion :)

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work as always, @axif0 🤩🚀 Thanks so much for keeping at making new great features during a bit of a slow review period for me :) Really appreciate your drive to succeed 😊

Let's try to get a few more of these Scribe-Data issues down and then we can start doing some testing of all the functionality together in a call!

@andrewtavis
Copy link
Member

Please check the comments above, btw, but these are more for learning opportunities :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants