-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make sure the leading and tailing word boundary exists. #1019
base: main
Are you sure you want to change the base?
Conversation
…on is absorbed into the word boundary symbol.
Hi @lesca I am glad to hear that. Thank you for your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for looking into this! I added a few questions inline.
if model_lang not in LANGUAGES_WITHOUT_SPACES: | ||
if not text.startswith(" "): | ||
text = " " + text | ||
if not text.endswith(" "): | ||
text = text + " " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this result in double spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Barabazs
No, it won't. If there's already a leading or trailing space, no additional space will be added. So no need to worry about double spaces.
@@ -173,8 +180,8 @@ def align( | |||
elif char_ in model_dictionary.keys(): | |||
clean_char.append(char_) | |||
clean_cdx.append(cdx) | |||
else: | |||
# add placeholder | |||
elif char_ not in '!"\',.:;<=>?[\\]^_`{|}': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mention the word boundary symbol from Wav2Vec2, but the page you linked only uses |
. Can you explain why these other symbols are needed?
And imo an else-clause should be added as a fallback in case of unforeseen issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original code, symbols beyond "a-z" were ignored, and word boundaries were added to the string for alignment purposes. However, some of these ignored symbols (like numbers or the dollar sign $) may have pronunciations and require timestamps. To address this, a wildcard placeholder is needed to ensure these symbols are properly processed.
On the other hand, certain symbols (like !"',.:;<=>?[]^_`{|}) can be absorbed into word boundaries without needing a wildcard. The condition in the code filters out these absorbable symbols and only adds wildcard placeholders for symbols that have explicit pronunciations.
As for the else-clause, it simply skips the symbol, so it’s not explicitly implemented in the current logic.
Let me know if you’d like further clarification or adjustments.
Fix #1016
On the wav2vec alignment tutorial page, word boundary symbols are present at the beginning and end. During the whisperX alignment process, the tailing word boundary might be missing, causing the tailing silence to be absorbed into the last word. Adding the tailing word boundary would resolve this issue.