make sure the leading and tailing word boundary exists. #1019

bfs18 · 2025-01-26T04:23:44Z

Fix #1016
On the wav2vec alignment tutorial page, word boundary symbols are present at both the beginning and the end of the transcript. During the whisperX alignment process, the trailing word boundary can be missing, which causes the trailing silence to be absorbed into the last word. Adding the trailing word boundary resolves this issue.

lesca · 2025-01-31T12:31:32Z

@bfs18 I cloned the latest main branch (44e8bf5) and patched it with this PR. In my recent transcriptions of several videos, this PR works well and I no longer see subtitle overlaps. Thanks for your great work!

bfs18 · 2025-01-31T14:07:24Z

Hi @lesca I am glad to hear that. Thank you for your feedback.

bfs18 · 2025-02-07T15:28:05Z

Hi @Barabazs @m-bain
I believe this fix works. It produces correct results for the test case mentioned in issue #1016 (though the original poster has not yet confirmed). Additionally, it should resolve the related issue #999.

Barabazs

Thank you very much for looking into this! I added a few questions inline.

Barabazs · 2025-02-19T07:01:20Z

whisperx/alignment.py

+        if model_lang not in LANGUAGES_WITHOUT_SPACES:
+            if not text.startswith(" "):
+                text = " " + text
+            if not text.endswith(" "):
+                text = text + " "


Won't this result in double spaces?

bfs18 · 2025-02-21

Hi @Barabazs
No, it won't. If there's already a leading or trailing space, no additional space will be added. So no need to worry about double spaces.
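A minimal sketch of that guard, pulled out of `align()` into a hypothetical helper (`pad_word_boundaries` is not a name from the PR) so the no-double-space behaviour can be checked in isolation. The default value of `languages_without_spaces` is an assumption standing in for `LANGUAGES_WITHOUT_SPACES`.

```python
def pad_word_boundaries(text, model_lang="en", languages_without_spaces=("ja", "zh")):
    """Ensure the text has a boundary space at each end (hypothetical helper).

    `languages_without_spaces` is an assumed stand-in for the
    LANGUAGES_WITHOUT_SPACES constant referenced in the diff.
    """
    if model_lang not in languages_without_spaces:
        # Only pad when the boundary is missing, so existing spaces are untouched.
        if not text.startswith(" "):
            text = " " + text
        if not text.endswith(" "):
            text = text + " "
    return text
```

Applying the helper twice gives the same string as applying it once, which is why the guard cannot introduce a second space.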

Barabazs · 2025-02-19T07:04:22Z

whisperx/alignment.py

@@ -173,8 +180,8 @@ def align(
            elif char_ in model_dictionary.keys():
                clean_char.append(char_)
                clean_cdx.append(cdx)
-            else:
-                # add placeholder
+            elif char_ not in '!"\',.:;<=>?[\\]^_`{|}':


You mention the word boundary symbol from Wav2Vec2, but the page you linked only uses |. Can you explain why these other symbols are needed?

And imo an else-clause should be added as a fallback in case of unforeseen issues.

bfs18 · 2025-02-21

In the original code, symbols beyond "a-z" were ignored, and word boundaries were added to the string for alignment purposes. However, some of these ignored symbols (like digits or the dollar sign $) have pronunciations and need timestamps, so a wildcard placeholder is needed to make sure they are aligned.

On the other hand, certain symbols (like !"',.:;<=>?[\]^_`{|}) can be absorbed into word boundaries without needing a wildcard. The condition in the code filters out these absorbable symbols and adds wildcard placeholders only for the remaining out-of-dictionary symbols, which are treated as having explicit pronunciations.

As for the else-clause, its only job would be to skip the symbol, so it is left implicit in the current logic.
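The three cases can be sketched as follows. This is a simplified illustration, not the code in `whisperx/alignment.py`: the function name, the `"*"` wildcard token, the lowercasing, and the space-to-`|` mapping are assumptions made so the example is self-contained.

```python
# Punctuation set copied from the diff; these characters get no wildcard.
ABSORBED_PUNCTUATION = '!"\',.:;<=>?[\\]^_`{|}'


def clean_characters(text, model_dictionary, wildcard="*"):
    """Return (clean_char, clean_cdx) for one segment (hypothetical helper)."""
    clean_char, clean_cdx = [], []
    for cdx, char in enumerate(text):
        # Assumption: spaces map to the wav2vec2 word boundary symbol "|".
        char_ = char.lower().replace(" ", "|")
        if char_ in model_dictionary:
            # Known to the alignment model: keep as-is.
            clean_char.append(char_)
            clean_cdx.append(cdx)
        elif char_ not in ABSORBED_PUNCTUATION:
            # Unknown but likely pronounced (digits, "$", ...): add a placeholder.
            clean_char.append(wildcard)
            clean_cdx.append(cdx)
        # Implicit else: absorbable punctuation is skipped.
    return clean_char, clean_cdx


# Toy dictionary: lowercase letters plus the word boundary symbol.
toy_dictionary = {c: i for i, c in enumerate("|abcdefghijklmnopqrstuvwxyz")}
chars, indices = clean_characters(" $5, ok! ", toy_dictionary)
```

With the toy dictionary, `$` and `5` become wildcards, while `,` and `!` are dropped and their duration falls to the neighbouring `|` symbols.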

Let me know if you’d like further clarification or adjustments.

make sure the leading and tailing word boundary exists.

120bed5

bfs18 mentioned this pull request Jan 26, 2025

Alignment issues after #986 (support timestamp for numbers) #1016

Open

Beronx86 added 2 commits January 26, 2025 12:28

make sure the leading and tailing word boundary exists.

8eefb4b

Remove the wildcard for unpronounceable punctuations, as their duration is absorbed into the word boundary symbol.

85391d1

bfs18 mentioned this pull request Jan 26, 2025

Fix subtitle overlaps #999

Closed

Merge branch 'main' into main

61b928a

Barabazs requested changes Feb 19, 2025

bfs18 replied Feb 21, 2025