Number sequence is not transcribed when a chunk starts with it #1174

Open
vkras opened this issue Nov 26, 2024 · 1 comment
Comments

@vkras

vkras commented Nov 26, 2024

I'm attaching an audio file (the issue is also reproducible with longer files split into chunks).
Disabling VAD helps, but that does not explain the issue, because VAD correctly identifies where speech starts (around 2.5 seconds).
It affects both the batched and non-batched methods.

With VAD:
chunks_metadata [{'start_time': 2.416, 'end_time': 12.72}]
duration_after_vad 10.304
Sentence: [0 7.83s -> 12.13s] It's important that that first piece can't be misinterpreted as a decimal.

Without VAD:
chunks_metadata [{'start_time': 0.0, 'end_time': 13.11925}]
duration_after_vad 13.11925
Sentence: [0 3.42s -> 12.14s] 8892. It's important that that first piece can't be misinterpreted as a decimal.
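
A quick arithmetic check of the logged values above, as a rough sketch only. It assumes the sentence timestamps are reported in original-audio time (they line up with the no-VAD run, 12.13s vs 12.14s), and the helper names are made up for illustration.

```python
def duration_after_vad(chunks):
    # Sum of the speech-chunk lengths, rounded to milliseconds.
    return round(sum(c["end_time"] - c["start_time"] for c in chunks), 3)

def untranscribed_lead(chunks, first_sentence_start):
    # Time between the first VAD chunk start and the first transcribed sentence.
    return round(first_sentence_start - chunks[0]["start_time"], 3)

with_vad = [{"start_time": 2.416, "end_time": 12.72}]
print(duration_after_vad(with_vad))        # should match the logged 10.304
print(untranscribed_lead(with_vad, 7.83))  # roughly 5.4 s of detected speech with no text
```

So with VAD there are about 5.4 seconds inside the detected speech chunk that produce no text, which is where the digits are spoken.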

digit-speech.zip

@Purfview
Contributor

Purfview commented Nov 27, 2024

  1. Whisper's model can simply miss something in a transcription for no apparent reason.
  2. A one-byte change in the audio can trigger a different result.
  3. A one-token change in the prompt can trigger a different result.

Btw, for me it's the opposite: with VAD "8892" appears, without VAD it disappears. 😄

Maybe it's unusual for the model to start with digits; try initial_prompt="OK".
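
A minimal sketch for comparing the combinations side by side, assuming the model object is a faster-whisper `WhisperModel` whose `transcribe()` accepts `vad_filter` and `initial_prompt` and returns `(segments, info)`. The helper name `compare_runs` and the file name in the comment are hypothetical.

```python
def compare_runs(model, audio_path, prompt="OK"):
    """Transcribe one file with and without VAD and with and without an initial prompt."""
    results = {}
    for vad in (True, False):
        for initial_prompt in (None, prompt):
            segments, _info = model.transcribe(
                audio_path, vad_filter=vad, initial_prompt=initial_prompt
            )
            # segments is a lazy generator; joining it forces the transcription to run.
            results[(vad, initial_prompt)] = " ".join(s.text.strip() for s in segments)
    return results

# Usage (assumes faster-whisper is installed; model size and path are placeholders):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3")
#   for key, text in compare_runs(model, "digit-speech.wav").items():
#       print(key, "->", text)
```

Since small input changes can flip the result either way, a grid like this only shows what happens for one file and one model, not a general fix.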
