Number sequence is not transcribed when a chunk starts with it #1174

vkras · 2024-11-26T22:08:03Z

I'm attaching an audio file that reproduces this (it is also reproducible with longer files split into chunks).
Disabling VAD helps, but that does not explain the issue, because VAD correctly identifies where speech starts (around 2.5 seconds).
It affects both batch and non-batch methods.

With VAD:
chunks_metadata [{'start_time': 2.416, 'end_time': 12.72}]
duration_after_vad 10.304
Sentence: [0 7.83s -> 12.13s] It's important that that first piece can't be misinterpreted as a decimal.

Without VAD:
chunks_metadata [{'start_time': 0.0, 'end_time': 13.11925}]
duration_after_vad 13.11925
Sentence: [0 3.42s -> 12.14s] 8892. It's important that that first piece can't be misinterpreted as a decimal.

digit-speech.zip

Purfview · 2024-11-27T08:48:48Z

Whisper's model can just miss something in a transcription for no apparent reason.
A one-byte change in the audio can trigger a different result.
A one-token change in the prompt can trigger a different result.

Btw, for me it's the opposite: with VAD "8892" appears, without VAD it disappears. 😄

Maybe it's unusual for the model to start with digits; try initial_prompt="OK".
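
For anyone who wants to compare the combinations discussed above, here is a rough sketch. It assumes the library is faster-whisper (the thread does not name it), that the attached archive contains a file called digit-speech.wav, and that the "small" model is good enough to show the effect; all three are assumptions, as are the helper names. It runs the file with and without VAD, with and without initial_prompt="OK", and reports whether each transcript starts with a digit sequence.

```python
import re
from typing import Dict, Optional, Tuple


def leading_number(text: str) -> Optional[str]:
    """Return the digit run a transcript starts with, or None if there is none."""
    match = re.match(r"\s*(\d+)", text)
    return match.group(1) if match else None


def transcribe_variants(
    audio_path: str, model_size: str = "small"
) -> Dict[Tuple[bool, Optional[str]], str]:
    """Transcribe one file under each (vad_filter, initial_prompt) combination.

    Assumes faster-whisper is installed; it is imported here so that
    leading_number() stays usable without it.
    """
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)  # model size is an arbitrary choice
    results = {}
    for use_vad in (True, False):
        for prompt in (None, "OK"):
            segments, _info = model.transcribe(
                audio_path,
                vad_filter=use_vad,
                initial_prompt=prompt,
            )
            # segments is a generator; joining it runs the actual decoding
            results[(use_vad, prompt)] = " ".join(s.text.strip() for s in segments)
    return results


def report(results: Dict[Tuple[bool, Optional[str]], str]) -> None:
    """Print one line per combination showing whether the digits survived."""
    for (use_vad, prompt), text in results.items():
        print(f"vad={use_vad!s:5} prompt={prompt!r:5} "
              f"leading_number={leading_number(text)!r} text={text!r}")


# Hypothetical usage (file name is an assumption):
# report(transcribe_variants("digit-speech.wav"))
```

Since results can differ between machines, as noted above, a single run of this only shows what happens on one setup; it does not prove that the prompt fixes the problem in general.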
