Question on transcription time vs. input audio duration #837

pprobst · 2024-05-13T14:22:20Z

Hello. Recently I ran a benchmark on a fine-tuned whisper-small model. Most audio files were pretty short, below 10s. The results are as follows, where the y-axis is the time to transcribe using faster-whisper, and the x-axis is the audio duration in seconds. My CPU is an AMD Ryzen 5 7600X (12) @ 5.45 GHz.

I used the following config. The model is loaded quantized (int8) and with compute_type="int8".

segments, _ = self.model.transcribe(
	file,
	task="transcribe",
	beam_size=5,
	best_of=5,
	language=self.language,
	condition_on_previous_text=False,
	without_timestamps=True,
	max_initial_timestamp=0.0,
	suppress_tokens=[-1] + SUPPRESS_TOKENS_INFER,
)

I thought that, since Whisper pads audio shorter than 30s to 30s, every file would take more or less the same time to transcribe. But that's not the case. Why?
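For anyone reproducing this kind of benchmark, here is a minimal timing sketch. It assumes a hypothetical `transcribe` callable that takes a path and returns an iterable of segments, and a list of `(path, duration)` pairs whose durations are already known; neither name comes from the post. One point worth checking: faster-whisper, as far as I know, yields segments lazily, so the timer should wrap the iteration and not only the call.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_transcription(
    transcribe: Callable[[str], Iterable[object]],
    files: List[Tuple[str, float]],
) -> List[Tuple[float, float]]:
    """Return (audio_duration_s, elapsed_s) pairs, one per file.

    `transcribe` is a hypothetical stand-in for the real model call;
    `files` pairs each path with its known audio duration in seconds.
    """
    results = []
    for path, duration in files:
        start = time.perf_counter()
        # Consume the iterable inside the timed region: if segments are
        # produced lazily, the work happens here, not when the call returns.
        for _segment in transcribe(path):
            pass
        elapsed = time.perf_counter() - start
        results.append((duration, elapsed))
    return results
```

With the real model, the callable could be something like `lambda f: model.transcribe(f, beam_size=5)[0]`, which picks the segments out of the returned tuple.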

kvrban · 2024-05-13T22:41:08Z

> I thought that, since whisper pads audio shorter than 30s to 30s,

Where did you read that? It's the other way round: audio files longer than 30s are processed in 30s chunks.

Nothing changes below 30s. That's just my dangerous half-knowledge on the subject, though.

pprobst May 13, 2024
Author

I skimmed through the paper and, from what I read, it only states explicitly that audio longer than 30s is processed in 30s chunks, as you say. But I do remember reading somewhere that all inputs to Whisper are 30s long. So I searched for this and found some references to it:

- faster-whisper/faster_whisper/transcribe.py

  Line 556 in 2036d12

  segment = pad_or_trim(segment, self.feature_extractor.nb_max_frames)
- faster-whisper/faster_whisper/audio.py

  Line 107 in 2036d12

  def pad_or_trim(array, length: int, *, axis: int = -1):
- Support for variable size chunks #54
- Question about padding mask, and using model's encoder features openai/whisper#307
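To make the padding step concrete, here is a simplified, pure-Python sketch of what a `pad_or_trim` call along one axis does. It is not the library's implementation (that one operates on arrays and takes an `axis` argument); `pad_or_trim_1d`, `frames_for` and the zero pad value are illustrative assumptions. The constants are Whisper's usual ones (16 kHz audio, hop length 160, 30s window), which give 3000 frames per window, and the frame count per clip is approximate.

```python
SAMPLE_RATE = 16_000  # Hz, Whisper's input sample rate
HOP_LENGTH = 160      # audio samples between consecutive feature frames
CHUNK_SECONDS = 30    # fixed window length seen by the encoder

# Frames in one full 30s window: 30 * 16000 / 160 = 3000
NB_MAX_FRAMES = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH


def frames_for(duration_s: float) -> int:
    """Approximate number of feature frames for a clip of this duration."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


def pad_or_trim_1d(frames: list, length: int, pad_value: float = 0.0) -> list:
    """Illustrative 1-D stand-in: trim to `length`, or right-pad up to it."""
    if len(frames) > length:
        return frames[:length]
    return frames + [pad_value] * (length - len(frames))


short_clip = [1.0] * frames_for(5.0)   # about 500 frames of real audio
padded = pad_or_trim_1d(short_clip, NB_MAX_FRAMES)
print(len(short_clip), len(padded))
```

Under these assumptions a 5s clip and a 25s clip both reach the encoder as 3000 frames, so the encoder cost should be roughly constant. The remaining variation in the benchmark might then come from the decoder, which generates tokens step by step and so plausibly takes longer when there is more speech to transcribe; that part is a guess, not something the linked code confirms.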
