When using custom fine-tuned models, faster-whisper shows significant WER degradation compared to OpenAI's reference implementation (13.5% vs. 8.2% WER). I have traced this to numerical differences that start at the very first Conv1D operation in the encoder; the differences are also present with the stock large-v3 weights. Both checkpoints store their weights in float16, suggesting a numerical or algorithmic issue.
Here are some benchmark results on a custom dataset.
| Implementation | Model | Performance (WER) |
| --- | --- | --- |
| openai | large-v3 | 0.124165 |
| faster-whisper | large-v3 | 0.0992092 |
| openai | custom | 0.0819845 |
| faster-whisper | custom | 0.135567 |
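For reference, here is a minimal sketch of how a WER benchmark of this kind can be scored with the jiwer package; the dataset loader and the `transcribe` callback are placeholders for illustration, not the exact setup used for the numbers above.

```python
# Minimal WER benchmark sketch (assumption: jiwer is installed and
# `transcribe` wraps whichever implementation is being measured).
import jiwer

def evaluate_wer(pairs, transcribe):
    """pairs: iterable of (audio_path, reference_text)."""
    references, hypotheses = [], []
    for audio_path, reference in pairs:
        references.append(reference)
        hypotheses.append(transcribe(audio_path))
    return jiwer.wer(references, hypotheses)

# Hypothetical usage with faster-whisper:
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# transcribe = lambda path: " ".join(s.text for s in model.transcribe(path)[0])
# print(evaluate_wer(load_custom_dataset(), transcribe))  # load_custom_dataset is a placeholder
```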
After comparing the logits of the two implementations and narrowing down the root cause, I managed to locate the point where they diverge: the difference appears as early as the first Conv1D operation in the encoder.
Steps to reproduce
Modify the Whisper encoder's operator function in CTranslate2 so that it only applies the first Conv1D operation, i.e.:
```cpp
void WhisperEncoder::operator()(const StorageView& features, StorageView& output) {
  PROFILE("WhisperEncoder");
  if (features.rank() != 3)
    throw std::invalid_argument("Expected input features to have 3 dimensions, but got "
                                + std::to_string(features.rank())
                                + " dimension(s) instead");
  if (features.dim(1) != input_size() || features.dim(2) > max_input_time())
    throw std::invalid_argument("Invalid input features shape: expected an input with shape ("
                                + std::to_string(features.dim(0))
                                + ", "
                                + std::to_string(input_size())
                                + ", "
                                + std::to_string(std::min(features.dim(2), max_input_time()))
                                + "), but got an input with shape ("
                                + std::to_string(features.dim(0))
                                + ", "
                                + std::to_string(features.dim(1))
                                + ", "
                                + std::to_string(features.dim(2))
                                + ") instead");

  // Only run the first Conv1D and write its result directly to the output,
  // skipping the GELU, the second convolution, and the transformer layers.
  _conv1(features, output);
}
```
Running the following scripts then shows the difference between the two implementations.
Faster-whisper
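A minimal sketch of such a driver, assuming the patched CTranslate2 build from the step above is installed, that the converted model directory is `whisper-large-v3-ct2`, and that `audio.wav` stands in for a real input. It feeds the log-mel features to `ctranslate2.models.Whisper.encode`, which with the modification above now returns only the first Conv1D output:

```python
# Sketch only: exercises the patched CTranslate2 encoder (returns conv1 output).
# Model directory and audio path are placeholders.
import numpy as np
import whisper          # used only to compute identical log-mel features
import ctranslate2

audio = whisper.load_audio("audio.wav")
mel = whisper.log_mel_spectrogram(audio, n_mels=128)   # large-v3 uses 128 mel bins
mel = whisper.pad_or_trim(mel, 3000).unsqueeze(0)      # shape (1, 128, 3000)
features = np.ascontiguousarray(mel.numpy().astype(np.float32))

model = ctranslate2.models.Whisper("whisper-large-v3-ct2", device="cuda",
                                   compute_type="float16")
output = model.encode(ctranslate2.StorageView.from_array(features), to_cpu=True)
conv1_out = np.asarray(output)

print(conv1_out.shape, conv1_out.dtype)
print(conv1_out[0, :4, :4])
np.save("conv1_faster_whisper.npy", conv1_out)
```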
PyTorch equivalent
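A corresponding sketch for the PyTorch side, applying only `model.encoder.conv1` (without the GELU, to mirror the patched C++ path) to the same features; file names are again placeholders:

```python
# Sketch only: applies just the first Conv1D from the reference OpenAI model.
import numpy as np
import torch
import whisper

model = whisper.load_model("large-v3").cuda().eval()

audio = whisper.load_audio("audio.wav")
mel = whisper.log_mel_spectrogram(audio, n_mels=128)
mel = whisper.pad_or_trim(mel, 3000).unsqueeze(0).cuda()

with torch.no_grad():
    # conv1 only, no GELU, run in float16 to mirror the CTranslate2 compute type
    conv1_out = model.encoder.conv1.half()(mel.half()).float().cpu().numpy()

print(conv1_out.shape, conv1_out.dtype)
print(conv1_out[0, :4, :4])

# Compare against the output saved by the previous sketch
reference = np.load("conv1_faster_whisper.npy").astype(np.float32)
print("max abs diff:", np.abs(conv1_out - reference).max())
```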
If I run these two scripts I get the following outputs:
faster-whisper:
pytorch:
Environment
Python 3.10.10
CUDA Version: 12.3
All running inside a Docker container: nvcr.io/nvidia/pytorch:23.10-py3
GPU: NVIDIA RTX 3090
Additional findings
Precision behavior:
Input dependency (bfloat16):
This might not be a faster-whisper issue, however, so let me know if you would prefer me to close this issue and open it in the CTranslate2 repository instead.