Improve alignment accuracy by normalizing audio features #625
base: main
Conversation
…c2Processor before the Forward pass
Fix a typo in the preprocess argument
Hi @IbrahimAmin1, thank you for the contribution. Can you provide some examples comparing the results with and without the preprocessing?
Hi @Barabazs, apologies for the delayed response. I've conducted some experiments on privately collected data, using open-source fine-tuned wav2vec2-based models from Hugging Face (e.g. the jonatasgrosman models, which whisperX uses extensively for alignment) as well as some models I fine-tuned myself. The results consistently show that applying initial audio normalization improves performance across the board. Let me know if you'd like more details or specific metrics. NOTE:
Audio data should be pre-processed using the Wav2Vec2Processor (Wav2Vec2FeatureExtractor). I have noticed a considerable alignment improvement (in mean absolute error) when audio is normalized (zero mean and unit variance) using the processor before the forward pass, as in the sketch below. Beyond that, each Hugging Face Wav2Vec2 feature extractor configuration should contain the same settings used during fine-tuning of that model (e.g. normalization, attention_mask usage, etc.).
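A minimal sketch of the idea, assuming a 16 kHz mono waveform; the checkpoint name is just an example of a fine-tuned wav2vec2 CTC model, and the actual whisperX code path differs:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Example checkpoint; any fine-tuned wav2vec2 CTC alignment model works.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def ctc_logits(waveform, sampling_rate=16_000):
    """Normalize the raw waveform, then run the forward pass."""
    # With do_normalize=True in the extractor config, this zero-means and
    # unit-variance scales the waveform before it reaches the model.
    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return logits
```

The comparison in this PR is essentially between feeding the model the raw waveform versus the normalized `input_values` produced by the feature extractor.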
To maintain backwards compatibility, I have opted to let the user determine whether pre-processing should be applied, but chose to set pre-processing as the default option, as it improves alignment considerably.

A typical Hugging Face Wav2Vec2 feature extractor config file is as follows:
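(A representative `preprocessor_config.json`; the values below are illustrative and vary per checkpoint.)

```json
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
```

The `do_normalize` and `return_attention_mask` fields are the ones that matter here: they should match what the model saw during fine-tuning.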