Improve alignment accuracy by normalizing audio features #625
base: main
Conversation
…c2Processor before the Forward pass
Fix a typo in the preprocess argument
Hi @IbrahimAmin1, thank you for the contribution. Can you provide some examples comparing the results with and without the preprocessing?
Hi @Barabazs, apologies for the delayed response. I've conducted some experiments on privately collected data, using open-source fine-tuned wav2vec2-based models from Hugging Face (e.g. the jonatasgrosman models, which whisperX uses extensively for alignment) as well as some models I fine-tuned myself. The results consistently show that applying initial audio normalization improves performance across the board. Let me know if you'd like more details or specific metrics. NOTE:
Audio data should be pre-processed using the Wav2Vec2Processor (Wav2Vec2FeatureExtractor). I have noticed a considerable alignment improvement (in mean absolute error) when audio is normalized (zero mean and unit variance) using the processor before the forward pass, as in the sketch below. Beyond that, each Hugging Face Wav2Vec2 feature extractor configuration should contain the same settings used during fine-tuning of that model (e.g. normalization, attention_mask usage, etc.).
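A minimal sketch of the idea, assuming a 16 kHz mono waveform; the checkpoint name is just an example of a fine-tuned wav2vec2 CTC model, and the actual whisperX code path differs:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Example checkpoint; any fine-tuned wav2vec2 CTC alignment model works.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def ctc_logits(waveform, sampling_rate=16_000):
    """Normalize the raw waveform, then run the forward pass."""
    # With do_normalize=True in the extractor config, this zero-means and
    # unit-variance scales the waveform before it reaches the model.
    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return logits
```

The comparison in this PR is essentially between feeding the model the raw waveform versus the normalized `input_values` produced by the feature extractor.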
To maintain backwards compatibility, I have opted to let the user determine whether pre-processing should be applied, but chose to set pre-processing as the default option, as it improves alignment considerably.

A typical Hugging Face Wav2Vec2 feature extractor config file is as follows:
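(A representative `preprocessor_config.json`; the values below are illustrative and vary per checkpoint.)

```json
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
```

The `do_normalize` and `return_attention_mask` fields are the ones that matter here: they should match what the model saw during fine-tuning.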