
Possible issue when using HuggingFace portuguese language model #62
lfcnassif opened this issue Sep 2, 2022 · 0 comments

First, thank you very much for this great project; it makes ASR very easy!

Your models are awesome, too! I ran some accuracy tests with the https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese model (sepinf-inc/IPED#1214 (comment)), and it is comparable to Microsoft's and Google's pt-BR models, actually a bit better!

Now I'm trying to use a language model as described in the README.md. I'm using the LM from the language_model folder of the HuggingFace model card above, but it prints some warnings in the console:

09/02/2022 12:10:19 - WARNING - pyctcdecode.alphabet - Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
09/02/2022 12:10:19 - WARNING - pyctcdecode.alphabet - Unigrams and labels don't seem to agree.
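For reference, here is roughly what I'm running. This is a minimal sketch based on my reading of the README's LM decoding example; the file paths are placeholders for the files I downloaded from the model card's language_model folder:

```python
from huggingsound import SpeechRecognitionModel, KenshoLMDecoder

# Load the Portuguese model from the Hugging Face Hub
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-portuguese")

# Placeholder paths to the LM files downloaded from the model card's language_model folder
lm_path = "language_model/lm.binary"
unigrams_path = "language_model/unigrams.txt"

# Build the pyctcdecode-based decoder and transcribe with LM boosting
decoder = KenshoLMDecoder(model.token_set, lm_path=lm_path, unigrams_path=unigrams_path)
transcriptions = model.transcribe(["/path/to/audio.wav"], decoder=decoder)
```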

Accuracy also dropped a lot (the WER increased). Am I doing something wrong? Which language model is compatible with the Portuguese model above?

Thanks in advance
