
Transcriptions have no spaces - wav2vec2-xls-r-1b-spanish #51

Open
santideleon opened this issue Aug 3, 2022 · 4 comments

Comments

@santideleon

santideleon commented Aug 3, 2022

I am working on speech-to-text for Spanish audio clips of up to ~135 seconds, recorded with lapel microphones or VR goggles. I am using wav2vec2-xls-r-1b-spanish together with the provided language model files lm.binary and unigrams.txt. They are the ones downloaded from jonatasgrosman/wav2vec2-large-xlsr-53-spanish, but based on the file sizes they seem to be exactly the same as the ones for the 1b model. I originally started with the large version, but I opted for 1b for better performance.

My plan is to work on the text with the pysentimiento pre-trained Spanish sentiment and emotion analyzer. The problem I have is that the transcribed text has no spaces separating the words.

Is there a quick fix for this or any suggestions?

Example:
alesundíamanormalparamímelevantosobrelasochodelamañana desayunasepredesayunoalomismodeayunosquirconceriales yfrutameduchomeevistoacosasenchilavoycaminandosube lacuestahastaelaparadadelautobustyietesperoquevenga autobusesestallevaalaparadadesanlorenzocojoelmetro

code:

```python
# Assumed imports for the snippet below; huggingsound provides both classes.
from huggingsound import SpeechRecognitionModel, KenshoLMDecoder

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-spanish")
lm_path = "language_model/lm.binary"
unigrams_path = "language_model/unigrams.txt"
decoder = KenshoLMDecoder(model.token_set, lm_path=lm_path, unigrams_path=unigrams_path)


def process_single_audio(correct_path, sr=16000):
    # sr is only used by the commented-out librosa load below.
    # y, sr = librosa.load(str(path + correct_path), sr=sr)
    transcriptions = model.transcribe([str(correct_path)[1:]], decoder=decoder)
    print(transcriptions[0]['transcription'])
    return transcriptions[0]['transcription']
```
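One way to narrow this down is to compare the LM-decoded output against huggingsound's default greedy decoding (calling transcribe without a decoder). A minimal sketch, reusing the `model` and `decoder` objects from above; "audio.wav" is a placeholder path:

```python
# Sketch: greedy CTC decoding vs. the KenshoLMDecoder, to see which
# stage drops the spaces. "audio.wav" is a placeholder path.
greedy = model.transcribe(["audio.wav"])                    # no decoder -> greedy decoding
with_lm = model.transcribe(["audio.wav"], decoder=decoder)  # same audio through the LM decoder

print("greedy :", greedy[0]["transcription"])
print("with LM:", with_lm[0]["transcription"])
```

If only the LM-decoded text lacks spaces, the lm.binary/unigrams.txt pair is the likely culprit rather than the acoustic model.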
@santideleon
Author

This problem seems to be fixed by using the automatic-speech-recognition pipeline, both with and without chunking. I'm not really sure what is happening.

code:

```python
# Assumed import; `decoder` is the KenshoLMDecoder created above.
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                model="jonatasgrosman/wav2vec2-xls-r-1b-spanish",
                tokenizer="jonatasgrosman/wav2vec2-xls-r-1b-spanish",
                feature_extractor="jonatasgrosman/wav2vec2-xls-r-1b-spanish",
                decoder=decoder)

transcriptions = pipe(str(correct_path)[1:])
```

Additionally, I tested chunking in the pipeline. My first thought was that there was a problem with the length of the audios, but after testing different chunking parameters, and then without chunking, it worked perfectly either way. The only thing I would note is that chunking significantly increases processing time: I saw runs take from twice as long up to seven times longer. In terms of transcription accuracy, the slowest setting (10 s chunks) seemed to work best, but it is not worth the computation time, since 30 s chunks, which only doubled the processing time, were almost as good.
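For reference, chunked long-form inference is controlled through the pipeline's chunk_length_s and stride_length_s arguments; a minimal sketch of the 10 s vs. 30 s settings compared above, reusing the `pipe` object from the previous snippet (the stride values are illustrative):

```python
# Sketch: chunked inference with the transformers ASR pipeline.
# Only the chunk lengths (10 s vs. 30 s) were compared above;
# the stride values are illustrative.
out_10s = pipe(str(correct_path)[1:], chunk_length_s=10, stride_length_s=2)
out_30s = pipe(str(correct_path)[1:], chunk_length_s=30, stride_length_s=5)
print(out_30s["text"])
```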

@iljab

iljab commented Aug 16, 2022

Same issue using the jonatasgrosman/wav2vec2-large-xlsr-53-german model.

@arikhalperin

You should try to add a language model. See here:
https://huggingface.co/blog/wav2vec2-with-ngram
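For anyone following that link: the post builds a pyctcdecode decoder from a KenLM n-gram and wraps it in Wav2Vec2ProcessorWithLM. A minimal sketch along those lines, assuming a KenLM file you have built or downloaded ("path/to/5gram.arpa" is a placeholder):

```python
# Sketch, following the linked blog post: attach a KenLM n-gram to a
# wav2vec2 processor. "path/to/5gram.arpa" is a placeholder path.
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-xls-r-1b-spanish")
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]  # sort by token id

decoder = build_ctcdecoder(labels=labels, kenlm_model_path="path/to/5gram.arpa")

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
```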

@detongz

detongz commented Jul 27, 2023

@santideleon Hi, I have the same issue using the wbbbbb/wav2vec2-large-chinese-zh-cn model.

Have you solved this problem?
