Hallucination on silence #1724

Open
@pprobst

Hello! In some experiments, I've noticed that for audio files with silence at the end (even ~1 s), whisper.cpp sometimes transcribes "bullshit" text from nonexistent speech. This does not happen when I use the evaluate/predict functions from transformers, or transcribe from whisperx (although the latter uses VAD), which makes me think some parameter or default in whisper.cpp makes it prone to hallucination in these cases. Note that I'm using a fine-tuned base model converted from h5 to ggml.
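To make "silence at the end" measurable when reproducing this, here is a minimal Python sketch that estimates the length of the quiet tail of a clip. It assumes signed 16-bit PCM samples at 16 kHz and uses a plain amplitude cutoff; the function name and the `threshold` value are hypothetical, and this is far cruder than the VAD that whisperx applies.

```python
def trailing_silence_seconds(samples, sample_rate=16000, threshold=500):
    """Return the duration (in seconds) of the quiet tail of `samples`.

    `samples` is a sequence of signed 16-bit PCM values. `threshold` is a
    hypothetical amplitude cutoff chosen for illustration; it is not a value
    taken from whisper.cpp or whisperx.
    """
    quiet = 0
    # Walk backwards until the first sample louder than the threshold.
    for value in reversed(samples):
        if abs(value) > threshold:
            break
        quiet += 1
    return quiet / sample_rate


# Synthetic clip: 1 s of loud samples followed by 1 s of digital silence.
clip = [8000] * 16000 + [0] * 16000
print(trailing_silence_seconds(clip))  # 1.0
```

Running something like this over a test set would show whether the hallucinations correlate with the amount of trailing silence, which is only a hypothesis at this point.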

I'm using the latest 1.5.3 version, but this also happened in 1.5.2.

An example below:

λ ./main -f 1635687465_8386435.ogg -l pt -m ../eval/ggml-model.bin -pc

whisper_init_from_file_with_params_no_state: loading model from '../eval/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 6GB Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =   147.46 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   14.86 MB
whisper_init_state: compute buffer (encode) =   85.99 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

main: processing '1635687465_8386435.wav' (118886 samples, 7.4 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pt, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:06.300]   ponto parágrafo planos musculares com aspecto habitual a faixa etária
[00:00:06.300 --> 00:00:36.300]   subcutâneo de l cinco e l cinco e l cinco l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco


whisper_print_timings:     load time =   116.86 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     9.17 ms
whisper_print_timings:   sample time =   325.28 ms /  1212 runs (    0.27 ms per run)
whisper_print_timings:   encode time =   120.70 ms /     2 runs (   60.35 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   555.86 ms /  1208 runs (    0.46 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1176.76 ms

The first segment, [00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária, is transcribed correctly. After it, though, there is only about 1 s of silence. Once the first segment is done, the program "hangs" for a second and then hallucinates the second segment, whose end timestamp (00:00:36.300) is well past the 7.4 s of audio.
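Two symptoms in the output above can be checked mechanically: a segment that ends after the audio does, and a phrase that repeats many times. The Python sketch below is one possible way to flag both. It assumes the `[start --> end] text` line format printed by `main`, takes the sample count (118886) and 16 kHz rate from the log, and uses hypothetical helper names and a hypothetical `tolerance` value; it is a post-hoc check, not something whisper.cpp does itself.

```python
import re
from collections import Counter

# Matches lines such as "[00:00:06.300 --> 00:00:36.300]   some text".
SEGMENT_RE = re.compile(
    r"\[(\d+):(\d+):(\d+\.\d+) --> (\d+):(\d+):(\d+\.\d+)\]\s*(.*)"
)


def to_seconds(hours, minutes, seconds):
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segment(line):
    """Return (start, end, text) for a transcript line, or None if it does not match."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None
    start = to_seconds(*match.group(1, 2, 3))
    end = to_seconds(*match.group(4, 5, 6))
    return start, end, match.group(7)


def overruns_audio(end, n_samples, sample_rate=16000, tolerance=0.5):
    """True if a segment ends later than the audio itself (plus a small tolerance)."""
    return end > n_samples / sample_rate + tolerance


def most_repeated_trigram(text):
    """Return how often the most frequent 3-word sequence occurs in `text`."""
    words = text.split()
    trigrams = Counter(zip(words, words[1:], words[2:]))
    return max(trigrams.values()) if trigrams else 0


# Shortened version of the second segment from the log above.
line = "[00:00:06.300 --> 00:00:36.300]   subcutâneo de l cinco e l cinco sessenta e l cinco sessenta e l cinco"
start, end, text = parse_segment(line)
print(overruns_audio(end, 118886))  # True: 36.3 s is past the ~7.4 s of audio
print(most_repeated_trigram(text))
```

The first segment passes both checks (it ends at 6.3 s and has no repeated trigram), while the second fails both, which matches what is visible by eye.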

(Note that the audio file being passed is OGG, but in my code I convert it to 16 kHz mono WAV with ffmpeg.)
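The exact conversion command is not shown here. As a point of reference, a typical ffmpeg invocation for 16 kHz mono WAV can be wrapped in Python like this. The helper names are hypothetical, and the 16-bit PCM codec (`pcm_s16le`) is an assumption about the intended output format rather than something stated above.

```python
import subprocess


def ffmpeg_wav16k_mono_args(src, dst):
    """Build the ffmpeg argument list for a 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", src,             # input file (OGG in this issue)
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to a single channel
        "-c:a", "pcm_s16le",   # assumed codec: signed 16-bit little-endian PCM
        dst,
    ]


def convert(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_wav16k_mono_args(src, dst), check=True)


print(" ".join(ffmpeg_wav16k_mono_args("1635687465_8386435.ogg", "1635687465_8386435.wav")))
```

Calling `convert(...)` requires ffmpeg on the `PATH`; building the argument list separately makes it easy to log the exact command used for a failing file.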
