faster-whisper integration ? #6816

Closed
ras0k opened this issue Apr 11, 2023 · 119 comments

@ras0k commented Apr 11, 2023:

I did not read the whole thread about Whisper on GPU, but could we avoid a lot of problems with VRAM and speed by switching to faster-whisper, maybe?

@Purfview (Contributor):

How does faster-whisper's speed [on GPU] compare to whisper-ConstMe?

@Purfview (Contributor):

I asked about whisper-ConstMe, not "openai/whisper".
Btw, I find the large model's timestamps way less accurate than medium's, while the transcription is no better.

whisper-ConstMe can use any model.

@ras0k (Author) commented Apr 11, 2023:

Full benchmarks:

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

Benchmark
For reference, here's the time and memory usage that are required to transcribe 13 minutes of audio using different implementations:

openai/whisper@6dea21fd
whisper.cpp@3b010f9
faster-whisper@cce6b53e

Large-v2 model on GPU

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
| --- | --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |

Executed with CUDA 11.7.1 on a NVIDIA Tesla V100S.

Small model on CPU

| Implementation | Precision | Beam size | Time | Max. memory |
| --- | --- | --- | --- | --- |
| openai/whisper | fp32 | 5 | 10m31s | 3101MB |
| whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
| whisper.cpp | fp16 | 5 | 12m39s | 873MB |
| faster-whisper | fp32 | 5 | 2m44s | 1675MB |
| faster-whisper | int8 | 5 | 2m04s | 995MB |

Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.
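For anyone who wants to reproduce these numbers, the faster-whisper Python API the benchmark exercises looks roughly like this (a minimal sketch based on the faster-whisper README; the model name, device and precision mirror the table rows above, and the audio file name is only a placeholder):

```python
# Minimal faster-whisper usage sketch; assumes `pip install faster-whisper`.
from faster_whisper import WhisperModel

# fp16 on GPU, as in the "Large-v2 model on GPU" rows above.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus detection info.
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```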

@Purfview (Contributor):

> how much faster is ConstMe compared to openai/whisper?

If it ran for me, I wouldn't ask.
Why are you posting these pointless posts?

@ras0k (Author) commented Apr 11, 2023:

> Why are you posting these pointless posts?

Because this would help the software? Why would we not integrate faster-whisper? Why are you so adversarial to contributions to an open-source project?

@Purfview (Contributor):

> Because this would help the software? Why would we not integrate faster-whisper? Why are you so adversarial to contributions to an open-source project?

I asked a question, you answered with some irrelevant posts. I'm adversarial to nonsense...

@ras0k (Author) commented Apr 11, 2023:

> I asked a question, you answered with some irrelevant posts. I'm adversarial to nonsense...

I think const-me/whisper is a Windows port of the whisper.cpp implementation,
which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.

@Purfview (Contributor):

> I think const-me/whisper is a Windows port of the whisper.cpp implementation, which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.

Are you a GPT-3 bot tuned on 4chan?

@ras0k (Author) commented Apr 11, 2023:

> How does faster-whisper's speed [on GPU] compare to whisper-ConstMe?

At least 5x faster on CPU and 10x faster if you use GPU.

@Purfview (Contributor):

> At least 5x faster on CPU and 10x faster if you use GPU.

So, a few minutes ago you didn't know what whisper-ConstMe is, and now you are posting "benchmarks" out of your ass...

> how was my post irrelevant? it's a port of whisper.cpp and the benchmark is testing whisper.cpp

If you are not a bot, then clearly one with some mental deficiency.

@Purfview (Contributor):

> the benchmarks are from the repo and i am autistic.

I see... Take ten deep breaths; no need to type more posts.
I'm unsubscribing from this thread.

@rsmith02ct:

I'm not sure why this post devolved into insults instead of mutual understanding.

whisper-ConstMe is a GPU-enabled implementation of Whisper. Does faster-whisper provide any benefits in terms of speed or accuracy or GPU ram usage compared to it? ConstMe is already integrated into SubtitleEdit which is why the question is relevant.

@ras0k (Author) commented Apr 12, 2023:

> I'm not sure why this post devolved into insults instead of mutual understanding.
>
> whisper-ConstMe is a GPU-enabled implementation of Whisper. Does faster-whisper provide any benefits in terms of speed or accuracy or GPU ram usage compared to it? ConstMe is already integrated into SubtitleEdit which is why the question is relevant.

Yes, it does go about 5x faster with the optimizations they provide; that is what the benchmarks I posted show, and they are on the faster-whisper GitHub. You can also use whisper-ctranslate2 directly.

@ras0k (Author) commented Apr 12, 2023:

> or GPU ram usage compared to it?

We also save a lot of VRAM, which means we can run large-v2 on 4 GB GPUs.

@ras0k (Author) commented Apr 12, 2023:

> Btw, I find the large model's timestamps way less accurate than medium's, while the transcription is no better.

Maybe for English medium is fine, but for multilingual use large-v2 is a lot more useful than having to download a specific model for each language.

@rsmith02ct commented Apr 12, 2023 via email

@ras0k (Author) commented Apr 12, 2023:

> Const-Me also has huge speed boosts over CPU-only implementations. I'll assume ConstMe and Faster Whisper are comparable unless someone reports data to the contrary.

Can you provide a benchmark that shows this?

@ras0k (Author) commented Apr 12, 2023:

I am talking about 5x speed GPU vs. GPU, not CPU vs. GPU.

@rsmith02ct commented Apr 12, 2023 via email

@ras0k (Author) commented Apr 12, 2023:

I understand and respect your desire not to ponder on this, but for me a tiny benchmark is completely useless. I am only talking about comparing whisper (in GPU mode) and faster-whisper (also GPU mode) on large-v2, because I believe there will be a use case for a lot of users. I will do my best to provide the benchmarks you asked for soon.

@ras0k (Author) commented Apr 12, 2023:

> Larger models may show more difference but my GPU only has 4GB ram.

You can already try large-v2 on faster-whisper with your GPU; is that not incentive enough to want it?

@rsmith02ct commented Apr 12, 2023 via email

@ras0k (Author) commented Apr 12, 2023:

> How much VRAM do you need for the large v2 model in Faster Whisper? That may limit its interest to users.

3.09 GB

https://huggingface.co/guillaumekln/faster-whisper-large-v2

@ras0k (Author) commented Apr 12, 2023:

> you haven't provided any test data to show how it is better than the current options.

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

I posted the full benchmarks above in my first reply, but I will also try Subtitle Edit and post my results soon.

@rsmith02ct commented Apr 12, 2023 via email

@ras0k (Author) commented Apr 12, 2023:

> Is that the needed VRAM or the model size? I can't run the medium model (1.5gb) on my 4GB GPU FWIW.

Oh sorry, yes, that is the model size. I am not sure about VRAM use; I will try right now, but if you take the time to read the benchmarks I posted, they say 4.8 GB or 3.1 GB depending on fp16 or int8.

Large-v2 model on GPU

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
| --- | --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |

Executed with CUDA 11.7.1 on a NVIDIA Tesla V100S.
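To make the fp16 vs int8 numbers concrete, this is roughly how the precision is chosen when loading the model (a sketch under the assumption that faster-whisper's compute_type argument is what the benchmark varied; the VRAM threshold and file name are illustrative only):

```python
from faster_whisper import WhisperModel

# From the table above: int8 peaks around 3.1 GB of GPU memory vs ~4.8 GB for
# fp16, which is what makes large-v2 feasible on a 4 GB card.
LOW_VRAM_GB = 4
compute_type = "int8" if LOW_VRAM_GB <= 4 else "float16"

model = WhisperModel("large-v2", device="cuda", compute_type=compute_type)
segments, _info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(segment.text)
```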

@ras0k (Author) commented Apr 12, 2023:

> Is that the needed VRAM or the model size? I can't run the medium model (1.5gb) on my 4GB GPU FWIW.

Just FYI, right now I am testing the large model on ConstMe and it uses about 4.2 GB, so medium should run on your 4 GB GPU; medium shows about 2.3 GB usage max.

@ras0k (Author) commented Apr 12, 2023:

my GPU is a 2060 6GB

@ras0k (Author) commented Apr 12, 2023:

For English, medium.en is fine, but for French not even large works, so I really need large-v2 for its multilingual capabilities.

@ras0k (Author) commented Apr 12, 2023:

Do you want me to compare the speed of ConstMe vs faster-whisper on large, just for benchmarking purposes?

@Purfview (Contributor):

> 1:30 min transcription about 4 minutes vs 7 or so for Const-me using large v2 for both. I tested it through SubtitleEdit beta, not from the command line.
>
> Edit: and Japanese works!

Nice! I would be interested in tests on a longer sample and the medium model.
Btw, do you find the results from the large-v2 model valuably better in comparison to medium?

@Purfview (Contributor):

Btw, by default it sets threads to the max number of real cores; that's probably not healthy on CPUs with a lot of cores.
Can someone with such a CPU run the tests and report the optimal threads setting?
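For reference, the standalone's --threads option appears to correspond to faster-whisper's cpu_threads argument; a rough sketch for timing a few thread counts on CPU (the audio file and the timing loop are illustrative, not taken from this thread):

```python
import time
from faster_whisper import WhisperModel

def time_cpu_run(threads: int, audio: str = "sample.wav") -> float:
    """Time one CPU transcription with a fixed number of threads."""
    model = WhisperModel("medium", device="cpu", cpu_threads=threads)
    start = time.perf_counter()
    segments, _info = model.transcribe(audio, beam_size=5)
    for _ in segments:  # segments are generated lazily, so consume them all
        pass
    return time.perf_counter() - start

for n in (2, 4, 8, 14):
    print(f"{n} threads: {time_cpu_run(n):.0f}s")
```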

@rsmith02ct commented Apr 21, 2023 via email

@Purfview (Contributor):

> Longer sample? It was about a 90 minute source video. Yes, for a few short samples I did large does much better with proper names, especially unusual ones.

Oh, I misread it. That's impressive performance you have there. :)

@rsmith02ct commented Apr 21, 2023 via email

@Purfview (Contributor) commented Apr 21, 2023:

> For CPU I need to disable CUDA? Can you give me a command line setting to do what you need re: cores and threads? I have a 14 core CPU (13600K) and don't mind running it at 100% (I was using Subtitle Edit with CPP and running 3 at a time to max out the CPU- about 200W for ~10 hours at a time. No overheating).

Yes, use CPU, and test --threads 14, --threads 10, --threads 8, --threads 4.
Obviously only one task at a time. How much RAM do you have?

EDIT:
"100%" CPU usage doesn't mean optimal usage; programs can have the same or better performance at, for example, 50% CPU usage.

BTW:
Actually I didn't "misread"; you wrote "1:30 min", which means 90 seconds. :)

@rsmith02ct:

Using whisper-ctranslate2.exe --language en --model "medium" --device CPU --no_speech_threshold 0.2 on a 17 min file:

With 14 threads: 258 seconds
With 10 threads: 279 seconds
With 8 threads: 290 seconds
With 4 threads: 292 seconds
No thread limit: 258 seconds
CUDA: 34 seconds

The medium model isn't bad but has trouble with some words like cacao/cocoa and Cote d'Ivoire (the country).
For some runs it seemed to get stuck, taking 3x the time with low CPU usage, even though I wasn't doing anything else on the computer. I did multiple runs and threw out the long ones.

Regarding efficiency, my goal was to load the CPU as much as possible and minimize my own time. Getting it to about 200W left nothing on the table (each instance of CPP seemed to occupy about 4 cores) and it worked away on batch transcriptions in 3 Subtitle Edits as I worked on other things with my laptop. Is it possible each one would have finished somewhat faster had I done them all in serial? Sure. The CPU also heated the room that day; I left the ductless heat pump off.

@Purfview (Contributor) commented Apr 21, 2023:

Interesting...

Please do a 2-threads test.

@rsmith02ct:

Is this necessary? I've done quite a lot of tests. It will probably yield a worse time.

@Purfview (Contributor):

Yes, it is.

@rsmith02ct:

2 threads: first run 472s; second run 477s.

@Purfview (Contributor):

Thanks, looks like I need to set the default to a max of 4 threads.
Interestingly, there is some boost above 8 threads.

@rsmith02ct:

Why limit it at all?

@Purfview (Contributor):

Because it's a waste of electricity for nothing.

@darnn commented Apr 21, 2023:

FWIW, over here, on a file that's just shy of three minutes (2:58), in Hebrew, with the large model:
WhisperDesktop: 59 seconds
The CPU build of Faster-Whisper linked above, using the command line, default settings: 267 seconds
2 threads: 326 seconds
4 threads: 266 seconds
8 threads: 262 seconds
16 threads: 281 seconds

Though I will say that Faster-Whisper was more accurate in three or four words out of that file.

@Purfview (Contributor):

@darnn WhisperDesktop is Const-me, it runs on GPU.

@darnn commented Apr 22, 2023:

Oh, yes, I didn't mean to suggest otherwise, it's just that the last time I tried Whisper-Faster, it wouldn't run the large model on the GPU at all, because I didn't have the 10 GB of memory I needed. But I just tried again with your standalone version, and it does indeed run it. I tried the default, 8 and 16 threads, and with all of them the results were 78-83 seconds. The output was still a little bit more accurate than Whisper-Desktop, but, strangely, a little less accurate than with your CPU build.

@ras0k (Author) commented Apr 24, 2023 via email

@Purfview (Contributor) commented Apr 26, 2023:

> Oh, yes, I didn't mean to suggest otherwise, it's just that the last time I tried Whisper-Faster, it wouldn't run the large model on the GPU at all, because I didn't have the 10 GB of memory I needed.

Thanks for the tests. I think that back then you tested the "OpenAI standalone".

> But I just tried again with your standalone version, and it does indeed run it. I tried the default, 8 and 16 threads, and with all of them the results were 78-83 seconds. The output was still a little bit more accurate than Whisper-Desktop, but, strangely, a little less accurate than with your CPU build.

Some difference between GPU<->CPU results is normal.
Check the new "b103" release: the default threads issue should be fixed, it now supports languages by their full names, debug output was added with --verbose True, and there are a few more parameters. [All parameters should be supported now.]
"b103" means the last commit it was compiled from. I didn't notice a change in performance or results vs b94, at least with the medium model.

> A waste of electricity ? where is the electricity wasted ? either it is computing or it is not, there is no ‘‘waste’’

Yes, in the CPU. That's not how multi-core CPUs work.

@darnn commented Apr 27, 2023:

With the latest GPU release, with the default settings, it processes the same file in 55 seconds! So slightly faster than Whisper-Desktop now. Now, the question is, can I tweak any of the settings that would make it more accurate? I've never tried messing with any of these at all, and so I don't even really know what they are (beam size? something?), but as I said before, the CPU's output was slightly more accurate with its default settings.

@Purfview (Contributor) commented Apr 27, 2023:

You can try to increase --beam_size; I set it to 1 by default.
Higher means slower transcription.

And you can try --vad_filter False. [VAD can skip some lines.]
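In library terms, those two switches appear to map onto arguments of faster-whisper's transcribe() call; a hedged sketch, assuming the library's beam_size and vad_filter parameters are what the standalone's flags wrap:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# beam_size=1 is the fast default described above; a larger beam (e.g. 5)
# may improve accuracy at the cost of slower transcription.
# vad_filter=True lets a voice-activity detector drop non-speech, which can
# also skip real lines, hence trying it disabled here.
segments, _info = model.transcribe(
    "audio.wav",
    beam_size=5,
    vad_filter=False,
)
for segment in segments:
    print(segment.text)
```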

@ras0k (Author) commented Apr 28, 2023 via email

@Purfview (Contributor):

@darnn @rsmith02ct @ras0k
Got rid of PyTorch in the new "r117" release [that's why it's one small executable]. As you reported that CPU was more accurate, I matched the CUDA precision to the CPU one. Added a bit more info when running with --verbose.

Could you test whether it works OK on CUDA? Speed should be similar, maybe slower than the previous version.

@darnn commented May 13, 2023:

Sounds interesting! I won't have time to test thoroughly in the next two or three days, but I will after that!

@rsmith02ct:

Could you share a link to the release so I don't have to look for it?

@Purfview (Contributor):

@darnn commented May 14, 2023:

Well, huh! I still haven't messed with all the different settings (is there anything other than beam_size that might improve accuracy?), but with the default settings, running through CUDA, it gives me the exact same time WhisperDesktop does for the file I used, 56 seconds.

@rsmith02ct:

Retesting with the 17 min file I tested above

From the command line it gives an error but then proceeds:
2023-05-15 14:34:40.1329173 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1671 onnxruntime::python::CreateInferencePybindStateModule] Init provider bridge failed.

--language en --model "medium" --device CPU --no_speech_threshold 0.2
Time: 512s (seemed to get stuck more than once)
Second run: 297s

--language en --model "medium" --device CUDA --no_speech_threshold 0.2
Time: 34s

So basically as fast as before

Through SubtitleEdit beta "ctranslate2" engine
Time: 41s

@rsmith02ct:

Have you tested WhisperX? I can't actually figure out how to install it (too complicated), but I'm interested to see if it's got better timestamps:
https://github.com/m-bain/whisperX
