Lack of GPU Parallelism for Real-Time Server Using Faster-Whisper #1192

Open
dariopellegrino00 opened this issue Dec 6, 2024 · 5 comments

Comments

@dariopellegrino00

Hi, I'm currently working on my thesis, which involves building a real-time transcription server using the whisper-streaming project with faster-whisper as the ASR backend. The server is deployed on an RTX 6000 Ada GPU, but I am struggling to achieve proper GPU parallelism.
I am relatively new to using Whisper and have only recently started using Python. I appreciate your patience and any guidance you can provide!

What I Tried

Multiple Models on Multiple Threads:

  • I instantiated multiple WhisperModel instances (one per thread) and assigned each client to its own model. While this approach works for a few clients, performance degrades significantly beyond ~8 clients, regardless of the model size. From what I can observe, the models appear to be competing with each other for the entire GPU.

Single Shared Model with num_workers:

  • I shared a single WhisperModel instance among multiple threads and used the num_workers parameter to enable concurrent processing. This approach also works well initially but similarly fails to handle more than ~8 clients effectively, with the same symptoms.

Questions:

  1. Is there a way to achieve true GPU parallelism for multiple audio sources on a single GPU using faster-whisper?
  2. Does the num_workers parameter have any impact on GPU-based inference, or is it exclusively for CPU execution?
  3. Are there recommended configurations or best practices for maximizing GPU utilization in scenarios with multiple concurrent audio streams?

Any advice or clarification would be greatly appreciated. Thank you for your amazing work on this project!
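
For readers following along, the single-shared-model setup described above can be sketched roughly as below. This is only a minimal illustration, not the code from the thesis: `transcribe_streams`, `fake_transcribe`, and the `bytes`-in, `str`-out callable signature are all hypothetical names chosen for the example, and the stand-in function replaces the real model so the sketch runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical signature: takes one audio chunk, returns its transcript.
TranscribeFn = Callable[[bytes], str]


def transcribe_streams(
    transcribe: TranscribeFn,
    chunks_by_client: Dict[str, List[bytes]],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Transcribe each client's chunks in order, with clients running in parallel.

    Every thread calls the same `transcribe` callable, i.e. one shared model.
    """

    def run_client(chunks: List[bytes]) -> List[str]:
        # Chunks of a single client stay sequential so the text order is kept.
        return [transcribe(chunk) for chunk in chunks]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            client: pool.submit(run_client, chunks)
            for client, chunks in chunks_by_client.items()
        }
        return {client: fut.result() for client, fut in futures.items()}


def fake_transcribe(chunk: bytes) -> str:
    # Stand-in so the sketch runs without a GPU or a model download.
    return f"<{len(chunk)} bytes>"


if __name__ == "__main__":
    demo = {"client-a": [b"\x00" * 4, b"\x00" * 8], "client-b": [b"\x00" * 2]}
    print(transcribe_streams(fake_transcribe, demo, max_workers=2))
```

With faster-whisper, `transcribe` would presumably wrap a call to `model.transcribe(...)` on a `WhisperModel` created with `device="cuda"` and `num_workers` set above 1, joining the text of the returned segments. Matching `max_workers` to `num_workers` is an assumption to be measured, not a documented rule.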

@heimoshuiyu
Contributor

Hello, I wrote a very simple FastAPI script to run the faster-whisper module. The script is available here: https://github.com/heimoshuiyu/whisper-fastapi

I use Docker to deploy my service. I have 4 RTX 4070 Ti Super GPUs, and I deploy 2 services on each GPU. So in total, I have 8 services. Each service is mapped to one client, and I set up the Grafana GPU monitor, which indicates all GPUs are utilized at 100%.

I am using the large-v2 model, and almost all of my transcription tasks involve audio longer than 1 to 3 minutes, so I don't think the feature-extraction preprocessing wastes much GPU time. In my case, two services per GPU is enough. You might consider running more services per GPU if you are transcribing shorter audio.
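
The layout described above (4 GPUs, 2 services each, 8 services in total) could be planned along these lines. This is a rough sketch rather than the actual deployment: `plan_services`, the `PORT` variable, and the base port are hypothetical, while `CUDA_VISIBLE_DEVICES` is the standard CUDA variable that limits which GPUs a process can see.

```python
from typing import Dict, List


def plan_services(
    num_gpus: int = 4,
    services_per_gpu: int = 2,
    base_port: int = 8000,  # hypothetical port numbering
) -> List[Dict[str, str]]:
    """Return one dict of environment overrides per service, pinning it to one GPU."""
    plans = []
    for service_id in range(num_gpus * services_per_gpu):
        # Consecutive services share a GPU: 0,0,1,1,2,2,3,3 for the defaults.
        gpu_index = service_id // services_per_gpu
        plans.append(
            {
                # Restricts the process to the listed GPU.
                "CUDA_VISIBLE_DEVICES": str(gpu_index),
                # Hypothetical variable the service would read its port from.
                "PORT": str(base_port + service_id),
            }
        )
    return plans


if __name__ == "__main__":
    for service_id, env in enumerate(plan_services()):
        print(service_id, env)
```

Each dict could be merged into the environment of one process (for example `env={**os.environ, **plan}` with `subprocess.Popen`) or passed as `-e` flags to one container. Raising `services_per_gpu` is the knob to try for shorter audio.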

@dariopellegrino00
Author

Thank you very much for the response. I will take a look at your solution as soon as I can.

@MahmoudAshraf97
Collaborator

@heimoshuiyu Unfortunately, that script is not utilizing the GPUs correctly, even if it shows 100% utilization.
The correct way to utilize multiple GPUs is to run a single model instance with multiple device indices and call the model.model.generate function asynchronously.
This is not implemented in faster-whisper because it needs batching to actually saturate the GPU, so sequential transcription will not benefit, and batched transcription needs the encoder output, which cannot utilize multiple GPUs efficiently. I'll try to think of a good implementation that uses multiple GPUs efficiently.
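
To illustrate the point about batching, here is one generic way requests from several clients could be grouped into a single batched call. It is a sketch of micro-batching in general, not something faster-whisper provides: `MicroBatcher`, `batch_fn`, and the default batch size and wait time are all hypothetical, and the batched model call itself is left as a stand-in.

```python
import queue
import threading
import time
from concurrent.futures import Future
from typing import Callable, List, Tuple

# Hypothetical batched model call: a list of audio chunks in, one text per chunk out.
BatchFn = Callable[[List[bytes]], List[str]]


class MicroBatcher:
    """Collects requests from many threads and runs them as one batch."""

    def __init__(self, batch_fn: BatchFn, max_batch: int = 8, max_wait_s: float = 0.05):
        self._batch_fn = batch_fn
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: "queue.Queue[Tuple[bytes, Future]]" = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, chunk: bytes) -> Future:
        """Queue one chunk; the returned future resolves to its transcript."""
        fut: Future = Future()
        self._queue.put((chunk, fut))
        return fut

    def _loop(self) -> None:
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._max_wait_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                results = self._batch_fn([chunk for chunk, _ in batch])
            except Exception as exc:
                # Propagate the failure to every caller waiting on this batch.
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for (_, fut), text in zip(batch, results):
                fut.set_result(text)


if __name__ == "__main__":
    def fake_batch(chunks: List[bytes]) -> List[str]:
        # Stand-in for a real batched model call.
        return [f"<{len(c)} bytes>" for c in chunks]

    batcher = MicroBatcher(fake_batch, max_batch=4, max_wait_s=0.1)
    futures = [batcher.submit(b"\x00" * n) for n in (2, 4, 6)]
    print([f.result(timeout=5) for f in futures])
```

The trade-off is latency against throughput: a longer `max_wait_s` fills larger batches but delays every request in them, which matters for a real-time server.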

@dariopellegrino00
Author

Hi @MahmoudAshraf97, sorry to bother you. Do you have any suggestions on how to efficiently transcribe multiple audio sources in parallel using Faster Whisper? I’d appreciate any insights or recommendations.

@Micla-SHL

Hi @dariopellegrino00, you wrote: "I instantiated multiple WhisperModel instances (one per thread) and assigned each client to its own model. While this approach works for a few clients, performance degrades significantly beyond ~8 clients, regardless of the model size. From what I can observe, the models appear to be competing with each other for the entire GPU." How did you do this? My understanding of the performance issue is that your GPU's tokens/s throughput determines how many instances you can actually run.
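
The tokens/s reasoning above can be written as a back-of-the-envelope bound. This is a deliberately crude sketch: `max_realtime_streams` is a hypothetical helper, the numbers in the demo are made up purely for illustration, and the estimate ignores encoder cost and any re-decoding of overlapping audio buffers, so real capacity would likely be lower.

```python
import math


def max_realtime_streams(gpu_tokens_per_s: float, tokens_per_audio_s: float) -> int:
    """Upper bound on streams whose decoding can keep up with real time.

    gpu_tokens_per_s: aggregate decoding throughput of the GPU.
    tokens_per_audio_s: tokens one stream produces per second of audio.
    """
    if gpu_tokens_per_s < 0 or tokens_per_audio_s <= 0:
        raise ValueError("throughput must be >= 0 and per-stream rate must be > 0")
    return math.floor(gpu_tokens_per_s / tokens_per_audio_s)


if __name__ == "__main__":
    # Illustrative, made-up figures: 400 tokens/s total, 5 tokens per audio second.
    print(max_realtime_streams(400, 5))
```

Under this view, adding more model instances on the same GPU does not raise the bound; it only splits the same throughput among more workers.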
