Added multiprocessing for cpu processing #648
Conversation
The branch was force-pushed from dd68247 to 47e14c8.
Does this have any actual impact on performance? Do you have benchmarks?
Yes! I can send my data and test case later today.
Testing code:

```python
from faster_whisper import WhisperModel, decode_audio
import nvtx

def preprocess_audio(filename):
    model = WhisperModel(
        "large-v3",
        device="cuda",
        device_index=[0],
        compute_type="bfloat16",
        cpu_threads=2,
        num_workers=2,
    )
    ...

def transcribe(model_to_use):
    ...

# this is to clear out memory from the GPUs
if __name__ == "__main__":
    ...
```

Results:

- Overall time to pre-process 20 requests without multicore: 2.7506766319274902 seconds
- Overall time to pre-process 20 requests with multicore: 1.9269721508026123 seconds

Now to test the overhead for a single request:

- Overall time to pre-process 1 request without multicore: 0.21215391159057617 seconds
- Overall time to pre-process 1 request with multicore:

So there's a tradeoff between the per-request overhead of spawning the worker process and the throughput gain under concurrent load.
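Since the test script above is truncated, here is a self-contained sketch of how such a measurement could be reproduced. The audio file name, request count, and the thread-based simulation of simultaneous requests are assumptions, not the author's exact harness:

```python
# Hypothetical timing harness: simulates N simultaneous requests, each
# performing the CPU-bound audio decoding step. "sample.wav" and the
# request count are placeholders, not values from the PR.
import time
from concurrent.futures import ThreadPoolExecutor

from faster_whisper import decode_audio

N_REQUESTS = 20
AUDIO_FILE = "sample.wav"  # placeholder input

def preprocess(_: int):
    # decode_audio is the CPU-bound preprocessing this PR parallelizes
    return decode_audio(AUDIO_FILE, sampling_rate=16000)

if __name__ == "__main__":
    start = time.time()
    with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
        list(pool.map(preprocess, range(N_REQUESTS)))
    elapsed = time.time() - start
    print(f"Overall time to pre-process {N_REQUESTS} requests: {elapsed:.4f} seconds")
```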
@joiemoie, hello. Thanks for an interesting pull request.
That's a pretty significant improvement!

```python
if not isinstance(audio, np.ndarray):
    audio = decode_audio(
        audio, sampling_rate=feature_extractor.sampling_rate
    )

if vad_filter:
    if vad_parameters is None:
        vad_parameters = VadOptions()
    elif isinstance(vad_parameters, dict):
        vad_parameters = VadOptions(**vad_parameters)
```

The overall time was 9.633s after my change. I think the logic in the
Nice! That's not a bad idea. Please don't merge this in for now: I noticed a memory inefficiency, and the pool size needs to be capped or exposed as a parameter. I'm still investigating the memory inefficiency.
@joiemoie, hello. Have you finished your work yet? 😃
```diff
@@ -264,56 +317,43 @@ def transcribe(
           https://github.com/snakers4/silero-vad.
         vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
           parameters and default values in the class `VadOptions`).
+        preprocess_on_multiple_cores: If preprocess_on_multiple_cores is True, multiple
+          CPU based workloads will run on different cores. This will slightly increse overhead
+          for single requests but improve performance for multiple simulatenous requests.
```
typo ^_^ (all looks very interesting!)
Suggested change:

```diff
-          for single requests but improve performance for multiple simulatenous requests.
+          for single requests but improve performance for multiple simultaneous requests.
```
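Based on the docstring in the diff above, usage would presumably look like the sketch below. The `preprocess_on_multiple_cores` flag comes from this PR; the audio path is a placeholder, and the rest is standard faster-whisper API:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# preprocess_on_multiple_cores is the parameter added by this PR;
# "sample.wav" is a placeholder input, not from the PR.
segments, info = model.transcribe(
    "sample.wav",
    vad_filter=True,
    preprocess_on_multiple_cores=True,
)
for segment in segments:
    print(segment.start, segment.end, segment.text)
```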
Because of the Python GIL, the preprocessing can't make efficient use of all the CPU cores from within a single process. By spawning the CPU-bound tasks in their own worker processes, requests arriving on different threads can fully utilize the available cores.
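To make the pattern concrete, here is a minimal sketch of offloading CPU-bound decoding to a process pool so concurrent requests sidestep the GIL. This is not the PR's actual implementation; the pool size, helper name, and audio path are assumptions:

```python
# Minimal sketch: run decode_audio in worker processes, each with its own
# interpreter and GIL, instead of on the caller's thread.
import os
from concurrent.futures import ProcessPoolExecutor

from faster_whisper import decode_audio

def offloaded_decode(pool: ProcessPoolExecutor, path: str):
    # The CPU-bound work happens in a worker process; result() blocks the
    # calling thread but not other threads' workers.
    return pool.submit(decode_audio, path, sampling_rate=16000).result()

if __name__ == "__main__":
    # The __main__ guard matters on spawn-based platforms (Windows, and
    # macOS by default), where workers re-import this module.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        samples = offloaded_decode(pool, "sample.wav")  # placeholder path
        print(samples.shape, samples.dtype)
```

One design note: a process pool pays a one-time spawn cost plus per-call pickling of arguments and results, which matches the benchmark above showing a small penalty for a single request but a clear win for many simultaneous ones.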