async execute is not run concurrently #7888
Comments
+1. We also encountered this problem in NGC Triton Server 23.12. I suspect the underlying implementation is the bottleneck. @Tabrizian @okdimok @oandreeva-nv PTAL, thanks!
Thanks for the issue. Is it possible to share some code? That would significantly speed up the debugging process.
I believe you can use this model in non-decoupled mode: server/qa/python_models/bls_async/model.py, lines 228 to 249 at commit 0194c3d.
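Since the embedded snippet does not render here, below is a minimal sketch of what a non-decoupled async BLS model along those lines can look like. It is not the exact qa model; `downstream_model`, `INPUT0`, and `OUTPUT0` are placeholder names.

```python
# model.py -- minimal async BLS wrapper (placeholder names, not the exact qa model)
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            # Forward the incoming tensor to the downstream model via BLS.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            infer_request = pb_utils.InferenceRequest(
                model_name="downstream_model",          # placeholder
                requested_output_names=["OUTPUT0"],     # placeholder
                inputs=[input_tensor],
            )
            # Awaiting here should allow other requests to make progress.
            infer_response = await infer_request.async_exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            output_tensor = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses
```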
Hi @oandreeva-nv, when I tried this, most of the CPU time is consumed at https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L81 and https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L58. The underlying calls are PyThread_acquire_lock_timed (libpython3.10.so.1.0) and pthread_cond_timedwait (libc.so.6). I also took a quick look at the relevant code.
I am not sure whether it is CPU-intensive work that causes the blocking, but when I replaced it with the aio gRPC client, the blocking disappeared. FYI.
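For reference, a minimal sketch of what a call through tritonclient's asyncio gRPC client (what I understand by "aio gRPC client") can look like. The server address, model name, and tensor names below are placeholders, assuming gRPC is exposed on localhost:8001.

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient


async def infer_once(client: grpcclient.InferenceServerClient) -> np.ndarray:
    data = np.zeros((1, 16), dtype=np.float32)                 # placeholder input
    infer_input = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    result = await client.infer(
        model_name="downstream_model",                         # placeholder
        inputs=[infer_input],
        outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],  # placeholder
    )
    return result.as_numpy("OUTPUT0")


async def main():
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    try:
        print(await infer_once(client))
    finally:
        await client.close()


asyncio.run(main())
```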
Thanks for the clean reproducer. We will investigate and get back to you.
@tanmayv25 Btw, is there an API to submit multiple InferRequests in one call? This is a natural use case: e.g., we extract a bunch of relevant sentences from a huge text and would like to get BERT embeddings for them. Submitting such multiple inputs to the BERT batcher in one shot (and then letting the batcher/scheduler handle all of them) makes a lot of sense when we can afford to wait for all the responses to arrive. This is especially interesting when these InferRequests turn out to be small in number of tokens, so the per-request overhead can become sizable. (I also elaborated a bit on this scenario in #7928.)
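I am not aware of a dedicated batch-submit BLS API, but one pattern is to fan the requests out with asyncio.gather from an async execute and let the downstream model's dynamic batcher group them. A minimal sketch with placeholder names (`bert_embedder`, `EMBEDDING`):

```python
import asyncio

import triton_python_backend_utils as pb_utils


async def embed_sentences(sentence_tensors):
    """Fan out one BLS InferenceRequest per sentence and await them together.

    Must be called from inside an async execute() of a BLS model.
    """
    infer_requests = [
        pb_utils.InferenceRequest(
            model_name="bert_embedder",              # placeholder
            requested_output_names=["EMBEDDING"],    # placeholder
            inputs=[tensor],
        )
        for tensor in sentence_tensors
    ]
    # async_exec() returns an awaitable, so gather() issues all requests
    # concurrently and the downstream scheduler can batch them.
    responses = await asyncio.gather(*(r.async_exec() for r in infer_requests))
    return [pb_utils.get_output_tensor_by_name(resp, "EMBEDDING") for resp in responses]
```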
@tanmayv25 Meanwhile, while async execute is getting blocked, is there any advice on reusing/preserving an open connection with the gRPC client? (For the case of running a correct stress/load latency test in Python: sending a battery of inputs truly in parallel, with as low overhead as possible, and ideally over the same open connection.)
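Not an authoritative answer, but with the asyncio gRPC client a single client (one gRPC channel) can be shared by many in-flight requests. A sketch of that load-test pattern, with placeholder model/tensor names and gRPC assumed on localhost:8001:

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient


async def load_test(num_requests: int = 100):
    # One client / one channel reused by all concurrent requests.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    try:
        async def one_call():
            data = np.random.rand(1, 16).astype(np.float32)        # placeholder input
            inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
            inp.set_data_from_numpy(data)
            return await client.infer(model_name="bls_model",      # placeholder
                                      inputs=[inp])

        await asyncio.gather(*(one_call() for _ in range(num_requests)))
    finally:
        await client.close()


asyncio.run(load_test())
```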
Description
We have a Python BLS model that calls into another model. The BLS model is just a thin wrapper, and we use `await infer_request.async_exec()`. In this case, the async execute function should handle multiple requests concurrently while it is waiting on async_exec. However, we notice a backlog on this BLS model rather than on the actual backend model, which means requests are not being processed concurrently.
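One way to confirm where the backlog builds is to compare per-model queue statistics. A sketch using the HTTP client's statistics call; `bls_wrapper` and `backend_model` are placeholder names, and the exact field layout may vary by Triton version:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Compare cumulative queue time between the BLS wrapper and the backend model.
# If the wrapper's queue keeps growing while the backend's stays flat, requests
# are being serialized in the BLS model instead of fanned out to the backend.
for model_name in ("bls_wrapper", "backend_model"):          # placeholders
    stats = client.get_inference_statistics(model_name=model_name)
    queue = stats["model_stats"][0]["inference_stats"]["queue"]
    print(model_name, "queued count:", queue["count"], "total queue ns:", queue["ns"])
```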
Triton Information
24.11
To Reproduce
Expected behavior
If the BLS async model handled concurrent requests, the backlog should appear on the backend model rather than on the BLS model.