
async execute is not run concurrently #7888

Open
ShuaiShao93 opened this issue Dec 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@ShuaiShao93

Description
We have a Python BLS model that calls into another model. This BLS model is just a thin wrapper, and we use await infer_request.async_exec(). In this case, the async execute function should be able to handle multiple requests concurrently while it is waiting on async_exec.

However, we noticed a backlog on this BLS model rather than on the actual backend model, which means requests are not being processed concurrently.
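
For illustration, a minimal sketch of this kind of thin wrapper (model and tensor names are placeholders, not the exact model we run):

```python
# Minimal sketch of a thin BLS wrapper using async_exec. "backend_model",
# "INPUT0" and "OUTPUT0" are placeholder names. While execute() awaits
# async_exec(), the backend's event loop should be free to start handling
# other requests; the reported behavior suggests it is not.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            infer_request = pb_utils.InferenceRequest(
                model_name="backend_model",          # placeholder
                requested_output_names=["OUTPUT0"],  # placeholder
                inputs=[pb_utils.get_input_tensor_by_name(request, "INPUT0")],
            )
            infer_response = await infer_request.async_exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            out = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")
            # Copy into a new tensor; assumes the output lives in CPU memory.
            out_tensor = pb_utils.Tensor("OUTPUT0", out.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```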

Triton Information
24.11

To Reproduce

  1. Define a Python model with an async execute that calls into another model. The Python model is very lightweight, while the backend model is much slower.
  2. Start sending concurrent requests with batch_size=1.
  3. Fetch the metrics and check the queue size of each model (see the metrics-polling sketch after this list).
  4. Notice that the queue size of the BLS model keeps increasing, while the queue size of the backend model is always 0.
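
A minimal sketch of step 3, assuming the default metrics port 8002 and the nv_inference_pending_request_count gauge exposed by recent Triton releases (adjust the metric name if your version reports queue depth differently):

```python
# Poll Triton's Prometheus metrics endpoint and print per-model pending-request
# counts. The endpoint and metric name are assumptions, see the note above.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"
PATTERN = re.compile(r'nv_inference_pending_request_count\{(.*?)\}\s+(\S+)')

while True:
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for labels, value in PATTERN.findall(body):
        print(f"{labels}: {value}")
    print("---")
    time.sleep(1)
```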

Expected behavior
If the BLS async model can handle concurrent requests, the backlog should build up on the backend model rather than on the BLS model.

@fighterhit

fighterhit commented Dec 19, 2024

+1. We also encountered this problem in NGC Triton server 23.12. I suspect that the underlying async_exec is not truly asynchronous and blocks the event loop of the Python backend.
I had to switch to the Triton gRPC aio client to call another TensorFlow model on the same Triton server from the BLS Python backend, which temporarily works around the problem. But each gRPC connection gets destroyed because the event loop is closed, so I have to create a new connection on every async def execute() call. Is there a way to reuse the same gRPC connection in asynchronous mode? Ultimately, though, the blocking in infer_request.async_exec() should be fixed.

@Tabrizian @okdimok @oandreeva-nv PTAL, thanks!
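
One way to check this suspicion is a heartbeat task inside execute() (a rough sketch; the output tensor and the 100 ms interval are arbitrary placeholders): if the gaps between ticks grow to roughly the downstream model's latency while the BLS call is in flight, the event loop is indeed being blocked.

```python
# Rough diagnostic sketch: keep a heartbeat ticking while a BLS call is in
# flight. Large gaps between ticks would indicate the event loop is blocked.
# The output tensor below is a placeholder just to return something valid.
import asyncio
import time

import numpy as np
import triton_python_backend_utils as pb_utils


async def _heartbeat(stop_event):
    last = time.perf_counter()
    while not stop_event.is_set():
        await asyncio.sleep(0.1)
        now = time.perf_counter()
        pb_utils.Logger.log_info(f"heartbeat gap: {now - last:.3f}s")
        last = now


class TritonPythonModel:
    async def execute(self, requests):
        stop = asyncio.Event()
        heartbeat = asyncio.create_task(_heartbeat(stop))
        try:
            # ... build infer_request and `await infer_request.async_exec()` here
            await asyncio.sleep(0)  # stand-in for the real BLS call
        finally:
            stop.set()
            await heartbeat
        return [
            pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT0", np.array([True]))]
            )
            for _ in requests
        ]
```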

@oandreeva-nv
Contributor

Thanks for the issue. Is it possible to share some code? This will significantly speed up the debugging process.

@ShuaiShao93
Author

> Thanks for the issue. Is it possible to share some code? This will significantly speed up the debugging process.

I believe you can use this model in non-decoupled mode:

import os

import numpy as np
import triton_python_backend_utils as pb_utils

# multiple_async_bls_square / multiple_async_bls_addsub / async_bls_square /
# async_bls_add_sub are test helpers defined elsewhere; each awaits
# infer_request.async_exec() on another model.


class TritonPythonModel:
    async def execute(self, requests):
        is_decoupled = os.environ["BLS_KIND"] == "decoupled"
        responses = []
        for _ in requests:
            if is_decoupled:
                test1 = await multiple_async_bls_square(gpu=True)
                test2 = await multiple_async_bls_square(gpu=False)
                test3 = await async_bls_square()
            else:
                test1 = await multiple_async_bls_addsub(gpu=True)
                test2 = await multiple_async_bls_addsub(gpu=False)
                test3 = await async_bls_add_sub()
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[
                        pb_utils.Tensor("OUTPUT0", np.array([test1 & test2 & test3]))
                    ]
                )
            )
        return responses

Maybe add a sleep in the backend model to showcase this better.
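
For example, a toy slow backend model could look like this (purely illustrative; the sleep just stands in for real work, and the tensor names are placeholders):

```python
# Toy "slow" backend model: sleeps before echoing the input so that any
# concurrency limit in the BLS wrapper shows up as queueing on the wrapper
# rather than on this model. INPUT0/OUTPUT0 are placeholder tensor names.
import time

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            time.sleep(1.0)  # simulate slow work
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses
```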

@fighterhit

fighterhit commented Dec 21, 2024

Hi @oandreeva-nv, when I used InferRequest.async_exec, I profiled the Python backend process with py-spy. Here is the flame graph.

[py-spy flame graph of the Python backend process]

It can be seen that the CPU is mostly consumed at https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L81 and https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L58. The underlying calls are PyThread_acquire_lock_timed (libpython3.10.so.1.0) and pthread_cond_timedwait (libc.so.6).

I took a quick look at the async_exec implementation in the Python backend.

I am not sure whether it is CPU-intensive and causes the blocking, but when I replaced it with the aio gRPC client, the blocking disappeared. FYI.

import tritonclient.grpc.aio as grpcclient

class TritonPythonModel:
    async def execute(self, requests):
        # ... some logic

        # infer_request.async_exec()  # Blocking. Does not work.

        # Aio gRPC works, but the connection has to be established every time,
        # because the client cannot be reused once its event loop is closed.
        triton_client = grpcclient.InferenceServerClient(url="localhost:10502")
        results = await triton_client.infer(
            model_name="the model in the same triton server",
            inputs=inputs,
            outputs=outputs,
        )
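
To avoid reconnecting on every call, one option I can think of is to cache one aio client per running event loop and only create a new client when the loop changes (a sketch; the URL, dtype, and model/tensor names are placeholders):

```python
# Sketch: cache one aio gRPC client per event loop, assuming the backend may
# run execute() on a new loop after the old one is closed. Entries for closed
# loops are never evicted here, which would need cleanup in production.
import asyncio

import tritonclient.grpc.aio as grpcclient
from tritonclient.grpc import InferInput
import triton_python_backend_utils as pb_utils

_clients = {}  # running event loop -> InferenceServerClient


def _get_client():
    loop = asyncio.get_running_loop()
    if loop not in _clients:
        _clients[loop] = grpcclient.InferenceServerClient(url="localhost:10502")
    return _clients[loop]


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        client = _get_client()
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            grpc_in = InferInput("INPUT0", list(data.shape), "FP32")  # placeholder dtype
            grpc_in.set_data_from_numpy(data)
            result = await client.infer(model_name="backend_model", inputs=[grpc_in])
            out = pb_utils.Tensor("OUTPUT0", result.as_numpy("OUTPUT0"))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```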

@tanmayv25
Contributor

Thanks for the clean reproducer. We will investigate and get back to you.

tanmayv25 added the bug (Something isn't working) label on Jan 24, 2025
@vadimkantorov

vadimkantorov commented Feb 11, 2025

@tanmayv25 By the way, is there an API to submit multiple InferenceRequests to the same model (besides looping over them and calling exec()/async_exec() individually)?

This is a natural use case: for example, we extract a bunch of relevant sentences from a huge text and would like to get BERT embeddings for them. Submitting multiple such inputs to the BERT batcher in one shot (and then letting the batcher/scheduler handle all of them) makes a lot of sense when we can afford to wait for all the responses to arrive; the per-request overhead becomes especially noticeable when these InferRequests are small in number of tokens. (I also elaborated on this scenario in #7928.)
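
The closest thing I can do today, as far as I can tell, is to at least issue the per-sentence BLS requests concurrently with asyncio.gather instead of awaiting them one by one (a sketch; model and tensor names are placeholders):

```python
# Sketch: fire all per-sentence BLS requests at once with asyncio.gather so the
# target model's dynamic batcher can group them. "bert_embedder", "INPUT_IDS",
# and "EMBEDDING" are placeholder names.
import asyncio

import triton_python_backend_utils as pb_utils


async def embed_sentences(token_id_arrays):
    """token_id_arrays: list of numpy arrays, one tokenized sentence each."""
    infer_requests = [
        pb_utils.InferenceRequest(
            model_name="bert_embedder",            # placeholder
            requested_output_names=["EMBEDDING"],  # placeholder
            inputs=[pb_utils.Tensor("INPUT_IDS", ids)],
        )
        for ids in token_id_arrays
    ]
    responses = await asyncio.gather(*(r.async_exec() for r in infer_requests))
    return [
        pb_utils.get_output_tensor_by_name(r, "EMBEDDING").as_numpy()
        for r in responses
    ]
```

This still means one InferenceRequest object and one BLS round trip per sentence, which is the overhead I would like to avoid.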

@vadimkantorov

vadimkantorov commented Feb 20, 2025

@tanmayv25 While async execute is still getting blocked, is there any advice on reusing/preserving an open connection with the gRPC client? (For the case of running a correct stress/load latency test in Python: sending a battery of inputs truly in parallel, with as low overhead as possible, and ideally over the same open connection.)
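
For reference, the kind of test I mean, with a single shared aio client (a sketch; the URL, model name, and tensor name/shape/dtype are placeholders):

```python
# Sketch of a client-side load test that keeps one gRPC connection open and
# fires a batch of requests truly in parallel over it. Connection cleanup is
# omitted for brevity.
import asyncio
import time

import numpy as np
import tritonclient.grpc.aio as grpcclient
from tritonclient.grpc import InferInput

URL = "localhost:8001"  # default gRPC port; adjust as needed
CONCURRENCY = 64


async def one_request(client, data):
    inp = InferInput("INPUT0", list(data.shape), "FP32")  # placeholder tensor
    inp.set_data_from_numpy(data)
    return await client.infer(model_name="model", inputs=[inp])  # placeholder model


async def main():
    data = np.random.rand(1, 16).astype(np.float32)
    client = grpcclient.InferenceServerClient(url=URL)  # reused for all requests
    start = time.perf_counter()
    await asyncio.gather(*(one_request(client, data) for _ in range(CONCURRENCY)))
    print(f"{CONCURRENCY} requests in {time.perf_counter() - start:.3f}s")


asyncio.run(main())
```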
