async execute is not run concurrently #7888
Comments
+1. We also encountered this problem in NGC Triton Server 23.12. I suspect the underlying implementation is the bottleneck. @Tabrizian @okdimok @oandreeva-nv PTAL, thanks!
Thanks for the issue. Is it possible to share some code? That would significantly speed up the debugging process.
I believe you can use this model in non-decoupled mode: server/qa/python_models/bls_async/model.py, lines 228 to 249 at commit 0194c3d.
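Since the embedded snippet does not render here, below is a minimal sketch of what a non-decoupled async BLS model along those lines can look like. It is not the exact qa model; `downstream_model`, `INPUT0`, and `OUTPUT0` are placeholder names.

```python
# model.py -- minimal async BLS wrapper (placeholder names, not the exact qa model)
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            # Forward the incoming tensor to the downstream model via BLS.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            infer_request = pb_utils.InferenceRequest(
                model_name="downstream_model",          # placeholder
                requested_output_names=["OUTPUT0"],     # placeholder
                inputs=[input_tensor],
            )
            # Awaiting here should allow other requests to make progress.
            infer_response = await infer_request.async_exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            output_tensor = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses
```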
Hi @oandreeva-nv, when I tried this, most of the CPU time is consumed at https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L81 and https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L58. The underlying calls are PyThread_acquire_lock_timed (libpython3.10.so.1.0) and pthread_cond_timedwait (libc.so.6). I also took a quick look at the relevant code.
I am not sure whether it is CPU-intensive work that causes the blocking, but when I replaced it with the aio gRPC client, the blocking disappeared. FYI.
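For reference, a minimal sketch of what a call through tritonclient's asyncio gRPC client (what I understand by "aio gRPC client") can look like. The server address, model name, and tensor names below are placeholders, assuming gRPC is exposed on localhost:8001.

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient


async def infer_once(client: grpcclient.InferenceServerClient) -> np.ndarray:
    data = np.zeros((1, 16), dtype=np.float32)                 # placeholder input
    infer_input = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    result = await client.infer(
        model_name="downstream_model",                         # placeholder
        inputs=[infer_input],
        outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],  # placeholder
    )
    return result.as_numpy("OUTPUT0")


async def main():
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    try:
        print(await infer_once(client))
    finally:
        await client.close()


asyncio.run(main())
```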
Thanks for the clean reproducer. We will investigate and get back to you.
@tanmayv25 Btw, is there an API to submit multiple InferRequests in one call? This is a natural use case: e.g., we extract a bunch of relevant sentences from a huge text and would like to get BERT embeddings for them. Submitting such multiple inputs to the BERT batcher in one shot (and then letting the batcher/scheduler handle all of them) makes a lot of sense when we can afford to wait for all the responses to arrive. This is especially interesting when these InferRequests turn out to be small in number of tokens, so the per-request overhead can become sizable. (I also elaborated a bit on this scenario in #7928.)
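I am not aware of a dedicated batch-submit BLS API, but one pattern is to fan the requests out with asyncio.gather from an async execute and let the downstream model's dynamic batcher group them. A minimal sketch with placeholder names (`bert_embedder`, `EMBEDDING`):

```python
import asyncio

import triton_python_backend_utils as pb_utils


async def embed_sentences(sentence_tensors):
    """Fan out one BLS InferenceRequest per sentence and await them together.

    Must be called from inside an async execute() of a BLS model.
    """
    infer_requests = [
        pb_utils.InferenceRequest(
            model_name="bert_embedder",              # placeholder
            requested_output_names=["EMBEDDING"],    # placeholder
            inputs=[tensor],
        )
        for tensor in sentence_tensors
    ]
    # async_exec() returns an awaitable, so gather() issues all requests
    # concurrently and the downstream scheduler can batch them.
    responses = await asyncio.gather(*(r.async_exec() for r in infer_requests))
    return [pb_utils.get_output_tensor_by_name(resp, "EMBEDDING") for resp in responses]
```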
@tanmayv25 Meanwhile, while async execute is getting blocked, is there any advice on reusing/preserving an open connection with the gRPC client? (For the case of running a correct stress/load latency test in Python: sending a battery of inputs truly in parallel, with as low overhead as possible, and ideally over the same open connection.)
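Not an authoritative answer, but with the asyncio gRPC client a single client (one gRPC channel) can be shared by many in-flight requests. A sketch of that load-test pattern, with placeholder model/tensor names and gRPC assumed on localhost:8001:

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient


async def load_test(num_requests: int = 100):
    # One client / one channel reused by all concurrent requests.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    try:
        async def one_call():
            data = np.random.rand(1, 16).astype(np.float32)        # placeholder input
            inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
            inp.set_data_from_numpy(data)
            return await client.infer(model_name="bls_model",      # placeholder
                                      inputs=[inp])

        await asyncio.gather(*(one_call() for _ in range(num_requests)))
    finally:
        await client.close()


asyncio.run(load_test())
```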
Description
We have a Python BLS model that calls into another model. The BLS model is just a thin wrapper, and we use `await infer_request.async_exec()`. In this case, the async execute function should handle multiple requests concurrently while it is waiting on async_exec. However, we notice a backlog on this BLS model rather than on the actual backend model, which means requests are not being processed concurrently.
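One way to confirm where the backlog builds is to compare per-model queue statistics. A sketch using the HTTP client's statistics call; `bls_wrapper` and `backend_model` are placeholder names, and the exact field layout may vary by Triton version:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Compare cumulative queue time between the BLS wrapper and the backend model.
# If the wrapper's queue keeps growing while the backend's stays flat, requests
# are being serialized in the BLS model instead of fanned out to the backend.
for model_name in ("bls_wrapper", "backend_model"):          # placeholders
    stats = client.get_inference_statistics(model_name=model_name)
    queue = stats["model_stats"][0]["inference_stats"]["queue"]
    print(model_name, "queued count:", queue["count"], "total queue ns:", queue["ns"])
```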
Triton Information
24.11
To Reproduce
Expected behavior
If the BLS async model handled concurrent requests, the backlog should appear on the backend model rather than on the BLS model.