Description
We are running Triton Server with a custom Python model that serves a fine-tuned Llama 3.2 model through vLLM.
Some time after we deploy the server, the Triton request scheduler leaves requests in the "PENDING" state and never moves them to "EXECUTING".
We suspected the stuck PENDING state was caused by a specific request, and that removing the problematic request from the scheduler queue would resolve it, but that did not fix the issue.
Strangely, it always happens about 3 days after a fresh deployment.
Here are the metrics we got from Triton.
Here are the logs (read in bottom-up order).
As the logs show, at first every request goes "INITIALIZED to PENDING" and then "PENDING to EXECUTING",
but after some time requests only go "INITIALIZED to PENDING" and never "PENDING to EXECUTING" (a small script for tallying these transitions is sketched after the log excerpt).
I0214 00:52:25.565106 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:52:25.565155 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:52:25.565171 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:52:25.565248 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:52:25.565291 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:52:15.951979 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:52:07.583815 1 http_server.cc:317] "HTTP request: 0 /metrics"
I0214 00:52:07.120137 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:52:06.529119 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:52:06.529169 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:52:06.529188 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:52:06.529261 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:52:06.529305 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:52:00.953782 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:52:00.953807 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:51:47.475936 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:47.475955 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:47.476034 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:51:47.476085 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:51:47.475897 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:51:45.946497 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:51:37.587125 1 http_server.cc:317] "HTTP request: 0 /metrics"
I0214 00:51:37.101146 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:51:21.133596 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:51:21.133638 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:21.133656 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:21.133734 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:51:21.133780 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:51:19.159371 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:51:19.159419 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:19.159439 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:19.159544 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:51:19.159604 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:51:15.916003 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:51:14.929663 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:51:14.929702 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:14.929715 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:14.929777 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:51:14.929824 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:51:14.911045 1 batch_request_converter.py:12] "batch_size is 1"
I0214 00:51:14.911278 1 batch_request_converter.py:12] "batch_size is 1"
I0214 00:51:14.910122 1 model.py:133] "Total Requests: 1"
I0214 00:51:14.910319 1 infer_response.cc:174] "add response output: output: criticality, type: BYTES, shape: [1]"
I0214 00:51:14.910347 1 http_server.cc:1279] "HTTP using buffer for: 'criticality', size: 16, addr: 0x563cd9a42030"
I0214 00:51:14.910359 1 infer_response.cc:174] "add response output: output: reason, type: BYTES, shape: [1]"
I0214 00:51:14.910371 1 http_server.cc:1279] "HTTP using buffer for: 'reason', size: 131, addr: 0x563cd9c15c30"
I0214 00:51:14.910375 1 infer_response.cc:174] "add response output: output: confidenceLevel, type: BYTES, shape: [1]"
I0214 00:51:14.910380 1 http_server.cc:1279] "HTTP using buffer for: 'confidenceLevel', size: 8, addr: 0x563cd9890830"
I0214 00:51:14.910449 1 http_server.cc:1353] "HTTP release: size 16, addr 0x563cd9a42030"
I0214 00:51:14.910456 1 http_server.cc:1353] "HTTP release: size 131, addr 0x563cd9c15c30"
I0214 00:51:14.910460 1 http_server.cc:1353] "HTTP release: size 8, addr 0x563cd9890830"
I0214 00:51:14.910539 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
I0214 00:51:14.910565 1 python_be.cc:2032] "TRITONBACKEND_ModelInstanceExecute: model instance name llama_90b_finetune_model_0_0 released 1 requests"
I0214 00:51:14.910609 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0214 00:51:14.910615 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0214 00:51:14.910622 1 python_be.cc:1198] "model llama_90b_finetune_model, instance llama_90b_finetune_model_0_0, executing 2 requests"
I0214 00:51:13.206550 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:51:13.206604 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:13.206622 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:51:13.206701 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:51:13.206743 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:51:08.862906 1 http_server.cc:1279] "HTTP using buffer for: 'confidenceLevel', size: 8, addr: 0x563cd9a7e030"
I0214 00:51:08.862988 1 http_server.cc:1353] "HTTP release: size 16, addr 0x563cd9a1b830"
I0214 00:51:08.862996 1 http_server.cc:1353] "HTTP release: size 264, addr 0x563cd9a1b430"
I0214 00:51:08.863000 1 http_server.cc:1353] "HTTP release: size 8, addr 0x563cd9a7e030"
I0214 00:51:08.863081 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
I0214 00:51:08.863132 1 python_be.cc:2032] "TRITONBACKEND_ModelInstanceExecute: model instance name llama_90b_finetune_model_0_0 released 1 requests"
I0214 00:51:08.863172 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0214 00:51:08.863181 1 python_be.cc:1198] "model llama_90b_finetune_model, instance llama_90b_finetune_model_0_0, executing 1 requests"
I0214 00:51:08.863462 1 batch_request_converter.py:12] "batch_size is 1"
I0214 00:51:08.862664 1 model.py:133] "Total Requests: 1"
I0214 00:51:08.862851 1 infer_response.cc:174] "add response output: output: criticality, type: BYTES, shape: [1]"
I0214 00:51:08.862879 1 http_server.cc:1279] "HTTP using buffer for: 'criticality', size: 16, addr: 0x563cd9a1b830"
I0214 00:51:08.862891 1 infer_response.cc:174] "add response output: output: reason, type: BYTES, shape: [1]"
I0214 00:51:08.862897 1 http_server.cc:1279] "HTTP using buffer for: 'reason', size: 264, addr: 0x563cd9a1b430"
I0214 00:51:08.862902 1 infer_response.cc:174] "add response output: output: confidenceLevel, type: BYTES, shape: [1]"
I0214 00:51:07.584388 1 http_server.cc:317] "HTTP request: 0 /metrics"
I0214 00:51:07.070630 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:51:00.953331 1 http_server.cc:4578] "HTTP request: 0 /v2/health/ready"
I0214 00:50:58.551132 1 http_server.cc:4578] "HTTP request: 2 /v2/models/llama_90b_finetune_model/infer"
I0214 00:50:58.551167 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:50:58.551180 1 model_lifecycle.cc:339] "GetModel() 'llama_90b_finetune_model' version -1"
I0214 00:50:58.551249 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
I0214 00:50:58.551301 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0214 00:50:58.533893 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
I0214 00:50:58.533935 1 python_be.cc:2032] "TRITONBACKEND_ModelInstanceExecute: model instance name llama_90b_finetune_model_0_0 released 1 requests"
I0214 00:50:58.533976 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0214 00:50:58.533985 1 python_be.cc:1198] "model llama_90b_finetune_model, instance llama_90b_finetune_model_0_0, executing 1 requests"
I0214 00:50:58.534281 1 batch_request_converter.py:12] "batch_size is 1"
I0214 00:50:58.533800 1 http_server.cc:1353] "HTTP release: size 16, addr 0x563cd9a7e430"
I0214 00:50:58.533806 1 http_server.cc:1353] "HTTP release: size 203, addr 0x563cd98d4430"
I0214 00:50:58.533810 1 http_server.cc:1353] "HTTP release: size 8, addr 0x563cd99fb830"
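
A quick way to confirm the pattern is to tally the state transitions from a saved copy of the log. A minimal sketch (the log filename is a placeholder; it assumes the glog-style lines shown above):

import re
from collections import Counter

# Count "Setting state from X to Y" transitions in a saved Triton log.
# "triton.log" is a placeholder path.
TRANSITION = re.compile(r"Setting state from (\w+) to (\w+)")

counts = Counter()
with open("triton.log") as f:
    for line in f:
        match = TRANSITION.search(line)
        if match:
            counts[match.groups()] += 1

for (src, dst), n in sorted(counts.items()):
    print(f"{src} -> {dst}: {n}")

# A healthy server keeps PENDING -> EXECUTING roughly in step with
# INITIALIZED -> PENDING; a growing gap means requests are piling up
# in the scheduler queue.
gap = counts[("INITIALIZED", "PENDING")] - counts[("PENDING", "EXECUTING")]
print(f"requests stuck in PENDING: {gap}")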
Triton Information
Version 24.07, Python container.
We build our Docker images on top of the official Triton Server image.
To Reproduce
model info: Llama 3.2 90B Vision Instruct
backend: python
llm engine: vllm
it always happens about 3 days after deployment
config
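
As a sanity check, the configuration the running server actually loaded can also be dumped through the standard model-configuration endpoint. A minimal sketch, assuming the default HTTP port and the model name from the logs above:

import json
import requests

# Fetch the loaded model configuration from a running Triton server
# (default HTTP port 8000 assumed).
url = "http://localhost:8000/v2/models/llama_90b_finetune_model/config"
resp = requests.get(url)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))

Things worth checking in the output: the instance_group count (with a single instance, one hung execute() blocks every queued request) and any dynamic_batching queue policy or timeout settings.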
Expected behavior
No stuck requests: every request should transition from PENDING to EXECUTING.
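
For context, a request only moves from PENDING to EXECUTING when a model instance dequeues it, so if the Python backend's execute() blocks indefinitely (for example on a vLLM generation that never completes), every later request stays PENDING. The sketch below illustrates that pattern; it is not our actual model.py, and the model path and tensor names are placeholders loosely based on the outputs visible in the logs:

import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import LLM, SamplingParams

class TritonPythonModel:
    def initialize(self, args):
        # Placeholder model path and sampling parameters.
        self.llm = LLM(model="/models/llama-3.2-90b-vision-instruct")
        self.sampling_params = SamplingParams(max_tokens=512)

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(
                request, "prompt").as_numpy()[0].decode("utf-8")
            # Synchronous generate(): if this call ever hangs, this model
            # instance never returns, so the scheduler keeps every
            # subsequent request in PENDING -- the behavior in the logs.
            outputs = self.llm.generate([prompt], self.sampling_params)
            text = outputs[0].outputs[0].text
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("reason",
                                np.array([text.encode("utf-8")],
                                         dtype=np.object_)),
            ]))
        return responses

If that is what is happening here, a watchdog timeout around the generation call, or a queue timeout in the model's dynamic_batching settings, would at least surface the stuck request instead of letting the queue grow silently.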