Thanks @kthui. I don't want to unload the model, since it is still used, just infrequently. Ideally, if a model goes unused for a prolonged period (say 2 hours), its GPU memory would be freed.
@junwang-wish which execution provider in ORT are you using? Are you using TRT or CUDA?
Does your model have dynamic-shaped inputs?
I am transferring the issue to ORT backend team as it seems to be an issue with ORT.
Description
GPU memory leak under high load: GPU memory usage climbs and never comes back down after the high-load requests stop (the memory is never released).
Triton Information
What version of Triton are you using?
23.02
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Run any ONNX model under high load; GPU memory usage increases monotonically.
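To drive the load, a minimal stdlib-only generator against Triton's KServe v2 HTTP API can be sketched as below. The model name `my_onnx_model`, input name `input`, and shape are assumptions; substitute the values for the model under test (or use `perf_analyzer` instead).

```python
# Sketch of a load generator for Triton's KServe v2 HTTP inference API.
# Assumes a model "my_onnx_model" with a single FP32 input named "input";
# adjust the name, input, and shape for the model being tested.
import json
import urllib.request


def build_infer_request(server, model, input_name, shape):
    """Build the URL and JSON body for a v2 /infer request."""
    n = 1
    for d in shape:
        n *= d
    body = {
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": "FP32",
            "data": [0.0] * n,  # dummy payload; content doesn't matter for the leak
        }]
    }
    url = f"http://{server}/v2/models/{model}/infer"
    return url, json.dumps(body).encode()


def run_load(server="localhost:8000", model="my_onnx_model",
             input_name="input", shape=(1, 3, 224, 224), n_requests=10000):
    """Send many back-to-back inference requests to stress GPU memory."""
    url, payload = build_infer_request(server, model, input_name, shape)
    for _ in range(n_requests):
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
```

While this runs, watching `nvidia-smi` shows GPU memory rising; the reported behavior is that it never drops after the loop finishes.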
Expected behavior
When requests stop coming, GPU memory should be released
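One possible mitigation, assuming the leak is ORT's CUDA memory arena growing rather than true leaked allocations: ONNX Runtime supports shrinking the memory arena after each run via the `memory.enable_memory_arena_shrinkage` run option, and the Triton ONNX Runtime backend exposes model-level parameters in `config.pbtxt`. A sketch (the exact parameter key and value format should be verified against the onnxruntime_backend README for your Triton version):

```protobuf
# config.pbtxt fragment (sketch) - ask ORT to shrink its CUDA arena
# back down after each run, on GPU device 0.
parameters {
  key: "memory.enable_memory_arena_shrinkage"
  value: { string_value: "gpu:0" }
}
```

Arena shrinkage trades some per-request allocation overhead for returning unused arena chunks to the device, so idle models stop pinning peak-load memory.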