Thanks @kthui. I don't want to unload the model, since it is still used, just infrequently. Ideally, if a model goes unused for a prolonged period (say 2 hours), its GPU memory would be freed.
@junwang-wish which execution provider in ORT are you using? Are you using TRT or CUDA?
Does your model have dynamic-shaped inputs?
I am transferring the issue to ORT backend team as it seems to be an issue with ORT.
Description
GPU memory leak under high load: GPU memory usage climbs and never comes back down after the high-load requests stop (the memory is never released).
Triton Information
What version of Triton are you using?
23.02
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Run any ONNX model under high load; GPU memory usage increases monotonically.
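To drive the load, a minimal stdlib-only generator against Triton's KServe v2 HTTP API can be sketched as below. The model name `my_onnx_model`, input name `input`, and shape are assumptions; substitute the values for the model under test (or use `perf_analyzer` instead).

```python
# Sketch of a load generator for Triton's KServe v2 HTTP inference API.
# Assumes a model "my_onnx_model" with a single FP32 input named "input";
# adjust the name, input, and shape for the model being tested.
import json
import urllib.request


def build_infer_request(server, model, input_name, shape):
    """Build the URL and JSON body for a v2 /infer request."""
    n = 1
    for d in shape:
        n *= d
    body = {
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": "FP32",
            "data": [0.0] * n,  # dummy payload; content doesn't matter for the leak
        }]
    }
    url = f"http://{server}/v2/models/{model}/infer"
    return url, json.dumps(body).encode()


def run_load(server="localhost:8000", model="my_onnx_model",
             input_name="input", shape=(1, 3, 224, 224), n_requests=10000):
    """Send many back-to-back inference requests to stress GPU memory."""
    url, payload = build_infer_request(server, model, input_name, shape)
    for _ in range(n_requests):
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
```

While this runs, watching `nvidia-smi` shows GPU memory rising; the reported behavior is that it never drops after the loop finishes.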
Expected behavior
When requests stop coming, GPU memory should be released
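One possible mitigation, assuming the leak is ORT's CUDA memory arena growing rather than true leaked allocations: ONNX Runtime supports shrinking the memory arena after each run via the `memory.enable_memory_arena_shrinkage` run option, and the Triton ONNX Runtime backend exposes model-level parameters in `config.pbtxt`. A sketch (the exact parameter key and value format should be verified against the onnxruntime_backend README for your Triton version):

```protobuf
# config.pbtxt fragment (sketch) - ask ORT to shrink its CUDA arena
# back down after each run, on GPU device 0.
parameters {
  key: "memory.enable_memory_arena_shrinkage"
  value: { string_value: "gpu:0" }
}
```

Arena shrinkage trades some per-request allocation overhead for returning unused arena chunks to the device, so idle models stop pinning peak-load memory.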