Monitoring actual GPU memory usage #1407
Comments
@rmothukuru Example of the issue: TensorFlow allocated 6 GB of GPU memory, and later I loaded two models into it. How can I know how much of this 6 GB is used by the loaded models and how much is free?
Hi there, we can easily export metrics that tell you host memory consumption on a per-model basis, but I think you're specifically looking for the GPU's memory consumption/availability, correct?
Hi, yes, I am looking for a way to check the GPU's memory availability. Such a feature would be great.
We also ran into this problem. TF Serving occupies all GPU memory when it starts, and there is no way to know how much memory a specific model really needs. If we deploy too many models in one server instance, it sometimes hangs and stops responding, and all connections to it time out. So, for multiple models, we need to do a lot of load testing to decide which models can be deployed together in one instance and which need to go in another.
@ctuluhu @troycheng @unclepeddy One way to mitigate the problem is to use the environment flag
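The flag name in the comment above was not preserved in this thread. As a hedged sketch (not necessarily the flag the commenter meant), TensorFlow does honor the TF_FORCE_GPU_ALLOW_GROWTH environment variable, which makes the allocator grow GPU memory on demand instead of reserving nearly all of it at startup; the equivalent programmatic switch for in-process TensorFlow is shown as well:

```python
# Hedged sketch: the flag named in the comment above is not preserved here.
# TF_FORCE_GPU_ALLOW_GROWTH is one real TensorFlow option: it makes the GPU
# allocator grow on demand instead of reserving (almost) all memory at startup,
# which helps when several processes/models share one GPU.
import os
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # set before TF touches the GPU

import tensorflow as tf

# Equivalent programmatic switch for in-process TensorFlow (TF 2.x):
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```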
Please also see this Stack Overflow question about how to monitor memory usage using memory_stat ops and run_metadata.
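A minimal sketch of the run_metadata approach mentioned above, using TF 1.x-style APIs through tf.compat.v1 (the graph, device name, and tensor shapes are illustrative): tracing a session run records per-node allocator statistics, including peak bytes per device.

```python
# Sketch of the RunMetadata approach: tracing a session run records per-node
# allocator stats (total and peak bytes) for each device in step_stats.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

with tf.device("/device:GPU:0"):          # illustrative device name
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

config = tf.ConfigProto(allow_soft_placement=True)  # fall back to CPU if no GPU
with tf.Session(config=config) as sess:
    sess.run(b, options=run_options, run_metadata=run_metadata)
    for dev_stats in run_metadata.step_stats.dev_stats:
        for node_stats in dev_stats.node_stats:
            for mem in node_stats.memory:
                print(dev_stats.device, node_stats.node_name,
                      mem.allocator_name, "peak_bytes:", mem.peak_bytes)
```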
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
Is there any update on this issue? In TF 2.2 I still don't see an easy way to measure actual and peak memory usage.
Why close?
I believe this can be considered a basic solution to the problem. But GPU memory usage cannot be fully separated per loaded model, since part of it is consumed by things like the CUDA context, which is shared among loaded models. Meanwhile, it seems there should be a limit on each model's GPU memory growth, related to the model's parameters and the max batch size set in TF Serving. Also,
Any workaround for this?
We still don't see an easy way to monitor GPU memory usage. Is there any progress?
It's been 2 years and 4 days, and we still don't have any update on one of the most vital parts. Great!
Sorry for the late reply. Could you try the memory profiling tool to see if it helps: https://www.tensorflow.org/guide/profiler#memory_profile_tool?
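A minimal sketch of capturing such a profile (TF 2.2+; the log directory and the traced ops are placeholders), after which the "Memory Profile" tab can be opened in TensorBoard:

```python
# Sketch: capture a profile with the tool linked above, then inspect the
# "Memory Profile" tab in TensorBoard.
import tensorflow as tf

tf.profiler.experimental.start("/tmp/tf_profile_logdir")

x = tf.random.normal([2048, 2048])   # stand-in for real inference/training steps
y = tf.matmul(x, x)

tf.profiler.experimental.stop()
# Then: tensorboard --logdir /tmp/tf_profile_logdir  -> Profile -> memory_profile
```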
@guanxinq I think people are more interested in something from the
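For reference, TF Serving can already expose its existing metrics over a Prometheus endpoint when started with a monitoring config; the sketch below (host, port, and path are assumptions for a default REST setup) simply scrapes that endpoint and filters for memory-related lines. It does not provide the per-model GPU breakdown this issue asks for.

```python
# Sketch (assumptions: the server was started with a monitoring config that
# enables the Prometheus endpoint, REST port 8501, and the commonly documented
# path /monitoring/prometheus/metrics). This only scrapes metrics TF Serving
# already exports; it does not add per-model GPU memory metrics.
import requests  # third-party: pip install requests

resp = requests.get("http://localhost:8501/monitoring/prometheus/metrics")
resp.raise_for_status()
for line in resp.text.splitlines():
    if "memory" in line.lower():
        print(line)
```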
Describe the problem the feature is intended to solve
I have several models loaded and am not sure how to tell whether TensorFlow still has some GPU memory left. With nvidia-smi I can check how much memory TensorFlow has allocated in total, but I couldn't find a way to check how much the loaded models use.
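For illustration, a minimal sketch of that nvidia-smi check done programmatically; it reports only device-level usage, not a per-model breakdown.

```python
# Sketch of the nvidia-smi check described above: it shows how much GPU memory
# is reserved on the device (mostly by TensorFlow Serving), but gives no
# per-model breakdown.
import subprocess

out = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]).decode()

for line in out.strip().splitlines():          # one line per GPU
    used_mb, total_mb = (int(v) for v in line.split(","))
    print(f"GPU memory: {used_mb} MiB used of {total_mb} MiB total")
```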
Describe the solution
TensorFlow could provide some metrics for Prometheus about the actual GPU memory usage of each loaded model.
Describe alternatives you've considered
None.
Additional context
I am not sure whether this is actually a feature request or whether it can already be done somehow.