Why does TF Serving GPU use so much GPU memory? #1929

Closed
duonghb53 opened this issue Oct 31, 2021 · 10 comments
Assignees: pindinagesh
Labels: stale, stat:awaiting response, type:bug

Comments

@duonghb53

Bug Report

TFS uses a very large amount of GPU memory!

System information

  • OS Platform and Distribution: CentOS 8.4, CUDA 11.4
  • TensorFlow Serving installed from: Docker
  • TensorFlow Serving version: latest-gpu

Describe the problem

I tried to install and run TFS with the sample, following this guide: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/docker.md

After the installation, I ran TFS with the sample:

docker run --gpus all -p 8501:8501 \
  --mount type=bind,source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &

but the GPU memory usage is 22553MiB/24265MiB.
I don't know the reason for this.
Could you please explain it?

Source code / logs

[image attached]

Thank you!

@pindinagesh pindinagesh self-assigned this Nov 2, 2021
@pindinagesh

@duonghb53

Could you please refer to the similar issues described in link1 and link2, and let us know if this helps? Thanks.

@shv07

shv07 commented Nov 8, 2021

@pindinagesh

I have also been facing this issue while deploying models with TF Serving.

The GPU usage keeps increasing with the number of inference requests, and the used memory is not released after a request completes, which quickly leads to an OOM error when multiple models are deployed on the same GPU.

Although this behavior also occurs with a pure TensorFlow-based (non-TF-Serving) deployment, I was expecting it would not be the case with TF Serving, since it is designed for optimized production deployment.

link1 and link2 were not of much help. Enabling GPU memory growth (using TF_FORCE_GPU_ALLOW_GROWTH=true) leads to low GPU usage initially, but the same issue eventually appears as the number of requests increases. Restricting GPU usage to a certain fraction is also not a feasible approach when the same GPU is used to deploy multiple models.
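
For reference, a minimal sketch of how the growth option can be passed to the serving container, reusing the half_plus_two example from earlier in this thread (the paths and model name are just that sample's placeholders):

# Minimal sketch: enable on-demand GPU memory growth for TF Serving by setting
# the TF_FORCE_GPU_ALLOW_GROWTH environment variable on the container.
# Paths and MODEL_NAME reuse the half_plus_two sample from this thread.
docker run --gpus all -p 8501:8501 \
  --mount type=bind,source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two \
  -e TF_FORCE_GPU_ALLOW_GROWTH=true \
  -t tensorflow/serving:latest-gpu &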

Is it possible to somehow release the unused GPU memory once an inference request is complete, or is there some other way to tackle this issue?

@shv07

shv07 commented Nov 16, 2021

Any updates on this?

@guanxinq
Contributor

The issue of not releasing unused memory is inherited from TF and currently there is no workaround. We will discuss it internally to see if we could fix it at least for TF serving.

@duonghb53
Author

The issue of not releasing unused memory is inherited from TF and currently there is no workaround. We will discuss it internally to see if we could fix it at least for TF serving.

Thank you for the response! I will wait for the team to resolve it.

@shv07

shv07 commented Nov 22, 2021

The issue of not releasing unused memory is inherited from TF and currently there is no workaround. We will discuss it internally to see if we could fix it at least for TF serving.

So, is the additional memory pool that keeps getting occupied by TF Serving shared across all the models deployed in it? That is, if I have a TF and a non-TF model deployed with TF Serving, will that memory pool be shared between both of them?

@singhniraj08

@duonghb53,

Under the hood, TensorFlow Serving uses the TensorFlow runtime to do the actual inference on your requests, and by default TensorFlow maps nearly all of the GPU memory of all GPUs visible to the process (subject to CUDA_VISIBLE_DEVICES).

To mitigate this, we currently provide the command-line flag "per_process_gpu_memory_fraction", which defines the fraction of the GPU memory space that each process occupies. Kindly let us know if this flag resolves your high GPU usage.
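
As an illustration, a minimal sketch of passing this flag through to the model server when starting the container (arguments after the image name are forwarded to tensorflow_model_server; the 0.4 fraction is an arbitrary example value, not a recommendation):

# Minimal sketch: cap TF Serving's GPU memory to a fraction of the device with
# the --per_process_gpu_memory_fraction flag of tensorflow_model_server.
# The 0.4 value is only an illustrative assumption.
docker run --gpus all -p 8501:8501 \
  --mount type=bind,source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two \
  -t tensorflow/serving:latest-gpu \
  --per_process_gpu_memory_fraction=0.4 &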
Thank you!

@github-actions

github-actions bot commented May 5, 2023

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label May 5, 2023
@github-actions

This issue was closed due to lack of activity after being marked stale for past 7 days.
