
Enable AMP (Automatic Mixed Precision) in TensorFlow Serving #1583

Open
whatdhack opened this issue Mar 25, 2020 · 11 comments

Comments

@whatdhack

Describe the problem the feature is intended to solve

AMP accelerates inference significantly.

Describe the solution

A flag for enabling AMP

Describe alternatives you've considered

There is no alternative with TensorFlow Serving.

Additional context

N/A

@whatdhack
Author

whatdhack commented Apr 3, 2020

I think this should be a very high priority (at the least FP16); otherwise the case for TFS becomes weak.

@shadowdragon89
Contributor

AMP is mainly targeted at training rather than serving (https://www.tensorflow.org/guide/keras/mixed_precision).

Have you observed a significant performance difference for serving as well? If so, could you share the benchmark and related numbers?
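
For reference, the guide linked above enables mixed precision for training roughly like this (a minimal sketch against the TF 2.4+ Keras API; earlier 2.x releases expose the same thing under mixed_precision.experimental):

from tensorflow.keras import mixed_precision

# Layers created after this call compute in float16 while keeping
# float32 variables, which is the training-oriented use of AMP.
mixed_precision.set_global_policy('mixed_float16')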

@whatdhack
Author

whatdhack commented Apr 12, 2020

How do I turn on AMP in serving? I have observed a 50% improvement in processing time with fp16 over fp32, without any noticeable change in accuracy. Reduced precision is one of the cornerstones of NVIDIA TensorRT, etc. See also https://medium.com/@whatdhack/neural-network-inference-optimization-8651b95e44ee .

@whatdhack
Author

Is there a way to do the following in TFS?

import tensorflow as tf

# TF1: ask the grappler optimizer to run the auto mixed precision rewrite.
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = 1
sess = tf.Session(config=config)
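
In TF 2.x the same grappler rewrite can be requested per process (a sketch using the public tf.config API; note this only configures the local runtime, not tensorflow_model_server, which is exactly the gap this issue is about):

import tensorflow as tf

# Enable the auto mixed precision graph rewrite for graphs optimized
# in this process (TF 2.x counterpart of the ConfigProto snippet above).
tf.config.optimizer.set_experimental_options({'auto_mixed_precision': True})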

@whatdhack
Author

I just ran some tests on a Mask R-CNN SavedModel in nvcr.io/nvidia/tensorflow:20.03-tf1-py3. TF_ENABLE_AUTO_MIXED_PRECISION seems to work very well for inference - it requires less memory and speeds things up significantly. The following are the numbers, if you need more convincing.

TF_ENABLE_AUTO_MIXED_PRECISION=1: memory = 4.2 GB, inference speed 0.25 sec

vs

default (fp32): memory = 7.1 GB, inference speed 0.53 sec
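
For anyone reproducing this inside that container, the variable has to be in the environment before the TF1 session is created (a sketch of the presumed setup; /path/to/maskrcnn is a placeholder for the exported SavedModel directory, and whether the stock TensorFlow Serving binary honors the same variable is the open question of this issue):

import os

# Set before TensorFlow builds the session so the grappler auto mixed
# precision pass picks it up.
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

import tensorflow as tf

sess = tf.Session()
tf.saved_model.loader.load(sess, ['serve'], '/path/to/maskrcnn')
# ... feed inputs and run inference as usual.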

@shadowdragon89
Contributor

Thanks for the experiments and numbers! Based on the numbers, we could add the option. I will also follow up with our GPU team.

@jeisinge

jeisinge commented Nov 2, 2020

Any update here? Also, is it possible to enable JIT/XLA as well, like #1515?
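
For the XLA part, TensorFlow exposes a per-process switch (a sketch; roughly equivalent to setting TF_XLA_FLAGS=--tf_xla_auto_jit=2 in the environment, and, as with AMP, there is no documented tensorflow_model_server flag for it):

import tensorflow as tf

# Turn on XLA auto-clustering (JIT compilation) for this process.
tf.config.optimizer.set_jit(True)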

@lre

lre commented Feb 22, 2022

Any update here?

@DerryFitz

I'd really appreciate this feature being added too.

@junA2Z

junA2Z commented May 17, 2023

Hi, Any updates here?

1 similar comment
@BobLiu20

Hi, Any updates here?
