fp8 quantization for inference. #1316

jwyang-google · 2025-02-26T21:58:34Z

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

why is this change being made,
the problem being solved and any relevant context,
why this is a good solution,
some information about the specific implementation,
shortcomings of the solution and possible future improvements.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456

Notice 1 Once all tests pass, the "pull ready" label will automatically be assigned. This label is used
for administrative purposes. Please do not add it manually.

Notice 2 For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed.

gemini-code-assist · 2025-02-26T21:58:37Z

Important

The terms of service for this installation have not been accepted. Please ask the Organization owners to visit the Gemini Code Assist Admin Console to sign them.

vipannalla · 2025-02-26T22:06:22Z

@singh-mitali, can you take a look at these quantization changes?

singh-mitali · 2025-02-26T22:22:28Z

MaxText/layers/quantizations.py

@@ -496,6 +521,8 @@ def einsum_fn_with_rhs_qtensor(
  def einsum_fn_with_rhs_qtensor_and_dequant(self, value):
    return self.einsum_fn_with_rhs_qtensor(
        value,
+        lhs_dequant_mode=aqt_config.DequantMode.THIS_INPUT,


singh-mitali: Why was this change required?

jwyang-google: Adding this would make it perform better with fp8. int8 performance stays the same.

singh-mitali: Can we make this a separate function or pass a flag?
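One way the "pass a flag" suggestion could look is sketched below. This is only an illustration, not the code in this PR: the `use_lhs_input_dequant` flag, the `select_lhs_dequant_kwargs` helper, the stub einsum function, and the local `DequantMode` enum (a stand-in for `aqt_config.DequantMode`) are all hypothetical names introduced here. When the flag is off, no `lhs_dequant_mode` argument is passed, so the wrapped function keeps whatever behavior it had before the change.

```python
import enum


class DequantMode(enum.Enum):
  """Hypothetical stand-in for aqt_config.DequantMode (only what the sketch needs)."""
  OUTPUT = "output"
  THIS_INPUT = "this_input"


def select_lhs_dequant_kwargs(use_lhs_input_dequant: bool) -> dict:
  """Returns extra kwargs for the einsum call.

  An empty dict means "do not override", so the callee's existing
  default applies and the pre-change behavior is preserved.
  """
  if use_lhs_input_dequant:
    return {"lhs_dequant_mode": DequantMode.THIS_INPUT}
  return {}


def stub_einsum_fn_with_rhs_qtensor(value, **kwargs):
  """Stub for the real method; it only records how it was called."""
  return {"value": value, "kwargs": kwargs}


def einsum_fn_with_rhs_qtensor_and_dequant(value, use_lhs_input_dequant=False):
  """Flag-gated wrapper: the override is opt-in (e.g. enabled for fp8 only)."""
  extra = select_lhs_dequant_kwargs(use_lhs_input_dequant)
  return stub_einsum_fn_with_rhs_qtensor(value, **extra)
```

With this shape, the caller would set the flag only for configurations where the override helps, and every other path would call the wrapped function exactly as before.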

jwyang-google requested review from gobbleturk, khatwanimohit, bvandermoon, vipannalla, RissyRan, richjames0, rni418 and gagika as code owners February 26, 2025 21:58

jwyang-google force-pushed the gpu_inference_quant branch from a0ec1b6 to 061b9da Compare February 26, 2025 22:10

fp8 quantization for inference.

a681ba3

jwyang-google force-pushed the gpu_inference_quant branch from 061b9da to a681ba3 Compare February 26, 2025 22:17

singh-mitali reviewed Feb 26, 2025
