
[MoE] DeepSeek-V3/R1 #1535


Merged

merged 17 commits from kylesayrs/deepseek-v3 into main on Jun 25, 2025

Conversation

@kylesayrs (Collaborator) commented Jun 10, 2025

Purpose

  • Support DeepSeek-V3 and R1
  • Update MoE examples to reflect the current state of MoE models
  • Share information about sequential onloading and DeepSeek-V3 in the README

Fixes

  • Fixes #1482
  • Fixes #1274
  • Fixes #1203

Changes

  • Add a README blurb about sequential onloading and DeepSeek-R1
  • Add example for R1
  • Add a prepare_for_calibration method which replaces the MoE module with a module that calibrates all experts with all tokens (but still gates expert outputs as the model normally would); see the sketch after this list
    • In the future we can make this method more configurable to support
      • Sending all tokens to all experts
      • Using inference-time activations vs train-time activations
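
As referenced above, here is a minimal sketch of what such a calibration-time MoE replacement could look like. The class and attribute names are illustrative assumptions, not the actual prepare_for_calibration implementation: every expert runs on every token so calibration observes the full dataset for each expert, while each expert's contribution to the output is still weighted by the router.

```python
import torch


class CalibrationMoE(torch.nn.Module):
    """Illustrative stand-in for an MoE block during calibration.

    All experts process all tokens (so every expert's Linear layers see
    calibration data), but each expert's contribution to the output is
    still scaled by the router, mirroring the model's normal gating.
    """

    def __init__(self, original_moe):
        super().__init__()
        self.gate = original_moe.gate        # assumed to return per-expert weights
        self.experts = original_moe.experts  # assumed ModuleList of expert MLPs

    def forward(self, hidden_states):
        # (num_tokens, num_experts) routing weights; near-zero for unrouted experts
        routing_weights = self.gate(hidden_states)

        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            # Run the expert on *all* tokens so its weights get calibrated ...
            expert_out = expert(hidden_states)
            # ... but only add its router-weighted share to the block output.
            output = output + routing_weights[..., i : i + 1] * expert_out
        return output
```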

Testing

  • Ran the DeepSeek-R1 example to completion

@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for the DeepSeekV3 model, specifically focusing on enabling its quantization using the GPTQ algorithm. It includes a new example script that showcases the end-to-end process of quantizing DeepSeekV3 to W4A16 and adds a custom module to correctly handle the model's Mixture-of-Experts (MoE) layers during the quantization process. This addresses issue #1482.

Highlights

  • DeepSeekV3 Quantization Example: Added a new example script (examples/quantization_w4a16/deepseekv3_example.py) demonstrating how to quantize the DeepSeekV3 model to W4A16 using the GPTQ modifier. This script includes steps for loading the model, preparing the calibration dataset, configuring the GPTQ recipe, applying quantization, saving the compressed model, and performing a sample generation test.
  • DeepSeekV3 MoE Handling: Introduced a new utility class DeepseekV3MoELinears in src/llmcompressor/utils/module.py specifically designed to handle the unique structure of the DeepSeekV3 MoE layer during quantization. The example script uses the module_bfs utility to replace the standard DeepseekV3MoE module with this new class before applying quantization.
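
The first highlight above describes the end-to-end quantization flow of the example script. For orientation only, here is a condensed sketch of how such a W4A16 GPTQ run typically looks with llm-compressor's oneshot API; the model ID, dataset, and sample counts are placeholders, and depending on the installed version the import may instead be from llmcompressor.transformers, so treat this as an approximation rather than a copy of the example script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "deepseek-ai/DeepSeek-V3"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ recipe: 4-bit weights, 16-bit activations, skip the LM head
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot calibration and quantization over a small calibration set
oneshot(
    model=model,
    dataset="open_platypus",       # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed model, then smoke-test generation
SAVE_DIR = "DeepSeek-V3-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```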

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request adds an example for quantizing DeepSeekV3 models and introduces a DeepseekV3MoELinears utility class. The review focuses on improving the example's robustness and enhancing documentation in the new utility class.

@kylesayrs (Collaborator, Author)

Splitting by DeepseekV3Attention and DeepseekV3MLP yields 14971 subgraphs
Memory usage is ~2331 MiB
At roughly 12 s per expert, that's about 48 hours (one A100)

This means a single A100 (80 GB) can, in theory, parallelize ~34x, cutting the runtime to about 1.4 hours
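
The arithmetic behind those estimates, as a rough back-of-the-envelope check (interpreting the ~2331 MiB figure as memory per onloaded subgraph, and treating 80 GB as 81920 MiB):

```python
# Rough capacity/runtime estimate from the numbers above.
gpu_mem_mib = 80 * 1024      # one 80 GB A100, treated as 81920 MiB
subgraph_mib = 2331          # assumed memory per onloaded subgraph

parallel_workers = gpu_mem_mib // subgraph_mib    # ~35 with these numbers (~34 as stated above)
serial_hours = 48                                 # observed serial runtime
parallel_hours = serial_hours / parallel_workers  # ~1.4 hours

print(parallel_workers, round(parallel_hours, 1))
```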

@kylesayrs (Collaborator, Author) commented Jun 10, 2025

Another promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:

  1. Implement the option to dispatch a sequential target across N GPUs (where N is however many are available). This dispatch would occur before calibration.
  2. Implement async GPTQ quantization (each quantization step kicks off an async thread that operates on the same device as the module and its Hessian); see the sketch below.

Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and as long as one layer plus its Hessians fits across their N GPUs, all quantization operations would be fully parallelized.

This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
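
A very rough sketch of what item 2 (async quantization) could look like, assuming the sequential target has already been dispatched onto CUDA devices per item 1; the helper names here are hypothetical, not existing llm-compressor APIs.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def quantize_modules_async(modules, quantize_fn, max_workers=8):
    """Launch per-module GPTQ quantization concurrently.

    `quantize_fn(module)` is a hypothetical per-module GPTQ step that consumes
    the Hessian accumulated during calibration; it is assumed to run on the
    same CUDA device that already holds the module's weights and Hessian.
    """

    def _worker(module):
        # Modules are assumed to already sit on CUDA devices (item 1 above).
        # Work for modules on different GPUs overlaps, since each device
        # executes its own kernels independently.
        with torch.cuda.device(module.weight.device):
            return quantize_fn(module)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_worker, m) for m in modules]
        return [f.result() for f in futures]
```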

@ashgold commented Jun 13, 2025

Is this approach equally applicable to AWQ?

Base automatically changed from kylesayrs/sequential-onloading to main June 17, 2025 20:45
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs force-pushed the kylesayrs/deepseek-v3 branch from 95822df to b30eade on June 19, 2025 14:56
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title from [Research] DeepSeekv3 to [MoE] DeepSeekV3 and MoE examples cleanup Jun 19, 2025
@kylesayrs (Collaborator, Author)

@ashgold Yes, a similar technique could be applied to AWQ, where the smoothing of [qkv + gate/up] or [moe experts] could be done in parallel within an onloaded decoder layer.

@kylesayrs (Collaborator, Author) commented Jun 19, 2025

This branch is good to go; it just requires validating each of the example scripts.
EDIT: Good to go

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs marked this pull request as ready for review June 19, 2025 21:22
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs added the `ready` (When a PR is ready for review) label Jun 20, 2025
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title from [MoE] DeepSeekV3 and MoE examples cleanup to [MoE] DeepSeekV3 Jun 20, 2025
@brian-dellabetta (Collaborator) left a comment

one nit on documentation, otherwise LGTM

@kylesayrs kylesayrs changed the title from [MoE] DeepSeekV3 to [MoE] DeepSeek-V3/R1 Jun 23, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@dsikka dsikka enabled auto-merge (squash) June 25, 2025 16:10
@dsikka dsikka merged commit af8f44b into main Jun 25, 2025
11 checks passed
@dsikka dsikka deleted the kylesayrs/deepseek-v3 branch June 25, 2025 16:54
dsikka added a commit that referenced this pull request Jun 25, 2025