
[MoE] DeepSeek-V3/R1 #1535


Merged

merged 17 commits from kylesayrs/deepseek-v3 into main on Jun 25, 2025

Conversation

@kylesayrs (Collaborator) commented Jun 10, 2025

Purpose

  • Support DeepSeek-V3 and R1
  • Update MoE examples to reflect the current state of MoE models
  • Share information about sequential onloading and DeepSeek-V3 in the README

Fixes

  • Fixes #1482
  • Fixes #1274
  • Fixes #1203

Changes

  • Add a README blurb about sequential onloading and DeepSeek-R1
  • Add example for R1
  • Add a prepare_for_calibration method which replaces the MoE module with a module that calibrates all experts with all tokens (but still gates expert outputs as the model normally would); see the sketch after this list
    • In the future we can make this method more configurable to support
      • Sending all tokens to all experts
      • Using inference-time activations vs train-time activations
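
As referenced above, here is a minimal sketch of what such a calibration-time MoE replacement could look like. The class and attribute names are illustrative assumptions, not the actual prepare_for_calibration implementation: every expert runs on every token so calibration observes the full dataset for each expert, while each expert's contribution to the output is still weighted by the router.

```python
import torch


class CalibrationMoE(torch.nn.Module):
    """Illustrative stand-in for an MoE block during calibration.

    All experts process all tokens (so every expert's Linear layers see
    calibration data), but each expert's contribution to the output is
    still scaled by the router, mirroring the model's normal gating.
    """

    def __init__(self, original_moe):
        super().__init__()
        self.gate = original_moe.gate        # assumed to return per-expert weights
        self.experts = original_moe.experts  # assumed ModuleList of expert MLPs

    def forward(self, hidden_states):
        # (num_tokens, num_experts) routing weights; near-zero for unrouted experts
        routing_weights = self.gate(hidden_states)

        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            # Run the expert on *all* tokens so its weights get calibrated ...
            expert_out = expert(hidden_states)
            # ... but only add its router-weighted share to the block output.
            output = output + routing_weights[..., i : i + 1] * expert_out
        return output
```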

Testing

  • Ran the DeepSeek-R1 example to completion

@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for the DeepSeekV3 model, specifically focusing on enabling its quantization using the GPTQ algorithm. It includes a new example script that showcases the end-to-end process of quantizing DeepSeekV3 to W4A16 and adds a custom module to correctly handle the model's Mixture-of-Experts (MoE) layers during the quantization process. This addresses issue #1482.

Highlights

  • DeepSeekV3 Quantization Example: Added a new example script (examples/quantization_w4a16/deepseekv3_example.py) demonstrating how to quantize the DeepSeekV3 model to W4A16 using the GPTQ modifier. This script includes steps for loading the model, preparing the calibration dataset, configuring the GPTQ recipe, applying quantization, saving the compressed model, and performing a sample generation test.
  • DeepSeekV3 MoE Handling: Introduced a new utility class DeepseekV3MoELinears in src/llmcompressor/utils/module.py specifically designed to handle the unique structure of the DeepSeekV3 MoE layer during quantization. The example script uses the module_bfs utility to replace the standard DeepseekV3MoE module with this new class before applying quantization.
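
The first highlight above describes the end-to-end quantization flow of the example script. For orientation only, here is a condensed sketch of how such a W4A16 GPTQ run typically looks with llm-compressor's oneshot API; the model ID, dataset, and sample counts are placeholders, and depending on the installed version the import may instead be from llmcompressor.transformers, so treat this as an approximation rather than a copy of the example script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "deepseek-ai/DeepSeek-V3"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ recipe: 4-bit weights, 16-bit activations, skip the LM head
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot calibration and quantization over a small calibration set
oneshot(
    model=model,
    dataset="open_platypus",       # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed model, then smoke-test generation
SAVE_DIR = "DeepSeek-V3-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```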

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request adds an example for quantizing DeepSeekV3 models and introduces a DeepseekV3MoELinears utility class. The review focuses on improving the example's robustness and enhancing documentation in the new utility class.

@kylesayrs (Collaborator, Author)

Splitting by DeepseekV3Attention and DeepseekV3MLP yields 14971 subgraphs
Memory usage is ~2331 MiB
At roughly 12 s per expert, that's about 48 hours (one A100)

This means a single A100 (80 GB) can, in theory, parallelize ~34x, cutting the runtime to about 1.4 hours
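
The arithmetic behind those estimates, as a rough back-of-the-envelope check (interpreting the ~2331 MiB figure as memory per onloaded subgraph, and treating 80 GB as 81920 MiB):

```python
# Rough capacity/runtime estimate from the numbers above.
gpu_mem_mib = 80 * 1024      # one 80 GB A100, treated as 81920 MiB
subgraph_mib = 2331          # assumed memory per onloaded subgraph

parallel_workers = gpu_mem_mib // subgraph_mib    # ~35 with these numbers (~34 as stated above)
serial_hours = 48                                 # observed serial runtime
parallel_hours = serial_hours / parallel_workers  # ~1.4 hours

print(parallel_workers, round(parallel_hours, 1))
```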

@kylesayrs (Collaborator, Author) commented Jun 10, 2025

Another promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:

  1. Implement the option to dispatch a sequential target across N GPUs (where N is however many are available). This dispatch would occur before calibration.
  2. Implement async GPTQ quantization (each quantization step kicks off an async thread that operates on the same device as the module and its Hessian); see the sketch below.

Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and as long as one layer plus its Hessians fits across their N GPUs, all quantization operations would be fully parallelized.

This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
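
A very rough sketch of what item 2 (async quantization) could look like, assuming the sequential target has already been dispatched onto CUDA devices per item 1; the helper names here are hypothetical, not existing llm-compressor APIs.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def quantize_modules_async(modules, quantize_fn, max_workers=8):
    """Launch per-module GPTQ quantization concurrently.

    `quantize_fn(module)` is a hypothetical per-module GPTQ step that consumes
    the Hessian accumulated during calibration; it is assumed to run on the
    same CUDA device that already holds the module's weights and Hessian.
    """

    def _worker(module):
        # Modules are assumed to already sit on CUDA devices (item 1 above).
        # Work for modules on different GPUs overlaps, since each device
        # executes its own kernels independently.
        with torch.cuda.device(module.weight.device):
            return quantize_fn(module)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_worker, m) for m in modules]
        return [f.result() for f in futures]
```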

@ashgold commented Jun 13, 2025

Is this approach equally applicable to AWQ?

Base automatically changed from kylesayrs/sequential-onloading to main June 17, 2025 20:45
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs force-pushed the kylesayrs/deepseek-v3 branch from 95822df to b30eade on June 19, 2025 14:56
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title from [Research] DeepSeekv3 to [MoE] DeepSeekV3 and MoE examples cleanup Jun 19, 2025
@kylesayrs (Collaborator, Author)

@ashgold Yes, a similar technique could be applied to AWQ, where the smoothing of [qkv + gate/up] or [moe experts] could be done in parallel within an onloaded decoder layer.

@kylesayrs (Collaborator, Author) commented Jun 19, 2025

This branch is good to go; it just requires validating each of the example scripts.
EDIT: Good to go

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs marked this pull request as ready for review June 19, 2025 21:22
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs added the `ready` (When a PR is ready for review) label Jun 20, 2025
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title from [MoE] DeepSeekV3 and MoE examples cleanup to [MoE] DeepSeekV3 Jun 20, 2025
@brian-dellabetta (Collaborator) left a comment

one nit on documentation, otherwise LGTM

@kylesayrs kylesayrs changed the title from [MoE] DeepSeekV3 to [MoE] DeepSeek-V3/R1 Jun 23, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@dsikka dsikka enabled auto-merge (squash) June 25, 2025 16:10
@dsikka dsikka merged commit af8f44b into main Jun 25, 2025
11 checks passed
@dsikka dsikka deleted the kylesayrs/deepseek-v3 branch June 25, 2025 16:54
dsikka added a commit that referenced this pull request Jun 25, 2025