[MoE] DeepSeek-V3/R1 #1535
Conversation
Summary of Changes
Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces initial support for the DeepSeekV3 model, specifically focusing on enabling its quantization using the GPTQ algorithm. It includes a new example script that showcases the end-to-end process of quantizing DeepSeekV3 to W4A16 and adds a custom module to correctly handle the model's Mixture-of-Experts (MoE) layers during the quantization process. This addresses issue #1482.
Highlights
- DeepSeekV3 Quantization Example: Added a new example script (`examples/quantization_w4a16/deepseekv3_example.py`) demonstrating how to quantize the DeepSeekV3 model to W4A16 using the GPTQ modifier. The script includes steps for loading the model, preparing the calibration dataset, configuring the GPTQ recipe, applying quantization, saving the compressed model, and performing a sample generation test.
- DeepSeekV3 MoE Handling: Introduced a new utility class `DeepseekV3MoELinears` in `src/llmcompressor/utils/module.py`, designed specifically to handle the unique structure of the DeepSeekV3 MoE layer during quantization. The example script uses the `module_bfs` utility to replace the standard `DeepseekV3MoE` module with this new class before applying quantization.
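For orientation, here is a minimal sketch of the end-to-end flow the highlights describe. It follows the GPTQ/W4A16 `oneshot` pattern used in other llm-compressor examples; the dataset name, sample counts, and save directory are placeholders rather than a copy of the PR's actual script, and the MoE-replacement call is only noted in a comment because its exact signature is not shown in this summary.

```python
# Minimal, hedged sketch of the W4A16 GPTQ flow described above (not the PR's exact script).
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "deepseek-ai/DeepSeek-V3"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NOTE: per the summary above, the example replaces each DeepseekV3MoE module with
# DeepseekV3MoELinears via the module_bfs utility before this point; that call is
# omitted here because its exact signature is not shown in this summary.

# Quantize all Linear weights to 4 bits with 16-bit activations, skipping the LM head
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",          # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed checkpoint and run a quick generation sanity check
SAVE_DIR = "DeepSeek-V3-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

sample = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**sample, max_new_tokens=20)[0]))
```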
Code Review
This pull request adds an example for quantizing DeepSeekV3 models and introduces a `DeepseekV3MoELinears` utility class. The review focuses on improving the example's robustness and enhancing documentation in the new utility class.
Splitting this way means a single A100 (80GB) can, in theory, parallelize ~34x, cutting the runtime to about ~1.4 hours.
Another promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:
Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and as long as one layer plus its Hessians fits across their N GPUs, all quantization operations would be fully parallelized. This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
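For illustration, the toy sketch below shows the shape of that idea: once a decoder layer is onloaded, each Linear's quantization update is independent, so the updates can be fanned out across GPUs. This is not llm-compressor code; `fake_quantize_` is a round-to-nearest stand-in for the real GPTQ update, and the round-robin device assignment stands in for whatever scheduling the actual feature would use.

```python
# Toy sketch of per-layer quantization parallelism (illustrative only).
from concurrent.futures import ThreadPoolExecutor

import torch


def fake_quantize_(linear: torch.nn.Linear) -> None:
    # Stand-in for the real GPTQ update: toy 4-bit symmetric round-to-nearest
    with torch.no_grad():
        scale = linear.weight.abs().amax() / 7.0
        linear.weight.copy_((linear.weight / scale).round().clamp(-8, 7) * scale)


def quantize_layer_parallel(layer: torch.nn.Module, devices: list[str]) -> None:
    """Quantize each Linear in an onloaded layer on a round-robin assigned device."""
    linears = [m for m in layer.modules() if isinstance(m, torch.nn.Linear)]

    def job(idx: int, module: torch.nn.Linear) -> None:
        device = devices[idx % len(devices)]
        module.to(device)        # onload this module's weights to its assigned GPU
        fake_quantize_(module)   # independent per-module quantization step
        module.to("cpu")         # offload again to keep memory bounded

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        list(pool.map(lambda args: job(*args), enumerate(linears)))


# Example: spread one toy "decoder layer" across however many GPUs are available
layer = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
gpus = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
quantize_layer_parallel(layer, gpus)
```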
Is this approach equally applicable to AWQ?
@ashgold Yes, a similar technique could be applied for AWQ, where the smoothing of [qkv + gate/up] or [moe experts] could be done in parallel within an onloaded decoder layer.
one nit on documentation, otherwise LGTM
## Purpose ##
* Support DeepSeek-V3 and R1
* Update MoE examples to reflect the current state of MoE models
* Share information about sequential onloading and DeepSeek-V3 in the readme

## Fixes ##
* Fixes #1482
* Fixes #1274
* Fixes #1203

## Changes ##
* Add readme blurb on sequential onloading and DeepSeek-R1
* Add example for R1
* Add a `prepare_for_calibration` method which replaces the MoE module with a module that calibrates all experts with all tokens (but still gates expert outputs as the model normally would); see the sketch below
* In the future we can make this method more configurable to support:
  * Sending all tokens to all experts
  * Using inference-time activations vs. train-time activations

## Testing ##
* Ran the DeepSeek-R1 example to completion

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
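To make the `prepare_for_calibration` idea above concrete, the sketch below shows a calibration-time MoE module in which every expert runs on every token (so every expert receives calibration data) while outputs are still combined with the router's normal top-k gate weights. The class name, the plain top-k softmax router, and the toy dimensions are illustrative assumptions; the real DeepseekV3 router (sigmoid scores, expert groups, shared experts) is more involved, and this is not the PR's actual implementation.

```python
# Hedged, illustrative sketch of calibrating all experts with all tokens while
# still gating outputs the way a normal top-k MoE would.
import torch


class CalibrateAllExpertsMoE(torch.nn.Module):
    def __init__(self, gate: torch.nn.Linear, experts: torch.nn.ModuleList, top_k: int):
        super().__init__()
        self.gate = gate          # router producing one logit per expert
        self.experts = experts    # expert MLPs
        self.top_k = top_k

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Normal sparse gate: softmax over the top-k router logits, zero elsewhere
        logits = self.gate(hidden_states)                          # (..., num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))

        # Calibration difference: run *all* experts on *all* tokens so every expert's
        # observers/Hessians see data, then gate outputs as the model normally would
        expert_outputs = torch.stack(
            [expert(hidden_states) for expert in self.experts], dim=-1
        )                                                          # (..., hidden, num_experts)
        return torch.einsum("...dh,...h->...d", expert_outputs, weights)


# Toy usage: 4 experts, top-2 routing, hidden size 16
moe = CalibrateAllExpertsMoE(
    gate=torch.nn.Linear(16, 4, bias=False),
    experts=torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(4)]),
    top_k=2,
)
out = moe(torch.randn(2, 8, 16))   # (batch, seq, hidden) in, same shape out
```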