
Added Arbitrary mixed quantization #1834


Open
wants to merge 5 commits into master

Conversation

Milkdrop

Hi!

I added a quantization method called QX_0, which is mostly useful as a research tool for finding other good quantization methods.

The implementation stores the weights of each block as an arbitrary stream of bits, which lets virtually any quantization mix be tested. As of now, each weight can be stored either as full 16-bit or at a quantized bit precision chosen per block (4-bit, 3-bit, 2-bit or 1-bit, though the implementation allows for any other precision such as 5-bit, 13-bit, etc.).

The motivation behind weight precision mixing is similar to the idea behind Tim Dettmers' LLM.int8(), where a few "outlier" weights with values much larger than those of regular weights can badly throw off the quantization precision of the block. Handling these weights separately can greatly improve quantization accuracy while having a minimal effect on file size, since outliers are very rare.

To demo the implementation of this precision mixing, the current quantizer (ggml_quantize_qx_0) keeps every single weight of the model within a defined maximum quantization error from its original FP16 value, while also attempting to pick the best precision (4bit, 3bit, 2bit or 1bit) for each block (it's really interesting to see how different rows of blocks have different weight variances and thus require different precisions!). A lot of the implementation details are described in the comments of the qx_0 quantization function.

Most of the implementation is in ggml_quantize_qx_0 (quantization) and ggml_vec_dot_qx_0_q8_0 (dequantization + multiplication).
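To make this more concrete, here is a simplified sketch of the per-block selection idea. It is not the actual ggml_quantize_qx_0 code: the helper names and the "smallest encoded block wins" rule are only illustrative.

    #include <math.h>
    #include <stddef.h>

    // Round-trip one weight through an affine (offset + multiplier) grid with
    // 2^qbits levels spanning [lo, hi], similar in spirit to Q4_1.
    static float roundtrip(float x, int qbits, float lo, float hi) {
        const int levels = (1 << qbits) - 1;
        const float scale = levels > 0 ? (hi - lo) / levels : 0.0f;
        if (scale == 0.0f) return lo;
        int q = (int) roundf((x - lo) / scale);
        if (q < 0) q = 0;
        if (q > levels) q = levels;
        return lo + q * scale;
    }

    // Pick a precision for one block: try 4/3/2/1 bits, send every weight whose
    // round-trip error exceeds max_err[qbits] to FP16, and keep whichever choice
    // produces the smallest encoded block. max_err is indexed by bit width
    // (entries 1..4 are used).
    static int plan_block(const float * w, int n, const float * max_err, size_t * out_bits) {
        float lo = w[0], hi = w[0];
        for (int i = 1; i < n; ++i) { if (w[i] < lo) lo = w[i]; if (w[i] > hi) hi = w[i]; }

        int    best_qbits = 4;
        size_t best_bits  = (size_t) -1;
        for (int qbits = 4; qbits >= 1; --qbits) {
            int n_fp16 = 0;
            for (int i = 0; i < n; ++i) {
                if (fabsf(w[i] - roundtrip(w[i], qbits, lo, hi)) > max_err[qbits]) n_fp16++;
            }
            // metadata byte + two FP16 q_params + one indicator bit per weight + payload
            size_t bits = 8 + 32 + (size_t) n + (size_t) (n - n_fp16) * qbits + (size_t) n_fp16 * 16;
            if (bits < best_bits) { best_bits = bits; best_qbits = qbits; }
        }
        if (out_bits) { *out_bits = best_bits; }
        return best_qbits;
    }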

I should add, since the implementation behind QX_0 is very generalized, it's not really meant for inference use since it's pretty difficult to optimize. It's rather meant to be used as an exploration / guidance tool to see what quantization rules / mixing allow for great perplexity at a minimal file size. For example, one could use QX_0 to explore what rows within the model need higher precision / lower RMSE than others, and then develop a fast quantization scheme that only mixes 4bit and 2bit rows, for example.

[Image: diagram of the overall structure of a QX_0 block]

Above is the overall structure of a QX_0 block. This example block mixes 16-bit and 2-bit weights together (the metadata byte indicating that it's a 2-bit quantized block), each weight corresponding to a single bit inside f16_indicator: 0 means that the weight is quantized, 1 means that it's stored as full FP16. q_params are two FP16 numbers which store the offset and multiplier for dequantization, similar to how Q4_1 works. This can easily be changed within the code, since the structure of each block is pretty much arbitrary and is only known to the quantizer and dequantizer.
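In C-struct form, the example block above looks roughly like this (the real format is a packed bit stream, not a struct, and the block size of 32 weights is only assumed here for illustration):

    #include <stdint.h>

    // Purely illustrative block size; QX_0 itself does not fix one here.
    #define QX0_BLOCK_N 32

    // Conceptual layout of the example block in the diagram (2-bit case).
    typedef struct {
        uint8_t  metadata;                       // quantized precision of the block (2-bit in the example)
        uint16_t q_params[2];                    // two FP16 values: offset and multiplier, as in Q4_1
        uint8_t  f16_indicator[QX0_BLOCK_N / 8]; // one bit per weight: 0 = quantized, 1 = stored as FP16
        uint8_t  payload[];                      // 2-bit codes for quantized weights, raw FP16 for the rest
    } qx0_block_example;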

@KerfuffleV2
Collaborator

KerfuffleV2 commented Jun 13, 2023

This is really interesting! Perhaps a way to use a similar approach but still get good performance would be, instead of allowing arbitrary bit quantizations, to just allow selecting between existing quantization types (or even only the k-quants). Then you could delegate each chunk to the existing, heavily optimized functions.

The k-quants already use 256 element superblocks.
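Something like the following is what I have in mind; it's only a sketch, the selection heuristic and helper names are made up, and the actual dispatch would call the existing (k-)quant routines:

    #include <stddef.h>

    // Sketch: tag each 256-element superblock with one of the existing quant
    // types and let the already optimized kernels do the work. choose_type()
    // and its thresholds are invented for illustration only.
    enum block_type { BLOCK_Q2_K, BLOCK_Q4_K, BLOCK_Q6_K, BLOCK_Q8_0 };

    static enum block_type choose_type(const float * w, int n) {
        // Toy heuristic: use a wider type when the block has a large dynamic range.
        float lo = w[0], hi = w[0];
        for (int i = 1; i < n; ++i) { if (w[i] < lo) lo = w[i]; if (w[i] > hi) hi = w[i]; }
        const float range = hi - lo;
        if (range > 1.0f) { return BLOCK_Q8_0; }
        if (range > 0.5f) { return BLOCK_Q6_K; }
        if (range > 0.1f) { return BLOCK_Q4_K; }
        return BLOCK_Q2_K;
    }

    // Per-superblock dispatch: record the chosen type next to the block and call
    // the corresponding existing quantization routine (elided here).
    static void quantize_mixed(const float * src, int n, enum block_type * types) {
        const int QK = 256; // k-quants superblock size
        for (int b = 0; b * QK < n; ++b) {
            types[b] = choose_type(src + b * QK, QK);
            // switch (types[b]) { case BLOCK_Q4_K: /* quantize_row_q4_K(...) */ break; ... }
        }
    }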

the current quantizer (ggml_quantize_qx_0) keeps every single weight of the model within a defined maximum quantization error from its original FP16 value

I'm really curious what perplexity/file size you'd get using that approach with a 7B model, for example!


edit: Did a little testing with LLaMA 7B:

type            size   p1      p2      p3      p4      p5
q8_0            6.7G   4.2166  4.6915  5.5669  6.1696  6.2904
q8_0 -> q5_k_s  4.4G   4.2365  4.7075  5.5911  6.2133  6.3313
q8_0 -> q4_k_m  3.9G   4.3017  4.7306  5.6560  6.2542  6.3633
q8_0 -> qx_0    4.3G   4.9184  5.9700  6.6812  7.4172

Note this was requantizing from q8_0 to q5_k_s and qx_0. The qx_0 perplexity run was taking 186 sec per chunk, so I only let it run for 4 chunks (for comparison, the others were around 15 sec/chunk).

At the moment, it seems like qx_0 increases perplexity considerably more than q4_k_m while producing a larger file. I know this is very early in development so I'm not trying to be critical at all: this is just information in the hopes it will be helpful.

@KerfuffleV2 added the research 🔬 and Less than 4 bits (efforts related to viable quantized models using <4 bits) labels on Jun 13, 2023
@Milkdrop
Author

just allow selecting between existing quantization types

Oh, that sounds like a good idea! Although it could probably be implemented as something like a "QX_1" instead, since I think there's still some value in having a fully arbitrary QX_0 that allows people to mess around and explore any quantization scheme they like.

I'm really curious what perplexity/file size you'd get using that approach with a 7B model, for example!

Regarding file size, QX_0 wouldn't be optimal since it stores 1 extra bit per quantized weight (the bit that determines whether that single weight is fp16 or quantized), so there's a flat 500 extra MB of data. It also doesn't quantize the tok_embeddings and output weights, which could save a few hundred MB more, so the direct file size from QX_0 isn't really too useful. It's better to just calculate the file size of the quantization method you are planning to build (e.g. entire rows at a single set precision), without all the general fluff from QX_0.

For perplexity, I did some preliminary testing on very few tokens (like 512 or so), since it seems that the difference between the final perplexity scores of different quantization methods (like 5.9066 vs 6.1565 for F16 vs Q4_0) is pretty similar to the difference shown in the first 512 tokens (4.2335 vs 4.4576 for F16 vs Q4_0). So while these perplexity results on 512 tokens aren't too useful, they are at least somewhat of a start.

Model          Perplexity   File size
F16            4.2335       13.0 GB
q4_0           4.4576       3.5 GB
qx_42_05_05    6.5739       4.4 GB
qx_42_05_03    4.6538       4.5 GB
qx_42_03_03    4.6274       4.8 GB
qx_432_03_03   4.5833       4.8 GB
qx_432_01_02   4.3392       5.8 GB
qx_432_01_01   4.2545       8.6 GB

Again, the file size isn't really relevant, since QX_0's generalized implementation stores a lot of extra data (like 1 extra bit for every weight), which would be optimized away when someone implements a dedicated quantization method.

The numbers after "qx" represent quantization settings that were used. For example, qx_432_01_02 means that each block is allowed to choose between 4, 3 or 2 bits (432), the max quantization error for 4bits is 0.001 (01) and 0.002 (02) for any lower bit precisions (such as 3bit and 2bit).

It's already interesting to see how the quantization error actually affects accuracy. It seems that striving for a hard limit on quantization error isn't that advantageous: looking at q4_0 (where the mean quantization error is 0.002, with some weights having an error of up to 0.049), it still performs pretty well compared to much more accurate quantization methods.

So minimizing the general quantization error maybe isn't as useful as it seems...? Full perplexity results would be needed to be sure, but maybe some weights / blocks / rows are less sensitive than others when it comes to precision loss, and it would maybe be interesting to take advantage of that when quantizing.

Well, in any case, playing around with QX allows this kind of exploration, where you can fairly easily change the quantization rules and possibly add extra heuristics per weight or per row. The implementation is pretty hackable!

@Milkdrop
Author

it seems like qx_0 increases perplexity considerably more than q4_k_m while producing a larger file

Ah, though you can change the max_quantization_error parameters for qx_0 in ggml.c (it's the max_quantization_errors array in ggml_quantize_qx_0; the comments should hopefully explain it a bit more). To be honest it'd be great if it were a command-line parameter, but for now you need to edit the code. Thankfully that array is only used during quantization (alongside other parameters such as QX_0_STARTING_QBITS and QX_0_START_OF_ATTEMPTED_QBITS), and none of these quantization parameters are needed during inference.
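As a rough illustration of the kind of edit I mean (the actual array layout in the PR may differ, so treat this as a sketch rather than the real ggml.c code), using the error budgets from the qx_432_01_02 run above:

    // Sketch only: a per-bit-width error budget, indexed by quantized bit width.
    // Tightening these values forces more weights into FP16 (bigger file, lower
    // error); loosening them does the opposite.
    static const float max_quantization_errors_example[5] = {
        0.0f,    // index 0: unused
        0.002f,  // 1-bit blocks
        0.002f,  // 2-bit blocks
        0.002f,  // 3-bit blocks
        0.001f,  // 4-bit blocks
    };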

Also yes, qx_0 on its own isn't really meant for inference; I mentioned some of this in the comment above. The file size it produces is sub-optimal since it needs to support all quantization mixes. Perplexity-wise, that's pretty much what it lets people research!

It's pretty much just meant as a research tool, where you can play with different quantization rules and mixing and then implement a new "fast" dedicated quantization method that has the optimal file size.

I could maybe try to implement an "optimal" quantizer guided by observations from QX_0, if that'd be more useful. Though I still wanted to share the research tool that made those observations possible in the first place :-)

@Milkdrop
Author

Milkdrop commented Jun 13, 2023

At the moment, it seems like qx_0 increases perplexity considerably

Oh, I should add, the max quantization error I put as a default in this PR is 0.004 for all quantized weights, which is really large (it's about 2x the mean quantization error from q4_0), so I think those large perplexity results make sense in the end!

@KerfuffleV2
Collaborator

Ahh, I misunderstood what you said in the first post as meaning that it currently tries to optimize max allowed error to be equivalent to f16. I did understand it wasn't meant for inference, but I went into my tests expecting a perplexity result about the same as full 16bit.

Is there a way to estimate the overhead and get an idea of what the file size would be if a fast dedicated quantization method were created based on the current qx_0 approach? I guess we could immediately subtract 1 bit/weight.

I wonder if you really get much of an advantage storing a bitmap for every weight to control whether it's 16bit rather than just saying "this whole block is 16bit". Presumably that won't be needed for very many blocks so I suspect it would make things less complicated and be smaller/faster overall also.

but maybe some weights / blocks / rows are less sensitive than others when it comes to precision loss

I think that's definitely the case. That's actually at least partially what the new k-quants stuff is based on: it uses heuristics to try to use more aggressive quantization on tensors where it doesn't affect perplexity as much.

I was also looking at that kind of thing: #1707

This code is super dumb, but you can plug it into the quantize tool right above the printf("quantizing...") line to get a dump of standard deviations:

            if (true) {
                // Histogram of how far weights sit from the tensor mean, in
                // bands of standard deviations: 1-5, 6-10, ..., 26-30.
                std::vector<size_t> devsa(6);

                // Mean of the tensor.
                long double sum = 0;
                for (auto i = 0; i < nelements; i++) {
                    sum += f32_data[i];
                }
                long double m = sum / nelements;

                // Sample standard deviation.
                long double accum = 0.0;
                for (auto i = 0; i < nelements; i++) {
                    long double d = f32_data[i] - m;
                    accum += d * d;
                }
                long double stdev = sqrtl(accum / (nelements - 1));

                // Count, for each weight, how many whole standard deviations it
                // is away from the mean (capped at 30), then bucket into bands.
                for (auto i = 0; i < nelements; i++) {
                    auto d = f32_data[i];
                    int devs = 0;
                    auto x = m;
                    if (d < m) {
                        devs++;
                        x -= stdev;
                        for (; d < x; devs++, x -= stdev);
                    } else if (d > m) {
                        devs++;
                        x += stdev;
                        for (; d > x; devs++, x += stdev);
                    }
                    devs = std::min(30, devs);
                    if (devs > 0) {
                        devsa[(devs - 1) / 5]++;
                    }
                }

                printf("\n");
                for (auto it : devsa) {
                    printf(" %9zu", it);
                }
                printf("\n");
            }

It puts them in bands, 1-5, 6-10, etc up to 30. If you run that, you'll see some tensors have outliers, some don't.

I wanted to make something similar to your PR (although I don't have the skills) that did an analysis like that on the block (of 256 items or whatever) and then just decided on an existing quantization based on some heuristic.

I did try just using that stdev analysis to decide what quantization to use for the whole tensor based on how many items were in the bands like:

                if (devsa[5] > 25 || devsa[4] > 50) {
                    new_type = GGML_TYPE_Q8_0;
                } else if (devsa[5] > 10 || devsa[4] > 25) {
                    new_type = GGML_TYPE_Q6_K;
                } else if (devsa[5] > 2 || devsa[4] > 50) {
                    new_type = GGML_TYPE_Q5_K;
                } else if (devsa[3] > 10 || devsa[2] > 100) {
                    new_type = GGML_TYPE_Q4_K;
                } else if (devsa[1] > 4000 || devsa[2] > 60) {
                    new_type = GGML_TYPE_Q4_K;
                } else if (devsa[1] > 2000) {
                    new_type = GGML_TYPE_Q3_K;
                } else {
                    new_type = GGML_TYPE_Q2_K;
                }

It didn't work very well though: it was significantly worse than something like Q4_K_M or Q3_K_M.

@ggerganov
Member

ggerganov commented Jun 17, 2023

Hi, thanks for this research work. The code is well written.

I don't think the ggml.c changes can be merged: it's a lot of extra code that I don't think will see much use, yet it would incur a certain amount of technical debt. However, I think the proposed quantization method can be implemented as an example in llama.cpp. To do that, we can easily extend the ggml interface to provide a way to edit the quantize_fns_t for certain values of enum ggml_type, and extend enum ggml_type with a few reserved "custom" quantization types that the user has the option to implement from user code. I think it is a relatively trivial change and it will be very beneficial for the project in general, since it will make it easy to support customized quantization methods in the future. Let me know if you are up for doing it this way; if the description is not clear, I can provide more details on how to implement it.
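Roughly, the interface I have in mind could look something like this (just a sketch to convey the idea; this API does not exist yet, and the struct fields would need to match whatever quantize_fns_t actually contains):

    #include "ggml.h"

    // Sketch only: reserve a few "custom" values in enum ggml_type and let user
    // code install the callbacks for them, so a method like QX_0 can live in a
    // llama.cpp example instead of ggml.c. Field names are illustrative.
    typedef struct {
        void (*quantize_row)  (const float * x, void * y, int k);
        void (*dequantize_row)(const void * x, float * y, int k);
        void (*vec_dot)       (int n, float * s, const void * x, const void * y);
    } ggml_custom_quantize_fns;

    // Proposed setter: bind the callbacks to one of the reserved custom types,
    // e.g. a hypothetical GGML_TYPE_CUSTOM_0.
    void ggml_set_custom_quantize_fns(enum ggml_type type, ggml_custom_quantize_fns fns);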

So minimizing the general quantization error maybe isn't as useful as it seems...?

Yes, this is in line with previous observations from the recent quantization efforts in llama.cpp. I guess it's somewhat unintuitive.
