Added Arbitrary mixed quantization #1834
base: master
Conversation
This is really interesting! Perhaps a way to use a similar approach but still get good performance would be, instead of allowing arbitrary bit quantizations, to just allow selecting between existing quantization types (or even only the k-quants). Then you could delegate each chunk to the existing heavily optimized functions. The k-quants already use 256-element superblocks.
I'm really curious what perplexity/file size you'd get using that approach with a 7B model, for example! edit: Did a little testing with LLaMA 7B:
Note this was requantizing from q8_0 to q5_k_s and qx_0. The qx_0 perplexity was taking 186 sec per chunk, so I only let it run for 4 blocks (for comparison, the others were around 15 sec/block). At the moment, it seems like qx_0 increases perplexity considerably more than q4_k_m while producing a larger file. I know this is very early in development so I'm not trying to be critical at all: this is just information in the hopes it will be helpful.
Oh, that sounds like a good idea! Although it could probably be implemented on something like a "QX_1" instead, since I think there's still some value in having a fully arbitrary QX_0 that allows people to mess around and explore any quantization scheme they like.
Regarding file size, QX_0 wouldn't be optimal since it stores 1 extra bit per quantized weight (the bit that determines whether that single weight is fp16 or quantized), so there's a flat 500 extra MB of data. It also doesn't quantize the tok_embeddings and output weights, which could save another few hundred MB, so the direct file size from QX_0 isn't really too useful. It's better to just calculate the file size of the quantization method you are planning to do (e.g. entire rows at a single fixed precision), without all the general fluff from QX_0.

For perplexity, I did some preliminary testing on very few tokens (like 512 or so), since it seems that the difference between the final perplexity scores of different quantization methods (like 5.9066 vs 6.1565 for F16 vs Q4_0) is pretty similar to the difference shown in the first 512 tokens (4.2335 vs 4.4576 for F16 vs Q4_0). So while these perplexity results on 512 tokens aren't too useful, they are at least somewhat of a start.
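As an illustration of that kind of back-of-the-envelope file-size calculation (this is not part of the PR; the parameter count, precision mix, and block overhead below are made-up numbers), a planned row-wise mix can be costed out directly:

```cpp
// Rough, stand-alone sketch: estimate the file size of a hypothetical row-wise mix,
// e.g. some rows at 4 bits and some at 2 bits per weight, plus per-block scale/offset
// overhead. All names and numbers here are illustrative, not taken from the PR.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t n_weights      = 7000000000ULL; // ~7B parameters (illustrative)
    const double   frac_4bit      = 0.80;          // assumed share of rows kept at 4 bits
    const double   frac_2bit      = 0.20;          // assumed share of rows dropped to 2 bits
    const uint64_t block_size     = 32;            // weights per block (Q4_1-style)
    const uint64_t block_overhead = 2 * 2;         // two FP16 values (scale + offset) per block

    const double weight_bytes = n_weights * (frac_4bit * 4.0 + frac_2bit * 2.0) / 8.0;
    const double block_bytes  = (double)n_weights / block_size * block_overhead;

    printf("weights: %.2f GiB, block metadata: %.2f GiB, total: %.2f GiB\n",
           weight_bytes / (1 << 30), block_bytes / (1 << 30),
           (weight_bytes + block_bytes) / (1 << 30));
    return 0;
}
```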
Again, the file size isn't really relevant since QX_0's generalized implementation stores a lot of extra data (like 1 extra bit per weight), which would be optimized away when someone implements a dedicated quantization method. The numbers after "qx" represent the quantization settings that were used. It's already interesting to see how the quantization error really affects accuracy. It seems that striving for a hard limit on quantization error isn't that advantageous: looking at q4_0 (where the mean quantization error is 0.002, with some weights having an error of up to 0.049), it still seems to perform pretty well compared to much more accurate quantization methods. So minimizing the general quantization error maybe isn't as useful as it seems...? Full perplexity results would be needed to be sure, but maybe some weights / blocks / rows are less sensitive than others when it comes to precision loss, and it could be interesting to take advantage of that when quantizing. In any case, playing around with QX allows this kind of exploration, where you can fairly easily change the quantization rules and possibly add extra heuristics for weights or per-row rules. The implementation is pretty hackable!
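For reference, measuring that kind of per-weight quantization error is easy to do in isolation. The sketch below is a self-contained toy (a simplified symmetric 4-bit round-trip on synthetic weights, not the actual ggml q4_0 kernel) that reports the mean and max absolute error over a tensor:

```cpp
// Toy measurement of per-weight quantization error for a symmetric 4-bit round-trip.
// Not the exact ggml kernel; the weight distribution is synthetic.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int block = 32;
    std::vector<float> x(4096);
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 0.02f); // toy weights
    for (auto & v : x) v = dist(rng);

    double sum_err = 0.0, max_err = 0.0;
    for (size_t i0 = 0; i0 < x.size(); i0 += block) {
        float amax = 0.0f;
        for (int j = 0; j < block; j++) amax = std::max(amax, std::fabs(x[i0 + j]));
        const float d = amax / 7.0f;                             // toy 4-bit symmetric scale
        for (int j = 0; j < block; j++) {
            const int    q   = d ? (int)std::lround(x[i0 + j] / d) : 0; // -7..7
            const float  xr  = q * d;                                   // dequantized value
            const double err = std::fabs(xr - x[i0 + j]);
            sum_err += err;
            max_err  = std::max(max_err, err);
        }
    }
    printf("mean abs error %.5f, max abs error %.5f\n", sum_err / x.size(), max_err);
    return 0;
}
```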
Ah, though you can change the max_quantization_error parameters for qx_0 in ggml.c. Also, yes, qx_0 on its own isn't really meant for inference; I mentioned some of this in the comment above. The file size it produces is sub-optimal since it needs to support all quantization mixes. Perplexity-wise, that's pretty much what it allows people to research! It's really just meant as a research tool, where you can play with different quantization rules and mixing and then implement a new "fast" dedicated quantization method that has the optimal file size. I could maybe try to implement an "optimal" quantizer guided by observations from QX_0, if that'd be more useful. Though I still wanted to share the research tool that made those observations possible in the first place :-)
Oh, I should add, the max quantization error I put as a default in this PR is 0.004 for all quantized weights, which is really large (it's about 2x the mean quantization error from q4_0), so I think those large perplexity results make sense in the end!
Ahh, I misunderstood what you said in the first post as meaning that it currently tries to optimize the max allowed error to be equivalent to f16. I did understand it wasn't meant for inference, but I went into my tests expecting a perplexity result about the same as full 16bit. Is there a way to estimate the overhead and get an idea of what the file size would be if the fast dedicated quantization method was created based on the current qx_0 approach? I guess we could immediately subtract 1bit/weight. I wonder if you really get much of an advantage storing a bitmap for every weight to control whether it's 16bit, rather than just saying "this whole block is 16bit". Presumably that won't be needed for very many blocks, so I suspect it would make things less complicated and be smaller/faster overall too.
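A rough way to compare the bookkeeping cost of the two schemes being discussed (1 indicator bit per weight vs. 1 flag per block); the parameter count and block size below are illustrative assumptions, and this ignores that whole-block FP16 fallback also stores more FP16 data:

```cpp
// Back-of-the-envelope comparison of per-weight vs per-block FP16 bookkeeping.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t n_weights  = 7000000000ULL; // ~7B parameters (illustrative)
    const uint64_t block_size = 256;           // k-quants-style superblock (assumed)

    // Scheme A: 1 indicator bit per weight (as in the current QX_0 layout).
    const double per_weight_bits = (double)n_weights;
    // Scheme B: 1 flag per block saying "this whole block stays FP16".
    const double per_block_bits  = (double)n_weights / block_size;

    // In practice only the quantized tensors carry this overhead, so real totals are lower.
    printf("per-weight bitmap: ~%.0f MiB\n", per_weight_bits / 8.0 / (1 << 20));
    printf("per-block flag:    ~%.2f MiB\n", per_block_bits  / 8.0 / (1 << 20));
    return 0;
}
```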
I think that's definitely the case. That's actually at least partially what the new k-quants stuff is based on: it uses heuristics to apply more aggressive quantization to tensors where it doesn't affect perplexity as much. I was also looking at that kind of thing: #1707

This code is super dumb, but you can plug it into the quantize tool right above the … :

```cpp
if (true) {
    // Assumes f32_data (const float *) and nelements from the surrounding quantize code;
    // needs <vector>, <cmath>, <algorithm> and <cstdio> if they aren't already included.
    std::vector<size_t> devsa(6);          // counts for bands of 1-5, 6-10, ..., 26-30 std devs
    long double sum = 0;
    for (auto i = 0; i < nelements; i++) {
        sum += f32_data[i];
    }
    long double m = sum / nelements;       // mean
    long double accum = 0.0;
    for (auto i = 0; i < nelements; i++) {
        auto d = f32_data[i];
        accum += (d - m) * (d - m);
    }
    long double stdev = sqrtl(accum / (nelements - 1));
    // For every element, count how many standard deviations it sits away from the mean,
    // then bucket that count into one of the six bands.
    for (auto i = 0; i < nelements; i++) {
        auto d = f32_data[i];
        auto devs = 0;
        auto x = m;
        if (d < m) {
            devs++;
            x -= stdev;
            for (; d < x; devs++, x -= stdev);
        } else if (d > m) {
            devs++;
            x += stdev;
            for (; d > x; devs++, x += stdev);
        }
        devs = std::min(30, devs);
        if (devs > 0) {
            auto band = (devs - 1) / 5;    // 1-5 -> band 0, ..., 26-30 -> band 5 (stays in range)
            devsa[band]++;
        }
    }
    printf("\n");
    for (auto it : devsa) {
        printf(" %9zu", it);
    }
    printf("\n");
}
```

It puts them in bands: 1-5, 6-10, etc. up to 30. If you run that, you'll see some tensors have outliers and some don't. I wanted to make something similar to your PR (although I don't have the skills) that did an analysis like that per block (of 256 items or whatever) and then just decided on an existing quantization based on some heuristic. I did try just using that stdev analysis to decide what quantization to use for the whole tensor based on how many items were in the bands, like:

```cpp
// Heuristic: map the band counts to an existing quantization type.
if (devsa[5] > 25 || devsa[4] > 50) {
    new_type = GGML_TYPE_Q8_0;
} else if (devsa[5] > 10 || devsa[4] > 25) {
    new_type = GGML_TYPE_Q6_K;
} else if (devsa[5] > 2 || devsa[4] > 50) {
    new_type = GGML_TYPE_Q5_K;
} else if (devsa[3] > 10 || devsa[2] > 100) {
    new_type = GGML_TYPE_Q4_K;
} else if (devsa[1] > 4000 || devsa[2] > 60) {
    new_type = GGML_TYPE_Q4_K;
} else if (devsa[1] > 2000) {
    new_type = GGML_TYPE_Q3_K;
} else {
    new_type = GGML_TYPE_Q2_K;
}
```

It didn't work very well though: it was significantly worse than something like Q4_K_M or Q3_K_M.
Hi, thanks for this research work. The code is well written. I don't think the …
Yes, this is in line with previous observations from the recent quantization efforts in …
Hi!
I added a quantization method called QX_0, which is mostly useful as a research tool for finding other good quantization methods.
The implementation allows for weights to be stored as an arbitrary stream of bits per block, which allows for virtually any quantization mixing to be tested. As of now, the implementation allows each weight to be stored either as full 16 bits or at a block-defined quantized bit precision, with each block able to choose its quantization precision (4bit, 3bit, 2bit, 1bit, but the implementation allows for any other precision such as 5bit, 13bit, etc.).
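As a minimal illustration of what "an arbitrary stream of bits per block" means in practice (this is not the PR's code, just a generic bit-packing sketch), values of any width can be appended to a byte buffer like this:

```cpp
// Minimal bit-packing illustration: append values of arbitrary width to a byte stream.
#include <cstdint>
#include <cstdio>
#include <vector>

struct bit_writer {
    std::vector<uint8_t> bytes;
    int nbits = 0; // total bits written so far

    void push(uint32_t value, int width) {           // append the `width` low bits of `value`
        for (int i = 0; i < width; i++, nbits++) {
            if (nbits % 8 == 0) bytes.push_back(0);
            bytes.back() |= ((value >> i) & 1u) << (nbits % 8);
        }
    }
};

int main() {
    bit_writer bw;
    bw.push(5, 3);    // a 3-bit value
    bw.push(200, 13); // a 13-bit value; any width works
    bw.push(1, 1);    // a 1-bit flag
    printf("packed %d bits into %zu bytes\n", bw.nbits, bw.bytes.size());
    return 0;
}
```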
The motivation behind weight precision mixing is similar to the idea behind Tim Dettmers' LLM.int8(), where a few "outlier" weights with values much larger than those of regular weights can badly throw off the quantization precision of the block. Handling these weights separately can greatly improve quantization accuracy while having a minimal effect on file size, since outliers are very rare.
To demo the implementation of this precision mixing, the current quantizer (ggml_quantize_qx_0) keeps every single weight of the model within a defined maximum quantization error from its original FP16 value, while also attempting to pick the best precision (4bit, 3bit, 2bit or 1bit) for each block (it's really interesting to see how different rows of blocks have different weight variances and thus require different precisions!). A lot of the implementation details are described in the comments of the qx_0 quantization function.
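Below is a hedged sketch of the kind of selection loop just described; the actual ggml_quantize_qx_0 logic lives in ggml.c and is documented in its own comments, so the block size, codebook, and selection rule here are simplified stand-ins:

```cpp
// NOT the actual ggml_quantize_qx_0 logic: one simple way to pick a per-block bit
// width under a hard per-weight error budget, with FP16 fallback for weights that miss it.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Round-trip a block through a min/max (offset + scale) code at `bits` of precision
// and count the weights whose absolute error exceeds `max_err`.
static int count_misses(const std::vector<float> & blk, int bits, float max_err) {
    const float mn = *std::min_element(blk.begin(), blk.end());
    const float mx = *std::max_element(blk.begin(), blk.end());
    const int   levels = (1 << bits) - 1;
    const float scale  = (mx > mn) ? (mx - mn) / levels : 0.0f;
    int misses = 0;
    for (float v : blk) {
        const long  q  = scale > 0 ? std::lround((v - mn) / scale) : 0;
        const float vr = mn + q * scale;
        if (std::fabs(vr - v) > max_err) misses++;
    }
    return misses;
}

int main() {
    const float max_err = 0.004f;                        // same default the PR mentions
    std::vector<float> blk(32);
    for (int i = 0; i < 32; i++) blk[i] = 0.002f * i;    // toy block

    int chosen = 4, misses = count_misses(blk, 4, max_err);
    for (int bits = 1; bits <= 4; bits++) {
        if (count_misses(blk, bits, max_err) == 0) {     // lowest precision that fits
            chosen = bits;
            misses = 0;
            break;
        }
    }
    printf("block stored at %d bits, %d weight(s) kept as FP16\n", chosen, misses);
    return 0;
}
```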
Most of the implementation is in ggml_quantize_qx_0 (quantization) and ggml_vec_dot_qx_0_q8_0 (dequantization + multiplication).

I should add, since the implementation behind QX_0 is very generalized, it's not really meant for inference use, as it's pretty difficult to optimize. It's rather meant to be used as an exploration / guidance tool to see which quantization rules / mixes allow for great perplexity at a minimal file size. For example, one could use QX_0 to explore which rows within the model need higher precision / lower RMSE than others, and then develop a fast quantization scheme that only mixes 4bit and 2bit rows.
Above is the overall structure of a QX_0 block. This example block mixes 16bit and 2bit weights together (the metadata byte indicating that it's a 2bit quantized block), with each weight corresponding to a single bit inside f16_indicator: 0 means the weight is quantized, 1 means it's stored as full FP16. q_params are two FP16 numbers which store the offset and multiplier for dequantization, similar to how Q4_1 works. This can be easily changed within the code, since the structure of each block is pretty much arbitrary and is only known by the quantizer and dequantizer.
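For readers who think in structs, here is a rough, fixed-size approximation of the example layout just described; the real QX_0 block is a variable-length bit stream, and the block size QK_X and all field names below are assumptions rather than the PR's actual definitions:

```cpp
// Conceptual sketch only: the real QX_0 data is a variable-length bit stream that is
// interpreted solely by the quantizer/dequantizer, so a fixed struct cannot capture it.
#include <cstdint>
#include <cstdio>

#define QK_X 256                       // assumed block size (k-quants-style superblock)

typedef uint16_t ggml_fp16_t_sketch;   // stand-in for ggml's FP16 storage type

struct block_qx0_2bit_sketch {
    uint8_t            metadata;                 // quantized precision of this block (2 here)
    ggml_fp16_t_sketch q_params[2];              // offset + multiplier, Q4_1-style
    uint8_t            f16_indicator[QK_X / 8];  // 1 bit per weight: 0 = quantized, 1 = FP16
    uint8_t            q_weights[QK_X * 2 / 8];  // 2-bit codes for the quantized weights
    // FP16 outliers follow in the stream for every weight whose indicator bit is 1;
    // their count varies per block, which is why the real format is not a fixed struct.
};

int main() {
    printf("fixed part of a 2-bit block: %zu bytes for %d weights\n",
           sizeof(block_qx0_2bit_sketch), QK_X);
    return 0;
}
```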