
Vulkan: Support fp32 accumulator in quantized matmul to fix GLM4-32B incoherence #13607

Draft · 0cc4m wants to merge 1 commit into master

Conversation

0cc4m (Collaborator) commented on May 17, 2025:

Currently we only support fp16 accumulators in quantized mul_mat, which leads to numerical issues with GLM4-32B. Some of these have already been addressed via the model precision parameter, but that parameter was being ignored by the Vulkan backend.
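
For reference, a minimal sketch of what honoring the precision parameter could look like on the backend side. It assumes the precision requested via ggml_mul_mat_set_prec() ends up in op_params[0] of the destination tensor; the actual pipeline selection in the Vulkan backend is more involved.

```cpp
// Minimal sketch, not the actual backend code: decide whether a quantized
// mul_mat should use the fp32-accumulator shader variant. Assumes the
// precision set via ggml_mul_mat_set_prec() is stored in dst->op_params[0].
#include "ggml.h"

static bool mul_mat_wants_f32_acc(const struct ggml_tensor * dst) {
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    return prec == GGML_PREC_F32;
}
```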

However, this is a draft because it only solves part of the problem. There still seem to be tensors that fail with fp16 accumulators but have not been set to GGML_PREC_F32 in llama.cpp, and I'm not sure why this only affects the Vulkan backend. Forcing all tensors to run with fp32 accumulation resolves the incoherence, as sketched below.
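
That brute-force experiment can be reproduced at graph-build time with something along these lines (a debug sketch, not part of this PR; it uses the public graph accessors from ggml.h):

```cpp
// Debug sketch: force fp32 accumulation for every mul_mat node in the graph.
#include "ggml.h"

static void force_mul_mat_f32_prec(struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op == GGML_OP_MUL_MAT) {
            ggml_mul_mat_set_prec(node, GGML_PREC_F32);
        }
    }
}
```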

I don't have a good way of finding all of these tensors. The symptom seems to be infinity values in the result. Using the Vulkan backend's internal results checker points me at the first problematic tensor: blk.1.attn_output.weight.
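
To illustrate what the checker flags, here is a hypothetical helper that scans an fp32 result buffer for non-finite values; the real checker in the Vulkan backend also compares against reference results, as in the dump below.

```cpp
// Hypothetical helper: report the first non-finite element in an fp32 buffer.
#include <cmath>
#include <cstdio>

static bool report_nonfinite(const char * name, const float * data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            std::printf("non-finite value in %s at flat index %zu: %f\n",
                        name, i, (double) data[i]);
            return true;
        }
    }
    return false;
}
```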

The trouble with GLM4 seems to be unusually large intermediate values, which exceed what fp16 can represent (see the check after the dump below). I'm opening this in the hope that someone can help me figure out the rest of the problem, especially why it doesn't affect CUDA or Metal.

ERROR: Invalid value in MUL_MAT i3=0 i2=0 i1=295 i0=3321 result=-inf correct=-68706.3 avg_err=0.0281893
tensor=0x7c0f1a8ac4c0 tensor->name=node_66 tensor->type: f32 ne0=6144 nb0=4 ne1=512 nb1=24576 ne2=1 nb2=12582912 ne3=1 nb3=12582912 offset=0
src0=0x5bf32bb72160 src0->name=blk.1.attn_output.weight op=NONE type=q4_0 ne0=6144 nb0=18 ne1=6144 nb1=3456 ne2=1 nb2=21233664 ne3=1 nb3=21233664 offset=0
src1=0x7c0f1a8ac350 src1->name=kqv_out-1 op=CONT type=f32 ne0=6144 nb0=4 ne1=512 nb1=24576 ne2=1 nb2=12582912 ne3=1 nb3=12582912 offset=0
First error: result=-141.625 correct=-78.4177 i3=0 i2=0 i1=0 i0=114

Result:
             290     291     292     293     294     295     296     297     298     299
   3316: -564.50  3044.00  2774.00   37.53 -856.00  3146.00 -652.50  2886.00   25.59  292.25
   3317:  7000.00 -6492.00 -13120.00  6288.00  6472.00 -11536.00  7540.00 -9336.00  6292.00 -1033.00
   3318: -896.50 -4044.00  7040.00 -1930.00  1650.00  607.50 -1560.00 -857.00 -2019.00  6472.00
   3319:  1666.00 -7552.00  2744.00  133.62  4588.00 -4332.00  1145.00 -4972.00   35.28  7120.00
   3320:  1283.00  5496.00 -1505.00  2428.00 -968.50  2184.00  1628.00  3756.00  2494.00 -2484.00
   3321:  20160.00 -42592.00 -36096.00  16096.00  31488.00    -inf  22592.00 -47328.00  16320.00  11688.00
   3322:  6868.00  2464.00 -1387.00  7196.00  5392.00 -2434.00  7032.00  1039.00  7232.00  3626.00
   3323: -19840.00 -4280.00 -4082.00 -19856.00 -17744.00  5640.00 -19696.00 -3396.00 -19824.00 -17040.00
   3324:  5404.00 -16976.00 -6708.00  2666.00  9808.00 -15056.00  5020.00 -15328.00  2508.00  7612.00
   3325:  4688.00 -7400.00 -10216.00  3648.00  5172.00 -10080.00  5036.00 -9320.00  3646.00   22.81

Correct:
             290     291     292     293     294     295     296     297     298     299
   3316: -570.66  3041.39  2781.36   34.57 -853.04  3156.60 -657.01  2892.20   24.63  292.58
   3317:  6978.54 -6484.46 -13017.67  6266.93  6449.19 -11507.74  7519.72 -9342.18  6274.81 -1018.87
   3318: -889.91 -4043.65  7033.03 -1927.52  1673.27  614.18 -1548.47 -865.19 -2015.27  6494.03
   3319:  1666.02 -7551.79  2731.42  132.06  4592.75 -4339.87  1142.80 -4980.06   32.58  7129.88
   3320:  1278.52  5510.90 -1512.11  2424.03 -973.02  2192.46  1616.08  3763.04  2492.45 -2482.38
   3321:  20265.07 -42728.43 -36189.09  15949.43  31408.91 -68706.27  22610.84 -47450.00  16149.71  11702.63
   3322:  6874.31  2487.35 -1370.15  7227.34  5416.84 -2413.33  7039.71  1047.16  7246.70  3627.93
   3323: -20007.28 -4222.13 -4072.23 -20024.71 -17799.77  5710.86 -19857.95 -3403.51 -20001.98 -17118.29
   3324:  5380.83 -17033.93 -6779.77  2656.03  9755.27 -15092.67  4998.03 -15351.77  2491.30  7547.54
   3325:  4684.53 -7444.94 -10239.35  3656.67  5151.60 -10132.43  5022.94 -9403.43  3652.60   24.86

MUL_MAT gpu=0
 NONE gpu=0
 CONT gpu=0
  PERMUTE gpu=0
   MUL_MAT gpu=0
    VIEW gpu=0
     NONE gpu=0
    SOFT_MAX gpu=0
     MUL_MAT gpu=0
      VIEW gpu=0
       NONE gpu=0
      PERMUTE gpu=0
       ROPE gpu=0
        RESHAPE gpu=0
         MUL_MAT gpu=0
          NONE gpu=0
          MUL gpu=0
        NONE gpu=0
     NONE gpu=0
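
For what it's worth, the single -inf in the result excerpt above sits exactly where the reference value (-68706.27) exceeds the largest finite fp16 magnitude (65504), so an fp16 accumulator overflows at that element. A trivial check makes this explicit:

```cpp
// Worked check: fp16 cannot represent magnitudes above 65504, so an
// accumulator reaching -68706.27 (i1=295, i0=3321 above) overflows to -inf.
#include <cmath>
#include <cstdio>

int main() {
    const float fp16_max = 65504.0f;   // largest finite fp16 value
    const float correct  = -68706.27f; // reference value from the checker dump
    std::printf("|%.2f| > %.0f -> fp16 overflow: %s\n",
                correct, fp16_max,
                std::fabs(correct) > fp16_max ? "yes" : "no");
    return 0;
}
```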

0cc4m requested a review from jeffbolznv on May 17, 2025 at 15:25.
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 17, 2025.
jeffbolznv (Collaborator) commented:

I can try to repro this on Monday. I wonder if differences in split_k/stream_k might explain the difference vs. other backends? If the partial results are all in range, then the resolve can accumulate them at fp32.
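
A rough illustration of that idea, assuming a split_k-style decomposition where each partial result covers a slice of the K dimension: even if every partial stays within fp16 range, summing them in an fp32 resolve pass avoids the final overflow.

```cpp
// Sketch of an fp32 resolve over split_k partials (hypothetical, not the
// backend's actual resolve shader): each partial covers a slice of K.
#include <vector>

static float resolve_split_k(const std::vector<float> & partials) {
    float acc = 0.0f;                  // accumulate the partials in fp32
    for (const float p : partials) {
        acc += p;
    }
    return acc;
}
```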
