Vulkan: Support fp32 accumulator in quantized matmul to fix GLM4-32B incoherence #13607
Currently we only support fp16 accumulators in quantized mul mat, which leads to numerical issues with GLM4-32B. Some of these have been addressed with the model precision parameter, but that parameter was being ignored by the Vulkan backend.
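For context, the precision request set with `ggml_mul_mat_set_prec()` is stored in the tensor's op params, and a backend can read it when choosing between fp16 and fp32 accumulation. A minimal sketch of that check follows; the helper name is mine, and reading `op_params[0]` directly mirrors what other backends do, but the exact layout is an assumption rather than a documented contract:

```cpp
// Sketch: how a backend can honor the mul_mat precision request.
// The precision set via ggml_mul_mat_set_prec() ends up in the
// tensor's op_params; op_params[0] is assumed here, matching how
// other backends read it.
#include "ggml.h"

static bool mul_mat_needs_f32_acc(const struct ggml_tensor * dst) {
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    return prec == GGML_PREC_F32;
}
```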
However, this is a draft because it solves only part of the problem. There still seem to be tensors that fail with fp16 accumulators but have not been set to GGML_PREC_F32 in llama.cpp. I'm not sure why this only affects the Vulkan backend. Forcing all tensors to run with fp32 accumulation resolves the incoherence.
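For reference, one way to reproduce the "force everything to fp32" experiment is to set the precision on every mul_mat node after the graph is built. This is a hypothetical debugging hack, not part of this PR:

```cpp
// Debugging hack (not part of this PR): force fp32 accumulation on
// every mul_mat node in a compute graph before it is executed.
#include "ggml.h"

static void force_f32_mul_mat(struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op == GGML_OP_MUL_MAT) {
            ggml_mul_mat_set_prec(node, GGML_PREC_F32);
        }
    }
}
```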
I don't have a good way of finding all of these tensors. The problem seems to be infinity values in the result. Using the internal results checker of the Vulkan backend gives me the first of the problematic tensors: `blk.1.attn_output.weight`. The trouble with GLM4 seems to be unusually large values. I'm opening this in the hopes that someone can help me figure out the rest of the problem, especially why this doesn't affect CUDA or Metal.
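For anyone trying to reproduce this, the kind of check involved is roughly the following: copy the suspect tensor's output back to the host and scan for non-finite values. The helper below is a hypothetical sketch assuming an fp32 host-side copy, not the actual results-checker code:

```cpp
// Hypothetical helper: scan a host-side fp32 copy of a tensor's
// output for non-finite values, the symptom seen with GLM4-32B.
#include <cmath>
#include <cstdio>

static bool has_non_finite(const float * data, size_t n, const char * name) {
    for (size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            fprintf(stderr, "%s: non-finite value %f at index %zu\n",
                    name, data[i], i);
            return true;
        }
    }
    return false;
}
```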