Vulkan: Support fp32 accumulator in quantized matmul to fix GLM4-32B incoherence #13607
Currently we only support fp16 accumulators in quantized mul mat, which leads to numerical issues with GLM4-32B. Some of these have been addressed with the model precision parameter, but that parameter was being ignored by the Vulkan backend.
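For context, the precision request set with `ggml_mul_mat_set_prec()` is stored in the tensor's op params, and a backend can read it when choosing between fp16 and fp32 accumulation. A minimal sketch of that check follows; the helper name is mine, and reading `op_params[0]` directly mirrors what other backends do, but the exact layout is an assumption rather than a documented contract:

```cpp
// Sketch: how a backend can honor the mul_mat precision request.
// The precision set via ggml_mul_mat_set_prec() ends up in the
// tensor's op_params; op_params[0] is assumed here, matching how
// other backends read it.
#include "ggml.h"

static bool mul_mat_needs_f32_acc(const struct ggml_tensor * dst) {
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    return prec == GGML_PREC_F32;
}
```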
However, this is a draft because it solves only part of the problem. There still seem to be tensors that fail with fp16 accumulators but have not been set to GGML_PREC_F32 in llama.cpp. I'm not sure why this only affects the Vulkan backend. Forcing all tensors to run with fp32 accumulation resolves the incoherence.
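For reference, one way to reproduce the "force everything to fp32" experiment is to set the precision on every mul_mat node after the graph is built. This is a hypothetical debugging hack, not part of this PR:

```cpp
// Debugging hack (not part of this PR): force fp32 accumulation on
// every mul_mat node in a compute graph before it is executed.
#include "ggml.h"

static void force_f32_mul_mat(struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op == GGML_OP_MUL_MAT) {
            ggml_mul_mat_set_prec(node, GGML_PREC_F32);
        }
    }
}
```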
I don't have a good way of finding all of these tensors. The problem seems to be infinity values in the result. Using the internal results checker of the Vulkan backend gives me the first of the problematic tensors: `blk.1.attn_output.weight`. The trouble with GLM4 seems to be unusually large values. I'm opening this in the hopes that someone can help me figure out the rest of the problem, especially why this doesn't affect CUDA or Metal.
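For anyone trying to reproduce this, the kind of check involved is roughly the following: copy the suspect tensor's output back to the host and scan for non-finite values. The helper below is a hypothetical sketch assuming an fp32 host-side copy, not the actual results-checker code:

```cpp
// Hypothetical helper: scan a host-side fp32 copy of a tensor's
// output for non-finite values, the symptom seen with GLM4-32B.
#include <cmath>
#include <cstdio>

static bool has_non_finite(const float * data, size_t n, const char * name) {
    for (size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            fprintf(stderr, "%s: non-finite value %f at index %zu\n",
                    name, data[i], i);
            return true;
        }
    }
    return false;
}
```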