ggml: adds CONV_2D op and direct GEMM Vulkan implementation #14316
base: master
Conversation
…mory efficient instant GEMM based Vulkan implementation

* ggml: adds op CONV_2D, ggml_conv_2d_direct
* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with GEMM (no need for im2col)
* test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs
Very cool!
@@ -1858,6 +1858,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
            {
                ggml_compute_forward_im2col_back_f32(params, tensor);
            } break;
        case GGML_OP_CONV_2D:
            {
                GGML_ABORT("Op not supported on CPU yet.");
I think a CPU implementation is generally required.
const uint32_t Bsh_len = BS_CRS*Bsh_stride;

shared float Ash[Ash_len]; // K x CRS
shared float Bsh[Bsh_len]; // CRS x NPQ
What is the total shared memory needed? Do we need a runtime check?
A 128x16 tile size works well, which needs 16.125 KB of shared memory at minimum, so I don't think a check is absolutely needed. But it would be good to test to make sure we have the required amount in all kernels.
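(For reference, 16.125 KiB is consistent with two f32 tiles of 16 x (128 + 1) elements each, i.e. 2 x 16 x 129 x 4 B = 16512 B = 16.125 KiB. The one-element row padding per tile is my own assumption, presumably to avoid bank conflicts; only Bsh_len's formula is shown above.)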
Vulkan requires a minimum of 16KB, so we probably should have a check.
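A minimal sketch of what such a host-side check could look like (plain Vulkan C++, not tied to the ggml-vulkan device structs; `required_shmem_bytes` is assumed to come from the chosen tile sizes):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstdio>

// Returns true if the device advertises enough shared (workgroup) memory for
// the conv_2d tiles; otherwise the caller should fall back to im2col + matmul.
static bool conv2d_shmem_fits(VkPhysicalDevice phys_dev, uint32_t required_shmem_bytes) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(phys_dev, &props);
    // The Vulkan spec only guarantees maxComputeSharedMemorySize >= 16384 bytes,
    // so any tile configuration above 16 KB needs a runtime check.
    if (required_shmem_bytes > props.limits.maxComputeSharedMemorySize) {
        fprintf(stderr, "conv2d: need %u B shared memory, device offers %u B\n",
                (unsigned) required_shmem_bytes,
                (unsigned) props.limits.maxComputeSharedMemorySize);
        return false;
    }
    return true;
}
```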
void main(){
    if(gl_WorkGroupID.x == gl_NumWorkGroups.x-1 || gl_WorkGroupID.y == gl_NumWorkGroups.y-1){
        mainLoopBoundaryCheck();
Is it possible to just pass in a boolean true/false here and && it with the condition? I think it'll be more readable than all the macros
I tested adding the bool expression in the loop but it slows down the execution considerably: 3.7 TFLOPS compared to 4.16 TFLOPS with macros.
https://gist.github.com/etasnadi/a8a3a67fc904c86f35de74c3f8ef819c#file-conv2d_mm-comp
You should write it with the branch in the main function and each side using true/false. The compiler will inline both copies and fold the constants.
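A sketch of the suggested structure in C-style code (the names `main_loop`, `run_block`, `NB_CRS` and `is_boundary_block` are illustrative stand-ins for the shader's symbols, not its actual code). Because the flag is a literal at each call site, the compiler can clone the loop body and fold the branch away:

```cpp
#include <cstdint>

static const uint32_t NB_CRS = 8;  // hypothetical number of CRS blocks per tile

// One loop body, parameterized on whether bounds checking is needed.
// Called with a literal true/false, the compiler inlines both copies and
// folds the constant, so the interior path carries no per-element checks.
static void main_loop(bool boundary_check) {
    for (uint32_t b = 0; b < NB_CRS; b++) {
        if (boundary_check) {
            // guarded loads: zero-fill elements that fall outside the tensor
        } else {
            // unguarded loads: the whole tile is known to be in range
        }
        // ... shared-memory tile update / outer products ...
    }
}

// Entry-point analogue: decide once per workgroup which variant to run.
void run_block(bool is_boundary_block) {
    if (is_boundary_block) {
        main_loop(true);
    } else {
        main_loop(false);
    }
}
```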
Thanks, this finally helped to produce better code. I don't want to trigger the pipelines so I pushed the updates to a different branch (etasnadi@fe85b44), but it seems that the FLOPS are much higher:
Vulkan TFLOPS:

device               direct  indirect
------               ------  --------
RTX 2060             5.33    3.40
GTX 1060 (Notebook)  2.2     1.73
Command: GGML_VK_CONV_2D_CONFIG=256,128,16,128,16 GGML_VK_DISABLE_COOPMAT=1 ./bin/test-backend-ops -o CONV_2D_INDIRECT_IMPL -b Vulkan0 perf
In the meantime I implemented the CUDA version and here's the performance:
CUDA TFLOPS:

device               direct  indirect
------               ------  --------
GTX 1060 (Notebook)  2.2     2.5
RTX 2060             5.02    6.14
This suggests that my proposed algorithm is slower than the indirect one when the latter is well optimized, and that the indirect Vulkan kernels either can be further optimized or have parameters that are not general enough to perform well on my devices.
Edit: CUDA branch added: etasnadi@c71890e
Turns out that the CUDA matmul uses cuBLAS if possible:
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, line 1341 in 40bfa04:
cublasSgemm(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
On my RX 470 the indirect op is faster as well. IMO it's worth testing with more input and kernel sizes like what we have for im2col, and the real test would be to get this set up with stablediffusion.cpp (though that project hasn't been updated for months) to see how it does with an actual model.
Sure, older models might introduce other bottlenecks that cause the shader to slow down, but the memory saving is still a considerable advantage. I'm thinking about reimplementing the shader in CUDA so I can profile it with Nsight to see what causes the issue (hopefully it still supports ancient cards).
void mainLoop ## FUNC_NAME_SUFFIX(){\
    initReg();\
    /* Advance block in CRS dim */\
    for(uint32_t B_idx_CRS = 0; B_idx_CRS < NB_CRS; B_idx_CRS++){\
You might need to add [[unroll]] on loops with constant trip count. Sometimes the compiler will do this automatically, but when there are nested loops sometimes it won't.
I annotated all the loops except this one (because the main loop is sequential) and I get the same FLOPS, unfortunately.
}

void outProdReg(){
    for(uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++){
Can you use coopmat here to do the outer products in parallel?
If you're asking whether the algorithm supports coopmats: yes, we can add that later, but for now I'm focusing on achieving good enough FLOPS relative to what can be achieved with the scalar matmul kernel.
This patch adds support for direct computation of 2D convolution on the Vulkan backend. It takes the form of a custom GEMM that loads the relevant data from the kernel and input into shared memory, so it does not need to materialize the convolution matrix in global memory with im2col, saving a lot of memory (similar to how the op is implemented in cuDNN). This can theoretically result in faster kernels than im2col->matmul because the full helper matrix never has to travel between global memory and registers, and the repeated elements of the (virtual) helper matrix can be pulled from L2.
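For context, a minimal CPU-side sketch of the implicit-GEMM view described above (names, the single-image NCHW layout, and the omission of dilation are illustrative assumptions, not the shader's actual code):

```cpp
// Implicit GEMM view of 2D convolution: out[K x NPQ] = A[K x CRS] * B[CRS x NPQ],
// where A is the flattened kernel and B is never materialized -- each element
// is gathered from the input tensor on demand.
float get_virtual_B(const float * input,
                    int crs, int npq,       // coordinates in the virtual B matrix
                    int H, int W,           // input spatial size
                    int KH, int KW,         // kernel spatial size
                    int OW,                 // output width (npq = oy * OW + ox)
                    int stride, int pad) {
    const int c  = crs / (KH * KW);         // input channel
    const int ky = (crs / KW) % KH;         // kernel row
    const int kx = crs % KW;                // kernel column
    const int oy = npq / OW;                // output row
    const int ox = npq % OW;                // output column
    const int iy = oy * stride + ky - pad;  // input row
    const int ix = ox * stride + kx - pad;  // input column
    if (iy < 0 || iy >= H || ix < 0 || ix >= W) {
        return 0.0f;                        // zero padding
    }
    return input[(c * H + iy) * W + ix];
}
```

The shader performs this gather per shared-memory tile, which is why only a small slice of the kernel and of the virtual B matrix ever has to live on-chip.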
The performance is 2x compared to im2col->matmul on RTX 2060 (2.15 TFLOPS compared to 4.10 TFLOPS according to test-backend-ops; the theoretical max is ~6 TFLOPS). As a negative result, the indirect op is significantly faster on a GTX 1060 notebook (1.73 vs 1.21 TFLOPS; the theoretical max is ~3 TFLOPS), possibly because the block tile sizes are too big for this older hardware.
The PR also adds support for comparing ops with different implementation graphs in test-backend-ops, so one can compare/test the actual op (potentially fused and optimized, under development) against a reference op even when the new op does not yet have a direct CPU implementation, making op development faster.