vulkan: Adjust coopmat2 tile sizes and selection heuristic #12258

jeffbolznv · 2025-03-07T18:14:58Z

This change selects different tile sizes (M/N/K) for the coopmat2 shaders, with the goal of better optimizing for smaller prompt lengths. It turns out the largest tile size didn't need to be so large, and there were better tile sizes for smaller prompts like pp128.

I ran a variety of prompt lengths with each of the small, medium, and large tile sizes and found that the previous heuristic, which tries to use the largest size that evenly divides the prompt length, isn't optimal. It's better to just round up to the next larger tile size.
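To make the difference between the two heuristics concrete, here is a minimal sketch, not the code from this PR. The tile widths are assumptions for illustration (32 and 128 are the N values mentioned in this description, 64 is a guessed middle size). The function names and the fallback to the smallest tile in the old-style heuristic are hypothetical as well.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical N-dimension tile widths for the small/medium/large shaders.
// 32 and 128 are mentioned in the description; 64 is assumed for illustration.
constexpr std::array<uint32_t, 3> kTileN = {32, 64, 128};

// Old-style heuristic: the largest tile width that evenly divides n.
// Falling back to the smallest tile when nothing divides n is an assumption.
uint32_t pick_tile_largest_divisor(uint32_t n) {
    assert(n > 0);
    for (auto it = kTileN.rbegin(); it != kTileN.rend(); ++it) {
        if (n % *it == 0) {
            return *it;
        }
    }
    return kTileN.front();
}

// New-style heuristic: round up to the smallest tile width that covers n,
// or use the largest tile when n exceeds every width.
uint32_t pick_tile_round_up(uint32_t n) {
    assert(n > 0);
    for (uint32_t t : kTileN) {
        if (n <= t) {
            return t;
        }
    }
    return kTileN.back();
}
```

Under these assumed widths, a prompt length of 96 gets the 32-wide tile from the divisor rule but the 128-wide tile from the round-up rule, which is the kind of case where the two heuristics diverge.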

I think there's still room to improve some prompt lengths by using a mixture of sizes (e.g. see the falloff from pp128 to pp129, which could do better using a mixture of N=128 and N=32 tiles). But I haven't tried that yet.
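As a rough illustration of why a mixture could help at pp129, the sketch below counts how many N columns get computed (including padding) with uniform tiles versus full large tiles plus small tiles for the remainder. This is only a proxy for real performance, and it is not something this PR implements. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Columns computed along N when every tile has width tile_n (n rounded up).
uint32_t padded_uniform(uint32_t n, uint32_t tile_n) {
    assert(tile_n > 0);
    return ((n + tile_n - 1) / tile_n) * tile_n;
}

// Columns computed when full large tiles cover as much of n as possible
// and the remainder is covered by small tiles.
uint32_t padded_mixed(uint32_t n, uint32_t large_n, uint32_t small_n) {
    assert(large_n > 0 && small_n > 0);
    const uint32_t full = (n / large_n) * large_n;  // part covered by large tiles
    const uint32_t rem  = n - full;                 // leftover columns
    return full + padded_uniform(rem, small_n);
}
```

For n = 129, uniform N=128 tiles compute 256 columns, while one N=128 tile plus one N=32 tile computes 160, so far less of the work is padding.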

| test | Phi-3-mini-4k-instruct-q4 master | Phi-3-mini-4k-instruct-q4 PR | Phi-3-mini-4k-instruct-q4 delta | Llama-3.2-3B-Instruct-Q4_0 master | Llama-3.2-3B-Instruct-Q4_0 PR | Llama-3.2-3B-Instruct-Q4_0 delta | DeepSeek-Coder-V2-Lite-Instruct-Q2_K master | DeepSeek-Coder-V2-Lite-Instruct-Q2_K PR | DeepSeek-Coder-V2-Lite-Instruct-Q2_K delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pp5 | 447.15 | 462.92 | 3.53% | 531.84 | 541.87 | 1.89% | 109.61 | 144.33 | 31.68% |
| pp10 | 274.87 | 494.28 | 79.82% | 320.15 | 666.86 | 108.30% | 214.57 | 241.45 | 12.53% |
| pp20 | 564.73 | 991.1 | 75.50% | 636.88 | 1365.68 | 114.43% | 339.73 | 379.24 | 11.63% |
| pp31 | 863.29 | 1418.82 | 64.35% | 1009.82 | 2012.46 | 99.29% | 482.04 | 538.28 | 11.67% |
| pp32 | 894.75 | 1680.77 | 87.85% | 1001.53 | 2822.73 | 181.84% | 514.22 | 584.16 | 13.60% |
| pp33 | 909.48 | 1293.95 | 42.27% | 1087.26 | 1572.45 | 44.63% | 494.02 | 534.39 | 8.17% |
| pp48 | 1347.26 | 1766.27 | 31.10% | 1559.74 | 2225.58 | 42.69% | 669.58 | 725.74 | 8.39% |
| pp54 | 1465.89 | 2030.28 | 38.50% | 1758.62 | 2714.58 | 54.36% | 767.75 | 810.42 | 5.56% |
| pp63 | 1689.89 | 2405.95 | 42.37% | 1920.05 | 2766.84 | 44.10% | 894.12 | 947.79 | 6.00% |
| pp64 | 1702.93 | 2910.49 | 70.91% | 1951.4 | 3495.35 | 79.12% | 909.94 | 963.73 | 5.91% |
| pp65 | 1741.64 | 1903.68 | 9.30% | 2012.49 | 2440.7 | 21.28% | 846.71 | 889.55 | 5.06% |
| pp80 | 2039.35 | 2288.33 | 12.21% | 2654.92 | 3194.68 | 20.33% | 972.21 | 1032.53 | 6.20% |
| pp96 | 2397.05 | 2722.58 | 13.58% | 2997.82 | 3444.56 | 14.90% | 1127.34 | 1202.88 | 6.70% |
| pp112 | 2732.28 | 3245.6 | 18.79% | 3317.34 | 3727.07 | 12.35% | 1280.77 | 1358.74 | 6.09% |
| pp113 | 2787.73 | 3057.98 | 9.69% | 3273.65 | 3841.65 | 17.35% | 1286.89 | 1354.41 | 5.25% |
| pp127 | 3064.93 | 3449.08 | 12.53% | 3983.52 | 4308.19 | 8.15% | 1387.91 | 1462.76 | 5.39% |
| pp128 | 3817.62 | 4258.3 | 11.54% | 4988.4 | 5836.28 | 17.00% | 1285.63 | 1534.76 | 19.38% |
| pp129 | 2246.45 | 2516.47 | 12.02% | 3515.47 | 3659.03 | 4.08% | 1277.89 | 1414.45 | 10.69% |
| pp140 | 2357.01 | 2710.37 | 14.99% | 3560.25 | 3791.22 | 6.49% | 1349.21 | 1497.69 | 11.00% |
| pp160 | 2606.54 | 2998.89 | 15.05% | 4175.21 | 4229.96 | 1.31% | 1510.36 | 1653.83 | 9.50% |
| pp180 | 2891.38 | 3292.31 | 13.87% | 4431.38 | 4698.62 | 6.03% | 1594.75 | 1823.14 | 14.32% |
| pp192 | 3036.42 | 3622.36 | 19.30% | 4845.22 | 4999.49 | 3.18% | 1882.77 | 1843.78 | -2.07% |
| pp200 | 3106.62 | 3610.68 | 16.23% | 5037.88 | 5320.26 | 5.61% | 1609.13 | 1904.71 | 18.37% |
| pp210 | 3245.74 | 3771.82 | 16.21% | 5317.9 | 5313.79 | -0.08% | 1657.07 | 1942.78 | 17.24% |
| pp230 | 3493.62 | 4152.18 | 18.85% | 5512.73 | 5486.64 | -0.47% | 1726.6 | 2047.33 | 18.58% |
| pp248 | 3678.74 | 4357.96 | 18.46% | 5981.39 | 6198.01 | 3.62% | 1830.46 | 2142.02 | 17.02% |
| pp255 | 3719.19 | 4338.61 | 16.65% | 6385.1 | 6439.46 | 0.85% | 1850.39 | 2173.92 | 17.48% |
| pp256 | 4818.48 | 5035.21 | 4.50% | 6876.82 | 6844.45 | -0.47% | 1922.72 | 2281.41 | 18.66% |
| pp257 | 2898.17 | 3605.43 | 24.40% | 5340.83 | 5004.51 | -6.30% | 1652.4 | 2016.53 | 22.04% |
| pp280 | 3143.84 | 3830.24 | 21.83% | 5523.79 | 5364.38 | -2.89% | 1740.22 | 2143.03 | 23.15% |
| pp300 | 3300.25 | 4080.59 | 23.64% | 5917.37 | 5847.97 | -1.17% | 1800.03 | 2199.72 | 22.20% |
| pp320 | 3498.11 | 4217.64 | 20.57% | 6204.19 | 6174.56 | -0.48% | 2291.67 | 2310.74 | 0.83% |
| pp350 | 3672.25 | 4548.51 | 23.86% | 6630.85 | 6418.19 | -3.21% | 1787.9 | 2326.1 | 30.10% |
| pp384 | 4251.46 | 5316.02 | 25.04% | 7845.15 | 7781.41 | -0.81% | 2301.83 | 2466.57 | 7.16% |
| pp410 | 3478.13 | 4287.22 | 23.26% | 6224.08 | 6335.52 | 1.79% | 1741.14 | 2356.38 | 35.34% |
| pp448 | 3741.12 | 4645.57 | 24.18% | 6700.98 | 6655.14 | -0.68% | 2399.2 | 2439.5 | 1.68% |
| pp480 | 3886.56 | 4830.78 | 24.29% | 6970.5 | 7216.27 | 3.53% | 1692.51 | 2405.76 | 42.14% |
| pp490 | 3980.64 | 4909.99 | 23.35% | 7010.11 | 7031.98 | 0.31% | 1691.38 | 2420.9 | 43.13% |
| pp511 | 4051.34 | 5030.26 | 24.16% | 7622.17 | 7214.6 | -5.35% | 1711.46 | 2461.29 | 43.81% |
| pp512 | 5434.86 | 5435.16 | 0.01% | 7948.65 | 7810.69 | -1.74% | 2469.96 | 2498.51 | 1.16% |
| pp513 | 4935.44 | 5077.35 | 2.88% | 7085.09 | 6986.99 | -1.38% | 2382.88 | 2423.44 | 1.70% |
| pp767 | 4633.69 | 4985.33 | 7.59% | 7023.5 | 7201.18 | 2.53% | 2192.93 | 2371.51 | 8.14% |
| pp768 | 5171.77 | 5251.81 | 1.55% | 7373.88 | 7401.12 | 0.37% | 2265.41 | 2384.82 | 5.27% |
| pp769 | 4143.96 | 4575.35 | 10.41% | 6645.42 | 6699.65 | 0.82% | 2127.2 | 2295.49 | 7.91% |
| pp1023 | 4575.58 | 5227.51 | 14.25% | 7599.04 | 7661.06 | 0.82% | 2010.02 | 2436.48 | 21.22% |
| pp1024 | 5324.02 | 5318.13 | -0.11% | 7738.87 | 7874.87 | 1.76% | 2448.96 | 2447.92 | -0.04% |
| pp1025 | 5119.71 | 5118.21 | -0.03% | 7025.86 | 7397.71 | 5.29% | 2385.33 | 2384.26 | -0.04% |
| pp2047 | 4748.43 | 5093.44 | 7.27% | 7197.72 | 7268.61 | 0.98% | 2130.06 | 2344.45 | 10.06% |
| pp2048 | 5152.48 | 5185.36 | 0.64% | 7290.56 | 7404.3 | 1.56% | 2357.14 | 2353.71 | -0.15% |
| pp2049 | 5045.99 | 5021.26 | -0.49% | 7065.23 | 7198.21 | 1.88% | 2309.27 | 2332.74 | 1.02% |

Use VK_KHR_pipeline_executable_properties to query the register count, and use that to try to better estimate how many workgroups can fit in the SMs. Particularly with recent tile size changes (ggml-org#12258) the old heuristic is out of date. This heuristic benefits both coopmat1 and coopmat2 paths on NVIDIA. Would be good if somebody can hook up the missing details for other hardware. Calling getPipelineExecutableStatisticsKHR required more fully initializing Vulkan-HPP. The steps needed are documented in the Vulkan-HPP readme.
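The idea of turning a register count into an occupancy estimate can be sketched roughly as follows. This is not the code from the referenced PR: the function name is hypothetical, the per-SM register file size is a parameter the caller must supply, and real occupancy also depends on limits this sketch ignores, such as shared memory.

```cpp
#include <cassert>
#include <cstdint>

// Rough estimate of how many workgroups fit on one SM when registers are the
// only limiting resource. regs_per_sm is an assumed, hardware-specific input.
uint32_t estimate_workgroups_per_sm(uint32_t regs_per_thread,
                                    uint32_t threads_per_workgroup,
                                    uint32_t regs_per_sm) {
    assert(regs_per_thread > 0 && threads_per_workgroup > 0);
    // Use 64-bit arithmetic so the product cannot overflow.
    const uint64_t regs_per_workgroup =
        static_cast<uint64_t>(regs_per_thread) * threads_per_workgroup;
    return static_cast<uint32_t>(regs_per_sm / regs_per_workgroup);
}
```

For example, with an assumed 65536 registers per SM, a shader using 128 registers per thread and 256 threads per workgroup would fit two workgroups per SM under this estimate.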

jeffbolznv · 2025-03-11T14:38:17Z

Note that the Q4_0 perf in the description is now out of date; see the discussion in #12319.


jeffbolznv requested a review from 0cc4m March 7, 2025 18:15

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 7, 2025

jeffbolznv mentioned this pull request Mar 8, 2025

vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking #12273

Merged

jeffbolznv mentioned this pull request Mar 10, 2025

vulkan: query register count and use it in a better split_k heuristic #12319

Closed

vulkan: Adjust coopmat2 tile sizes and selection heuristic

1577cfd

jeffbolznv force-pushed the coopmat2_tile_size branch from 29daff0 to 1577cfd Compare March 11, 2025 14:37

0cc4m approved these changes Mar 17, 2025


0cc4m merged commit 2f21123 into ggml-org:master Mar 17, 2025
47 checks passed

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025

vulkan: Adjust coopmat2 tile sizes and selection heuristic (ggml-org#…

19aab34

…12258)
