"Large" and "sleep" versions of "CL N-pipe" #17

void234 · 2021-02-23T21:15:48Z

It is inefficient to poll the GPU for results, wasting CPU time and
(for dGPUs) PCIe bandwidth, especially if the CPU is powerful while
the (i)GPU is not. The original "CL N-pipe" cores and the OpenCL
kernels are left untouched; only the scheduling code is modified, to
permit 100 times larger work units ("CL 1-pipe large" etc.) and also
to flush the assignment to the GPU and put the CPU to sleep
("CL 1-pipe sleep" etc.).

"Large" cores are marginally faster than the original ones.
"Sleep" cores are slightly slower than "large" ones because the GPU
may sometimes finish processing a work unit while the CPU is still
asleep. These cores, however, consume zero CPU. All other cores
consume one logical CPU unless the sleep is performed transparently
by the GPU driver; Intel does this for gen8 but not for newer GPUs,
and it helps only if the work unit is large enough for the CPU to
sleep for several milliseconds. This results in higher power
efficiency and, if we are not limited by TDP, a significant
performance improvement. The effect is more pronounced when the CPU
does not support SMT (Hyper-Threading).
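The "sleep" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch's code: after the work unit is submitted and flushed (for example with `clFlush()`), the thread sleeps for most of the expected run time, estimated from the throughput of the previous work unit, and only then calls the blocking `clWaitForEvents()`. The helper names `sleep_budget_us` and `worth_sleeping` and the `safety` factor are hypothetical. Undershooting (`safety` below 1) is one way to limit the case where the GPU finishes while the CPU is still asleep.

```c
#include <stdint.h>

/* Hypothetical helper (illustration only, not the patch's code):
   how long a "sleep" core could sleep after flushing a work unit to the
   GPU and before calling the blocking wait.
   keys_queued - keys in the work unit just submitted
   keys_per_us - GPU throughput measured on the previous work unit
   safety      - fraction of the expected run time to sleep; values below 1
                 undershoot so the GPU rarely sits idle while the CPU sleeps */
uint64_t sleep_budget_us(uint64_t keys_queued, double keys_per_us, double safety)
{
    if (keys_per_us <= 0.0 || safety <= 0.0)
        return 0;               /* no throughput estimate yet: do not sleep */
    if (safety > 1.0)
        safety = 1.0;           /* never plan to sleep past the expected end */
    double expected_us = (double)keys_queued / keys_per_us;
    return (uint64_t)(expected_us * safety);
}

/* Hypothetical helper: sleeping only pays off when the budget is at least
   a few milliseconds, which is why the work units must be large. */
int worth_sleeping(uint64_t budget_us, uint64_t min_sleep_us)
{
    return budget_us >= min_sleep_us;
}
```

For example, at about 190 Mkeys/s (190 keys/us, roughly the UHD 620 figure below) a 1,900,000-key work unit runs for about 10 ms, so `sleep_budget_us(1900000, 190.0, 0.5)` gives 5000 us of sleep before the wait.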

Note that with "sleep" cores there is no need to manually limit the
number of threads for the CPU cruncher.

Performance and efficiency could be improved further by growing the
work unit size faster. Wider testing and benchmarking (especially on
high-end GPUs) are welcome.
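One way to grow the work unit size faster is geometric growth, sketched below. This is a hypothetical rule for illustration: the function name, the 2x step, the "halve above twice the target" rule and the target run time are all assumptions, not what the patch does. Doubling reaches a multi-millisecond GPU run in a few steps, which is what a "sleep" core needs in order to sleep meaningfully.

```c
#include <stdint.h>

/* Hypothetical work-unit sizing rule (illustration only):
   - last run shorter than the target: double the work unit
   - last run more than twice the target: halve it
   - otherwise keep it
   The result is clamped to [min_keys, max_keys]. */
uint64_t next_work_unit(uint64_t current_keys, double last_run_ms,
                        double target_ms, uint64_t min_keys, uint64_t max_keys)
{
    uint64_t next = current_keys;
    if (last_run_ms < target_ms)
        /* guard against overflow before doubling */
        next = (current_keys > max_keys / 2) ? max_keys : current_keys * 2;
    else if (last_run_ms > 2.0 * target_ms)
        next = current_keys / 2;
    if (next < min_keys)
        next = min_keys;
    if (next > max_keys)
        next = max_keys;
    return next;
}
```

If run time scales roughly linearly with size, a unit that took 1 ms reaches a 10 ms target after four doublings; the band between the target and twice the target keeps the size from oscillating.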

The benchmarks below were run with the CPU loaded by the
2.9116.525-amd64 client, core #4 (YK AVX2).
The CUDA client is 2.9110.519b, core #10 (CUDA 1-pipe 64-thd sleep 100us).
"521" refers to 2.9112.521 (dnetc-win32-x86-opencl.zip /
dnetc-linux-amd64-opencl.tar.gz).
Power consumption is "measured" with "Core Temp" / "s-tui".

Core i5-8265U (15W, 4C8T, 14 nm, 1.6-3.9 GHz,
Intel UHD Graphics 620 [gen9] 1100 MHz), Ubuntu 20.04
CL 2-pipe/large/sleep

Mode               CPU     iGPU   Summary Power Efficiency
521, 7 threads     124      150       274    15   18.27
521, 8 threads     127      150       277    15   18.47
521, iGPU only       0      184       184    15   12.27
CPU only           181        0       181    15   12.07
Sleep, 8 threads   135      148       283    15   18.87
iGPU only, sleep     0      186       186    15   12.40
iGPU only, large     0      186       187    15   12.47

[1.022 efficiency improvement, "sleep" is optimal]
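The Efficiency column appears to be the Summary rate divided by Power, and each bracketed improvement figure the ratio of the best "sleep"/"large" efficiency to the best "521" efficiency. A small helper to reproduce them (the function names are illustrative, not from the client):

```c
/* Efficiency column: combined rate ("Summary") divided by power (W).
   Returns 0 when no power figure is available. */
double efficiency(double summary_rate, double power_w)
{
    return power_w > 0.0 ? summary_rate / power_w : 0.0;
}

/* Bracketed "improvement" notes: ratio of a new figure to a baseline one.
   Works for efficiencies as well as raw rates. */
double improvement(double new_value, double old_value)
{
    return old_value > 0.0 ? new_value / old_value : 0.0;
}
```

Here `efficiency(283, 15)` is about 18.87 and `improvement(efficiency(283, 15), efficiency(277, 15))` about 1.022, matching the table above; likewise `improvement(316, 279)` gives the 1.133 CPU+iGPU+dGPU performance figure of the i5-5200U table.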

Core i7-9700K (95W, 8C8T, 14 nm, 3.6-4.9 GHz,
Intel UHD Graphics 630 [gen9] 1200 MHz), Windows 10 20H2
CL 2-pipe/large/sleep

Mode               CPU     iGPU   Summary Power Efficiency
521, 8 threads     480       92       572    95    6.02
521, 7 threads     406      187       593    95    6.24
Sleep, 8 threads   457      178       635    95    6.68
Sleep, 7 threads   403      188       591    95    6.22
CPU only           473        0       473    95    4.98
iGPU only, sleep     0      188       188    22    8.55
iGPU only, large     0      190       190    44    4.32

<Note the terrible power efficiency of polling: "large" (44 W) vs "sleep" (22 W)>
[1.071 efficiency improvement, "sleep" is optimal]

Core i5-5200U (15W, 2C4T, 14 nm, 2.2-2.7 GHz,
Intel HD Graphics 5500 [gen8] 900 MHz)
NVidia GeForce 820M 2048 MB, ForceWare 382.05
Windows 10 20H2
CL 4-pipe/large/sleep

Mode                   CPU   iGPU   dGPU  Summary  Power*  Efficiency*
521, 4 threads          66     59      0      125      15         8.33
521, 3 threads          63    167      0      230    21.4        10.75
521, 3 threads, CUDA    29    161     89      279     15*            *
CPU only                67      0      0       67    10.2         6.57
Sleep, 4 threads        71    168      0      239    21.4        11.17
Large, 4 threads        68    172      0      240    21.4        11.21
iGPU only, sleep         0    173      0      173    13.5        12.81
iGPU only, large         0    175      0      175    13.5        12.96
dGPU only, sleep         0      0    123      123    1.3*            *
dGPU only, large         0      0    134      134    8.3*            *
Sleep, 4 threads, dG    42    153    119      314     15*            *
Custom**, 4 threads     41    155    120      316     15*            *

*dGPU power is not included in the measurements
**Custom: "large" for the iGPU (the gen8 driver idles the CPU itself),
"sleep" for the dGPU
[1.043 efficiency improvement, "large" is optimal for iGPU]
[CPU+iGPU+dGPU: 1.133 performance improvement, "sleep" is optimal for dGPU]

"-bench" Intel UHD Graphics 620 [gen9] 1100 MHz (Core i5-8265U)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.14 [113,990,283 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.32 [187,106,455 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.92 [184,015,486 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:16.80 [166,416,580 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:16.80 [184,818,394 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:16.81 [188,636,921 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.61 [170,029,327 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.05 [189,540,521 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:17.02 [192,711,899 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.93 [174,570,008 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #8 (CL 2-pipe sleep) 192,711,899 keys/sec

"-bench" Intel UHD Graphics 630 [gen9] 1200 MHz (Core i7-9700K)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.96 [124,370,534 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.84 [186,580,220 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.76 [189,445,953 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:16.53 [172,042,275 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:16.10 [191,761,686 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:16.84 [192,842,719 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.59 [176,169,744 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.59 [183,669,420 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:16.57 [186,548,997 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.35 [169,087,725 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #5 (CL 2-pipe large) 192,842,719 keys/sec

"-bench" Intel HD Graphics 5500 [gen8] 900 MHz (Core i5-5200U)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.15 [9,209,485 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.06 [168,667,029 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.81 [168,043,318 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:17.03 [171,313,110 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:16.86 [173,663,198 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:17.06 [177,573,667 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.70 [176,852,285 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.51 [166,997,768 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:16.59 [168,755,292 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.64 [170,413,224 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #5 (CL 2-pipe large) 177,573,667 keys/sec

"-bench" NVidia GeForce 820M 2048 MB, ForceWare 382.05

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.20 [102,620,050 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.98 [129,678,653 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.95 [123,092,851 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:16.98 [78,567,847 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:17.03 [135,449,921 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:16.89 [128,422,603 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.43 [78,558,193 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.65 [127,347,752 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:16.10 [117,091,782 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.14 [71,550,849 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #4 (CL 1-pipe large) 135,449,921 keys/sec

bovine · 2021-02-23T21:37:48Z

is clWaitForEvents() the main contributor of the wasted CPU time that the sleeping is solving?

void234 · 2021-02-24T09:41:53Z

> is clWaitForEvents() the main contributor of the wasted CPU time that the sleeping is solving?

Yes.
