"Large" and "sleep" versions of "CL N-pipe" #17
Open: void234 wants to merge 1 commit into dcti:master from void234:opencl-sleep
Polling the GPU for results is inefficient: it wastes CPU time and, for dGPUs, PCIe bandwidth, especially when the CPU is powerful and the (i)GPU is not. The original "CL N-pipe" cores and the OpenCL kernels are not touched, but the scheduling code is modified to permit 100 times larger work units ("CL 1-pipe large" etc.) and also to flush the assignment to the GPU and put the CPU to sleep ("CL 1-pipe sleep" etc.).

"Large" cores are marginally faster than the original ones. "Sleep" cores are slightly slower than "large" ones because the GPU may sometimes finish processing a work unit while the CPU is still asleep. The "sleep" cores, however, consume zero CPU. All other cores consume 1 logical CPU unless the GPU driver performs the sleep transparently; Intel does this for gen8 but not for newer GPUs, and it helps only if the work unit is large enough for the CPU to sleep for several milliseconds. This results in higher power efficiency and, if we are not limited by TDP, a significant performance improvement. The effect is more pronounced when the CPU does not support MT.

Note that with "sleep" cores there is no need to manually limit the number of threads for the CPU cruncher.

Performance/efficiency can be further improved by growing the work unit size faster. Wider testing and benchmarking (especially on high-end GPUs) are welcome.

The benchmarks below were performed with the CPU loaded with 2.9116.525-amd64 core #4 (YK AVX2). The CUDA client is 2.9110.519b, core #10 (CUDA 1-pipe 64-thd sleep 100us). "521" refers to 2.9112.521 dnetc-win32-x86-opencl.zip / dnetc-linux-amd64-opencl.tar.gz. Power consumption is "measured" with "Core Temp" / "s-tui".

Core i5-8265U (15W, 4C8T, 14 nm, 1.6-3.9 GHz, Intel UHD Graphics 620 [gen9] 1100 MHz), Ubuntu 20.04
CL 2-pipe/large/sleep

| Mode | CPU | iGPU | Summary | Power | Efficiency |
|---|---|---|---|---|---|
| 521, 7 threads | 124 | 150 | 274 | 15 | 18.27 |
| 521, 8 threads | 127 | 150 | 277 | 15 | 18.47 |
| 521, iGPU only | 0 | 184 | 184 | 15 | 12.27 |
| CPU only | 181 | 0 | 181 | 15 | 12.07 |
| Sleep, 8 threads | 135 | 148 | 283 | 15 | 18.87 |
| iGPU only, sleep | 0 | 186 | 186 | 15 | 12.40 |
| iGPU only, large | 0 | 186 | 187 | 15 | 12.47 |

[1.022 efficiency improvement, "sleep" is optimal]

Core i7-9700K (95W, 8C8T, 14 nm, 3.6-4.9 GHz, Intel UHD Graphics 630 [gen9] 1200 MHz), Windows 10 20H2
CL 2-pipe/large/sleep

| Mode | CPU | iGPU | Summary | Power | Efficiency |
|---|---|---|---|---|---|
| 521, 8 threads | 480 | 92 | 572 | 95 | 6.02 |
| 521, 7 threads | 406 | 187 | 593 | 95 | 6.24 |
| Sleep, 8 threads | 457 | 178 | 635 | 95 | 6.68 |
| Sleep, 7 threads | 403 | 188 | 591 | 95 | 6.22 |
| CPU only | 473 | 0 | 473 | 95 | 4.98 |
| iGPU only, sleep | 0 | 188 | 188 | 22 | 8.55 |
| iGPU only, large | 0 | 190 | 190 | 44 | 4.32 |

Note the terrible power efficiency of polling ("large" vs "sleep").

[1.071 efficiency improvement, "sleep" is optimal]

Core i5-5200U (15W, 2C4T, 14 nm, 2.2-2.7 GHz, Intel HD Graphics 5500 [gen8] 900 MHz)
NVidia GeForce 820M 2048 MB, ForceWare 382.05
Windows 10 20H2
CL 4-pipe/large/sleep

| Mode | CPU | iGPU | dGPU | Summary | Power* | Efficiency* |
|---|---|---|---|---|---|---|
| 521, 4 threads | 66 | 59 | 0 | 125 | 15 | 8.33 |
| 521, 3 threads | 63 | 167 | 0 | 230 | 21.4 | 10.75 |
| 521, 3 threads, CUDA | 29 | 161 | 89 | 279 | 15* | * |
| CPU only | 67 | 0 | 0 | 67 | 10.2 | 6.57 |
| Sleep, 4 threads | 71 | 168 | 0 | 239 | 21.4 | 11.17 |
| Large, 4 threads | 68 | 172 | 0 | 240 | 21.4 | 11.21 |
| iGPU only, sleep | 0 | 173 | 0 | 173 | 13.5 | 12.81 |
| iGPU only, large | 0 | 175 | 0 | 175 | 13.5 | 12.96 |
| dGPU only, sleep | 0 | 0 | 123 | 123 | 1.3* | * |
| dGPU only, large | 0 | 0 | 134 | 134 | 8.3* | * |
| Sleep, 4 threads, dG | 42 | 153 | 119 | 314 | 15* | * |
| Custom**, 4 threads | 41 | 155 | 120 | 316 | 15* | * |

*dGPU is not included in power measurements

**Custom - "large" for iGPU (gen8 driver idles CPU itself), "sleep" for dGPU

[1.043 efficiency improvement, "large" is optimal for iGPU]
[CPU+iGPU+dGPU: 1.133 performance improvement, "sleep" is optimal for dGPU]

"-bench" Intel UHD Graphics 620 [gen9] 1100 MHz (Core i5-8265U)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe) 0.00:00:16.14 [113,990,283 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe) 0.00:00:16.32 [187,106,455 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe) 0.00:00:16.92 [184,015,486 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe) 0.00:00:16.80 [166,416,580 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large) 0.00:00:16.80 [184,818,394 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large) 0.00:00:16.81 [188,636,921 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large) 0.00:00:16.61 [170,029,327 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep) 0.00:00:16.05 [189,540,521 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep) 0.00:00:17.02 [192,711,899 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep) 0.00:00:16.93 [174,570,008 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #8 (CL 2-pipe sleep) 192,711,899 keys/sec

"-bench" Intel UHD Graphics 630 [gen9] 1200 MHz (Core i7-9700K)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe) 0.00:00:16.96 [124,370,534 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe) 0.00:00:16.84 [186,580,220 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe) 0.00:00:16.76 [189,445,953 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe) 0.00:00:16.53 [172,042,275 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large) 0.00:00:16.10 [191,761,686 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large) 0.00:00:16.84 [192,842,719 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large) 0.00:00:16.59 [176,169,744 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep) 0.00:00:16.59 [183,669,420 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep) 0.00:00:16.57 [186,548,997 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep) 0.00:00:16.35 [169,087,725 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #5 (CL 2-pipe large) 192,842,719 keys/sec

"-bench" Intel HD Graphics 5500 [gen8] 900 MHz (Core i5-5200U)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe) 0.00:00:16.15 [9,209,485 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe) 0.00:00:16.06 [168,667,029 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe) 0.00:00:16.81 [168,043,318 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe) 0.00:00:17.03 [171,313,110 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large) 0.00:00:16.86 [173,663,198 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large) 0.00:00:17.06 [177,573,667 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large) 0.00:00:16.70 [176,852,285 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep) 0.00:00:16.51 [166,997,768 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep) 0.00:00:16.59 [168,755,292 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep) 0.00:00:16.64 [170,413,224 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #5 (CL 2-pipe large) 177,573,667 keys/sec

"-bench" NVidia GeForce 820M 2048 MB, ForceWare 382.05

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe) 0.00:00:16.20 [102,620,050 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe) 0.00:00:16.98 [129,678,653 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe) 0.00:00:16.95 [123,092,851 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe) 0.00:00:16.98 [78,567,847 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large) 0.00:00:17.03 [135,449,921 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large) 0.00:00:16.89 [128,422,603 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large) 0.00:00:16.43 [78,558,193 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep) 0.00:00:16.65 [127,347,752 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep) 0.00:00:16.10 [117,091,782 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep) 0.00:00:16.14 [71,550,849 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #4 (CL 1-pipe large) 135,449,921 keys/sec
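For readers checking the tables, the Efficiency column and the bracketed improvement figures can be reproduced with simple arithmetic. The sketch below assumes that Efficiency is Summary divided by Power (throughput per watt) and that each improvement figure compares the best new row against the best "521" row of the same table. Both are inferences from the numbers, not something the description states, and the function names are illustrative only.

```python
def efficiency(summary, power_w):
    """Throughput per watt, rounded to two decimals like the Efficiency column.

    Assumption: Efficiency = Summary / Power.
    """
    return round(summary / power_w, 2)


def improvement(new_value, old_value):
    """Ratio of a new result to a baseline, rounded to three decimals."""
    return round(new_value / old_value, 3)


if __name__ == "__main__":
    # Core i5-8265U: "Sleep, 8 threads" vs "521, 8 threads", both at 15 W.
    print(efficiency(283, 15), efficiency(277, 15))
    print(improvement(283 / 15, 277 / 15))

    # Core i7-9700K: "Sleep, 8 threads" vs "521, 7 threads", both at 95 W.
    print(improvement(635 / 95, 593 / 95))

    # Core i5-5200U: "Large, 4 threads" vs "521, 3 threads", both at 21.4 W.
    print(improvement(240 / 21.4, 230 / 21.4))

    # CPU+iGPU+dGPU throughput: "Custom, 4 threads" vs "521, 3 threads, CUDA".
    print(improvement(316, 279))
```

Under these assumptions the computed ratios match the quoted 1.022, 1.071, 1.043 and 1.133. The last one is a throughput ratio rather than an efficiency ratio, because dGPU power is not included in the measurements.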
Is clWaitForEvents() the main contributor to the wasted CPU time that the sleeping addresses?
Yes.
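The trade-off discussed here can be illustrated with a small virtual-time model. This is only a sketch: it does not call OpenCL, and it is not the scheduling code from this pull request. The 100 µs sleep interval is borrowed from the name of the CUDA core mentioned in the description, and the 1 µs cost of a status check is an arbitrary assumption. All function and field names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def busy_poll(work_us, poll_cost_us=1):
    """Host spins on the completion status for the whole kernel run."""
    polls = ceil_div(work_us, poll_cost_us)
    detected_at = polls * poll_cost_us
    return {
        "status_checks": polls,
        "cpu_busy_us": detected_at,  # the CPU is occupied the entire time
        "overshoot_us": detected_at - work_us,
    }


def sleep_then_check(work_us, sleep_us=100, check_cost_us=1):
    """Host sleeps a fixed interval between status checks.

    Simplification: the time spent on the check itself is not added to
    the timeline, only to the CPU-busy total.
    """
    wakeups = ceil_div(work_us, sleep_us)
    detected_at = wakeups * sleep_us
    return {
        "status_checks": wakeups,
        "cpu_busy_us": wakeups * check_cost_us,
        # The device finished this long before the host noticed.
        "overshoot_us": detected_at - work_us,
    }


if __name__ == "__main__":
    work_us = 2350  # hypothetical kernel duration
    print("poll :", busy_poll(work_us))
    print("sleep:", sleep_then_check(work_us))
```

In this model the sleeping host spends far less CPU time than the polling one, but it notices completion up to one sleep interval late, which is consistent with the "sleep" cores being slightly slower than the "large" ones. The overshoot is bounded by the sleep interval, so it shrinks relative to the total as the work unit grows.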