Performance of thrust::sort with the OpenMP backend seems to be far worse than even the standard std::sort. The code in the "How to Reproduce" section compares the timings of sorting with std and thrust. I tested it on two machines, one with an AMD Ryzen 5950X and 32GB DDR4 memory, and the other with an Intel i7-1365U and 32GB DDR5 memory, both running Fedora 38. In the first case, std::sort sorts 100 000 000 floats in 32s and thrust performs the same task in ~87s. In the second case, the timings are ~31s and ~117s respectively. Looking at htop, thrust does seem to parallelize the work, as all cores are being used.
How to Reproduce
Save the following file as sorting_comparison.cpp:
By the way, I just checked with clang++ and libomp instead of g++ and libgomp, and the problem does not seem to be tied to a specific compiler or OpenMP implementation.
Hello @dexter2206 and thank you for reporting the issue!
I can reproduce it on a Threadripper:
std took: 34s
thrust took: 80s
Looking at the implementation, the issue is that the number of threads performing the merge shrinks at every merge step. Only 3 seconds out of 80 are spent sorting the per-thread partitions. In the final step, the merge is done serially by a single thread.
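To make the pattern concrete, here is a small single-threaded model of "sort the partitions, then merge neighbours pairwise". It is only an illustration and not Thrust's actual code: the function name `chunked_sort_tree_merge` and the chunk layout are assumptions. The returned vector counts how many independent merges exist in each round, which is an upper bound on how many threads could be busy in that round.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not Thrust's implementation.
// Sorts `data` by sorting `chunks` slices independently and then merging
// neighbouring slices pairwise. Returns the number of independent merges
// performed in each merge round.
std::vector<std::size_t> chunked_sort_tree_merge(std::vector<float>& data,
                                                 std::size_t chunks) {
    if (chunks == 0) chunks = 1;
    const std::size_t n = data.size();
    // Offset of the k-th chunk boundary (k ranges from 0 to chunks).
    auto bound = [&](std::size_t k) { return n * k / chunks; };

    // Phase 1: every chunk is independent, so `chunks` threads could work here.
    for (std::size_t c = 0; c < chunks; ++c) {
        std::sort(data.begin() + bound(c), data.begin() + bound(c + 1));
    }

    // Phase 2: pairwise merges. The number of merges roughly halves each
    // round, and the last round is one merge over the whole array.
    std::vector<std::size_t> merges_per_round;
    for (std::size_t width = 1; width < chunks; width *= 2) {
        std::size_t merges = 0;
        for (std::size_t left = 0; left + width < chunks; left += 2 * width) {
            const std::size_t right = std::min(left + 2 * width, chunks);
            std::inplace_merge(data.begin() + bound(left),
                               data.begin() + bound(left + width),
                               data.begin() + bound(right));
            ++merges;
        }
        merges_per_round.push_back(merges);
    }
    assert(std::is_sorted(data.begin(), data.end()));
    return merges_per_round;
}
```

With 8 chunks the rounds contain 4, 2 and 1 merges, so in this model the last round touches every element while only one merge is available to run.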
I think we should experiment with the merge path approach, so that all threads are utilized during the merge phases. Apart from that, we should give radix sort a try.
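For reference, a minimal sketch of the merge path idea, assuming two already sorted `std::vector<float>` inputs without NaNs. The names `merge_path_split` and `merge_path_merge` are made up for this example and are not Thrust APIs. A binary search along a "diagonal" of the merge finds how many elements of each input belong to the first `diag` outputs, so every part can merge its own disjoint output range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns how many elements of `a` are among the first
// `diag` elements of merge(a, b). Ties go to `a`, as in std::merge.
std::size_t merge_path_split(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::size_t diag) {
    assert(diag <= a.size() + b.size());
    std::size_t lo = diag > b.size() ? diag - b.size() : 0;
    std::size_t hi = std::min(diag, a.size());
    while (lo < hi) {
        const std::size_t i = lo + (hi - lo) / 2;  // candidate count from a
        const std::size_t j = diag - i;            // matching count from b
        // If a[i] <= b[j - 1], a[i] precedes b[j - 1] in the merged output,
        // so the split must take more elements from a.
        if (a[i] <= b[j - 1]) lo = i + 1; else hi = i;
    }
    return lo;
}

// Hypothetical driver: merges sorted `a` and `b` using `parts` independent
// chunks of (almost) equal output size.
std::vector<float> merge_path_merge(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t parts) {
    if (parts == 0) parts = 1;
    std::vector<float> out(a.size() + b.size());
    const std::size_t total = out.size();
    const long long count = static_cast<long long>(parts);
    // Each iteration writes a disjoint output range, so the loop can run in
    // parallel. The pragma is ignored when OpenMP is not enabled.
    #pragma omp parallel for
    for (long long p = 0; p < count; ++p) {
        const std::size_t d0 = total * static_cast<std::size_t>(p) / parts;
        const std::size_t d1 = total * static_cast<std::size_t>(p + 1) / parts;
        const std::size_t i0 = merge_path_split(a, b, d0);
        const std::size_t i1 = merge_path_split(a, b, d1);
        std::merge(a.begin() + i0, a.begin() + i1,
                   b.begin() + (d0 - i0), b.begin() + (d1 - i1),
                   out.begin() + d0);
    }
    return out;
}
```

Unlike a pairwise merge tree, the number of parts here is independent of the number of input runs, so a single large merge can still be spread over all threads.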
Is this a duplicate?
Type of Bug
Performance
Component
Thrust
Describe the bug
Performance of thrust::sort with the OpenMP backend seems to be far worse than even the standard std::sort. The code in the "How to Reproduce" section compares the timings of sorting with std and thrust. I tested it on two machines, one with an AMD Ryzen 5950X and 32GB DDR4 memory, and the other with an Intel i7-1365U and 32GB DDR5 memory, both running Fedora 38. In the first case, std::sort sorts 100 000 000 floats in 32s and thrust performs the same task in ~87s. In the second case, the timings are ~31s and ~117s respectively. Looking at htop, thrust does seem to parallelize the work, as all cores are being used.
How to Reproduce
Save the reproduction program as sorting_comparison.cpp, build it, and run ./sorting-comparison; it should print timings for std::sort and thrust::sort to stdout.
Expected behavior
I would expect thrust::sort to be at least as fast as std::sort.
Reproduction link
No response
Operating System
Fedora 38
nvidia-smi output
N/A
NVCC version
N/A