
[BUG]: Low performance of sorting with OMP backend #1244

Open · 1 task done
dexter2206 opened this issue Dec 28, 2023 · 2 comments

Labels
bug Something isn't working right.

Comments


dexter2206 commented Dec 28, 2023

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

The performance of thrust::sort with the OpenMP backend appears to be far worse than even standard std::sort. The code provided in the "How to Reproduce" section compares sorting timings for std and thrust. I tested it on two machines, one with an AMD Ryzen 5950X and 32 GB of DDR4 memory, the other with an Intel i7-1365U and 32 GB of DDR5 memory, both running Fedora 38. On the first machine, std::sort sorts 100,000,000 floats in ~32 s while thrust takes ~87 s for the same task. On the second, the timings are ~31 s and ~117 s respectively. Judging by htop, thrust does use some parallelism, as all cores are busy.

How to Reproduce

  1. Save the following file as sorting_comparison.cpp:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#include <thrust/sort.h>
#include <thrust/device_vector.h>

using namespace std;
using namespace std::chrono;

const int VEC_SIZE = 100'000'000;

// Fill a vector with uniformly distributed random floats in [-10, 10).
vector<float> random_vector(int size) {
    vector<float> data(size);
    random_device rnd_device;
    mt19937 mersenne_engine{rnd_device()};
    uniform_real_distribution<float> dist{-10.0f, 10.0f};

    generate(begin(data), end(data),
             [&dist, &mersenne_engine]() { return dist(mersenne_engine); });

    return data;
}

int main() {
    vector<float> data = random_vector(VEC_SIZE);
    // Copy the same input so both sorts work on identical data.
    thrust::device_vector<float> data_for_thrust(data);

    auto start = high_resolution_clock::now();
    sort(begin(data), end(data));
    auto stop = high_resolution_clock::now();
    cout << "std took: " << duration_cast<seconds>(stop - start).count() << "s" << endl;

    // Sorting a device_vector dispatches to the device backend configured
    // at compile time (OpenMP here, via THRUST_DEVICE_SYSTEM).
    start = high_resolution_clock::now();
    thrust::sort(begin(data_for_thrust), end(data_for_thrust));
    stop = high_resolution_clock::now();
    cout << "thrust took: " << duration_cast<seconds>(stop - start).count() << "s" << endl;

    return 0;
}
  2. Make sure thrust is available in the include path and compile the file with:
    g++ -fopenmp -lgomp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP sorting_comparison.cpp -o sorting-comparison
  3. Run ./sorting-comparison; it should print the timings for std::sort and thrust::sort to stdout.
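To double-check that step 2 actually selected the OpenMP device backend, a minimal sketch along these lines can be compiled with the same flags (the THRUST_DEVICE_SYSTEM macros come from Thrust's config headers, which any Thrust include pulls in):

#include <cstdio>
#include <omp.h>
#include <thrust/device_vector.h>  // brings in the THRUST_DEVICE_SYSTEM* macros

int main() {
    // THRUST_DEVICE_SYSTEM is fixed at compile time by the -D flag above.
    std::printf("OMP backend selected: %s\n",
                THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_OMP ? "yes" : "no");
    std::printf("OpenMP max threads: %d\n", omp_get_max_threads());
    return 0;
}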

Expected behavior

I would expect thrust::sort to be at least as fast as std::sort.

Reproduction link

No response

Operating System

Fedora 38

nvidia-smi output

N/A

NVCC version

N/A

@dexter2206 dexter2206 added the bug Something isn't working right. label Dec 28, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 28, 2023
dexter2206 (Author) commented

By the way, I just checked with clang++ and libomp instead of g++ and libgomp and observed the same behavior, so the problem does not seem to be tied to a specific compiler/OpenMP implementation.
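For reference, the analogous invocation would be along these lines (with clang++, -fopenmp links against libomp rather than libgomp; exact flags may vary by setup):

    clang++ -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP sorting_comparison.cpp -o sorting-comparison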

gevtushenko (Collaborator) commented

Hello @dexter2206 and thank you for reporting the issue!

I can reproduce it on a Threadripper:

std took: 34s
thrust took: 80s

Looking at the implementation, the issue is related to the shrinking number of threads that perform the merging. Only 3 seconds out of the 80 are spent sorting the per-thread partitions; after that, each merge round uses fewer threads, and the final merge is done serially by a single thread.

I think we should experiment with the merge path approach, so that all threads are utilized during the merge phases. Apart from that, we should give radix sort a try.
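To sketch the merge path idea (illustrative only; co_rank and merge_path_merge are hypothetical names, not the current Thrust implementation): split the merged output evenly across threads, have each thread binary-search the merge diagonal to find where its output slice begins in both inputs, then let each thread run an ordinary serial merge on its slice.

#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

// Co-rank: how many elements of sorted a[0..na) appear among the first
// `diag` elements of the merged output of a and sorted b[0..nb).
static std::size_t co_rank(std::size_t diag, const float* a, std::size_t na,
                           const float* b, std::size_t nb) {
    std::size_t lo = diag > nb ? diag - nb : 0;
    std::size_t hi = std::min(diag, na);
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;   // candidate: take i elements from a
        if (a[i] < b[diag - i - 1])           // a[i] belongs in the prefix too
            lo = i + 1;
        else
            hi = i;
    }
    return lo;
}

// Merge the sorted halves src[0..mid) and src[mid..n) into dst, with the
// output split evenly across all threads, so no thread goes idle.
static void merge_path_merge(const std::vector<float>& src, std::size_t mid,
                             std::vector<float>& dst) {
    const float* a = src.data();
    const float* b = src.data() + mid;
    const std::size_t na = mid, nb = src.size() - mid, n = src.size();
    #pragma omp parallel
    {
        const std::size_t t  = omp_get_thread_num();
        const std::size_t nt = omp_get_num_threads();
        const std::size_t out_lo = n * t / nt;        // this thread's slice
        const std::size_t out_hi = n * (t + 1) / nt;  // of the output
        const std::size_t a_lo = co_rank(out_lo, a, na, b, nb);
        const std::size_t a_hi = co_rank(out_hi, a, na, b, nb);
        std::merge(a + a_lo, a + a_hi,
                   b + (out_lo - a_lo), b + (out_hi - a_hi),
                   dst.begin() + out_lo);
    }
}

Each merge round then still does O(n) total work, but every thread stays busy instead of half of them going idle; dst has to be pre-sized to src.size() and swapped with src between rounds.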
