
[BUG]: Low performance of sorting with OMP backend #1244

Open · 1 task done
dexter2206 opened this issue Dec 28, 2023 · 2 comments

Labels
bug Something isn't working right.

Comments


dexter2206 commented Dec 28, 2023

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

The performance of thrust::sort with the OpenMP backend appears to be far worse than even standard std::sort. The code provided in the "How to Reproduce" section compares sorting timings for std and thrust. I tested it on two machines, one with an AMD Ryzen 5950X and 32 GB of DDR4 memory, the other with an Intel i7-1365U and 32 GB of DDR5 memory, both running Fedora 38. On the first machine, std::sort sorts 100,000,000 floats in ~32 s while thrust takes ~87 s for the same task. On the second, the timings are ~31 s and ~117 s respectively. Judging by htop, thrust does use some parallelism, as all cores are busy.

How to Reproduce

  1. Save the following file as sorting_comparison.cpp:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#include <thrust/sort.h>
#include <thrust/device_vector.h>

using namespace std;
using namespace std::chrono;

const int VEC_SIZE = 100'000'000;

// Fill a vector with uniformly distributed random floats in [-10, 10).
vector<float> random_vector(int size) {
    vector<float> data(size);
    random_device rnd_device;
    mt19937 mersenne_engine{rnd_device()};
    uniform_real_distribution<float> dist{-10.0f, 10.0f};

    generate(begin(data), end(data),
             [&dist, &mersenne_engine]() { return dist(mersenne_engine); });

    return data;
}

int main() {
    vector<float> data = random_vector(VEC_SIZE);
    // Copy the same input so both sorts work on identical data.
    thrust::device_vector<float> data_for_thrust(data);

    auto start = high_resolution_clock::now();
    sort(begin(data), end(data));
    auto stop = high_resolution_clock::now();
    cout << "std took: " << duration_cast<seconds>(stop - start).count() << "s" << endl;

    // Sorting a device_vector dispatches to the device backend configured
    // at compile time (OpenMP here, via THRUST_DEVICE_SYSTEM).
    start = high_resolution_clock::now();
    thrust::sort(begin(data_for_thrust), end(data_for_thrust));
    stop = high_resolution_clock::now();
    cout << "thrust took: " << duration_cast<seconds>(stop - start).count() << "s" << endl;

    return 0;
}
  2. Make sure thrust is available in the include path and compile the file with:
    g++ -fopenmp -lgomp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP sorting_comparison.cpp -o sorting-comparison
  3. Run ./sorting-comparison; it should print the timings for std::sort and thrust::sort to stdout.
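To double-check that step 2 actually selected the OpenMP device backend, a minimal sketch along these lines can be compiled with the same flags (the THRUST_DEVICE_SYSTEM macros come from Thrust's config headers, which any Thrust include pulls in):

#include <cstdio>
#include <omp.h>
#include <thrust/device_vector.h>  // brings in the THRUST_DEVICE_SYSTEM* macros

int main() {
    // THRUST_DEVICE_SYSTEM is fixed at compile time by the -D flag above.
    std::printf("OMP backend selected: %s\n",
                THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_OMP ? "yes" : "no");
    std::printf("OpenMP max threads: %d\n", omp_get_max_threads());
    return 0;
}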

Expected behavior

I would expect thrust::sort to be at least as fast as std::sort.

Reproduction link

No response

Operating System

Fedora 38

nvidia-smi output

N/A

NVCC version

N/A

@dexter2206 dexter2206 added the bug Something isn't working right. label Dec 28, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 28, 2023
dexter2206 (Author) commented

By the way, I just checked with clang++ and libomp instead of g++ and libgomp and observed the same behavior, so the problem does not seem to be tied to a specific compiler/OpenMP implementation.
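For reference, the analogous invocation would be along these lines (with clang++, -fopenmp links against libomp rather than libgomp; exact flags may vary by setup):

    clang++ -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP sorting_comparison.cpp -o sorting-comparison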

gevtushenko (Collaborator) commented

Hello @dexter2206 and thank you for reporting the issue!

I can reproduce it on a Threadripper:

std took: 34s
thrust took: 80s

Looking at the implementation, the issue is related to the shrinking number of threads that perform the merging. Only 3 seconds out of the 80 are spent sorting the per-thread partitions; after that, each merge round uses fewer threads, and the final merge is done serially by a single thread.

I think we should experiment with the merge path approach, so that all threads are utilized during the merge phases. Apart from that, we should give radix sort a try.
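To sketch the merge path idea (illustrative only; co_rank and merge_path_merge are hypothetical names, not the current Thrust implementation): split the merged output evenly across threads, have each thread binary-search the merge diagonal to find where its output slice begins in both inputs, then let each thread run an ordinary serial merge on its slice.

#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

// Co-rank: how many elements of sorted a[0..na) appear among the first
// `diag` elements of the merged output of a and sorted b[0..nb).
static std::size_t co_rank(std::size_t diag, const float* a, std::size_t na,
                           const float* b, std::size_t nb) {
    std::size_t lo = diag > nb ? diag - nb : 0;
    std::size_t hi = std::min(diag, na);
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;   // candidate: take i elements from a
        if (a[i] < b[diag - i - 1])           // a[i] belongs in the prefix too
            lo = i + 1;
        else
            hi = i;
    }
    return lo;
}

// Merge the sorted halves src[0..mid) and src[mid..n) into dst, with the
// output split evenly across all threads, so no thread goes idle.
static void merge_path_merge(const std::vector<float>& src, std::size_t mid,
                             std::vector<float>& dst) {
    const float* a = src.data();
    const float* b = src.data() + mid;
    const std::size_t na = mid, nb = src.size() - mid, n = src.size();
    #pragma omp parallel
    {
        const std::size_t t  = omp_get_thread_num();
        const std::size_t nt = omp_get_num_threads();
        const std::size_t out_lo = n * t / nt;        // this thread's slice
        const std::size_t out_hi = n * (t + 1) / nt;  // of the output
        const std::size_t a_lo = co_rank(out_lo, a, na, b, nb);
        const std::size_t a_hi = co_rank(out_hi, a, na, b, nb);
        std::merge(a + a_lo, a + a_hi,
                   b + (out_lo - a_lo), b + (out_hi - a_hi),
                   dst.begin() + out_lo);
    }
}

Each merge round then still does O(n) total work, but every thread stays busy instead of half of them going idle; dst has to be pre-sized to src.size() and swapped with src between rounds.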
