
[BUG]: Potential race in merge sort due to PDL #3131

Closed
1 task done
gevtushenko opened this issue Dec 11, 2024 · 3 comments
Assignees
Labels
bug Something isn't working right.

Comments

@gevtushenko
Collaborator

gevtushenko commented Dec 11, 2024

Is this a duplicate?

Type of Bug

Silent Failure

Component

CUB

Describe the bug

#3114 introduced programmatic dependent launch (PDL) into device merge sort. I think it may cause data races. A dependent launch consists of two steps:

  1. the primary kernel invoking cudaTriggerProgrammaticLaunchCompletion(),
  2. and the secondary kernel invoking cudaGridDependencySynchronize()

This causes concurrent execution of these kernels:

[Image: execution timeline showing the primary and secondary kernels overlapping]
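For reference, the two-step handshake above can be sketched roughly as follows. This is a hypothetical primary/secondary kernel pair, not the actual CUB code; the intrinsics and the launch attribute are the CUDA runtime's PDL API (Hopper, sm_90+):

```cuda
#include <cuda_runtime.h>

__global__ void primary(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;  // produce data for the secondary kernel

    // Step 1: allow the dependent (secondary) grid to start launching.
    // Work placed after this call may overlap with the secondary kernel.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void secondary(const int* in, int* out)
{
    // Step 2: block until the entire primary grid has completed.
    cudaGridDependencySynchronize();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2;  // the primary grid's writes are now visible
}

void launch_pair(int* a, int* b, cudaStream_t stream)
{
    primary<<<64, 256, 0, stream>>>(a);

    // The secondary kernel must opt in to PDL via a launch attribute;
    // a plain <<<...>>> launch would serialize with the primary as usual.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim  = dim3(64);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, secondary, static_cast<const int*>(a), b);
}
```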

As written, merge sort now has the following structure:

  1. BlockSort (primary for partition):
    1. Dependency sync (why is it needed?)
    2. Load input keys (A)
    3. Trigger next launch
    4. Sort
    5. Store output keys (B)
  2. Partition (secondary for block sort / primary for merge):
    1. Dependency sync (waiting for block sort to trigger launch)
    2. Read output keys (B)
    3. Write partitions (P)
    4. No explicit triggering of the next launch, so happens implicitly on last block exit
  3. Merge kernel (secondary for previous partition / primary for next partition):
    1. Dependency sync (waiting for partition to finish)
    2. Read output keys (B, P)
    3. Trigger next launch
    4. Merge
    5. Write buffer keys (C)
  4. Partition
    1. Dependency sync
    2. Load buffer keys (C)
    3. Write partition (P)
    4. No explicit triggering of the next launch, so happens implicitly on last block exit
  5. ...

Since each pair of primary / secondary kernels is concurrent, we should have a data race between:

  • 2.ii and 1.v
  • 4.ii and 3.v
  • ...

We likely want to trigger dependent launch after we write the data, not before.

How to Reproduce

We likely have to find a workload with low enough occupancy that the last CTA of the primary kernel runs concurrently with the last CTA of the secondary kernel.

Expected behavior

No race in merge sort

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@bernhardmgruber
Contributor

  1. BlockSort (primary for partition):

    1. Dependency sync (why is it needed?)

So we can overlap BlockSort with a previous kernel (outside CUB). However, I just saw that I missed launching the kernel with the PDL flag.

  2. Partition (secondary for block sort / primary for merge):
    4. No explicit triggering of the next launch, so happens implicitly on last block exit

I tried adding one, but I always ended up crashing. Under compute-sanitizer, the bug disappeared. I discussed this with @ahendriksen, and it seems the workaround was to call __syncthreads() before triggering the next launch.
I documented this here:

// TODO(bgruber): if we put a call to cudaTriggerProgrammaticLaunchCompletion inside this kernel, the tests fail with

However, the entire kernel is divergent until it exits (because it branches based on whether the thread id is smaller than the problem size), so we could only trigger the next launch at the end of the kernel, which already happens implicitly. Therefore, such a call is missing.
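The workaround discussed above would look roughly like this. This is an illustrative sketch on a hypothetical kernel shaped like the partition kernel, not the actual CUB code:

```cuda
__global__ void partition_sketch(const int* keys, int* partitions, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // the kernel is divergent from here on
    {
        partitions[tid] = keys[tid];  // stand-in for the partition write
    }
    // Workaround: re-converge the block before signaling, so the trigger
    // is not issued from inside the divergent branch.
    __syncthreads();
    cudaTriggerProgrammaticLaunchCompletion();
}
```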

Since each pair of primary / secondary kernels is concurrent, we should have a data race between:

  • 2.ii and 1.v
  • 4.ii and 3.v
  • ...

We likely want to trigger dependent launch after we write the data, not before.

I think you may have fallen for the same misconception as I did, but @ahendriksen could help me out here:

At the end of a kernel, there is something called a "grid-ending membar". cudaGridDependencySynchronize waits for that membar to finish.

cudaGridDependencySynchronize does not wait for cudaTriggerProgrammaticLaunchCompletion, but for the end of the previous kernel. cudaTriggerProgrammaticLaunchCompletion allows the next kernel to ramp up, which then waits for the previous kernel to complete.
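In other words, a secondary kernel structured as below is safe, because the sync waits for the whole previous grid, not just for its trigger (hypothetical kernel, sketching the semantics described above):

```cuda
__global__ void secondary_sketch(const int* prev_output, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Preamble: work that does not touch the previous grid's output may
    // run here, overlapping with the tail of the previous kernel.

    // Waits for the previous grid's "grid-ending membar", i.e. for the
    // whole previous kernel to complete -- not merely for its
    // cudaTriggerProgrammaticLaunchCompletion() call.
    cudaGridDependencySynchronize();

    out[i] = prev_output[i];  // safe: all of the previous grid's writes are visible
}
```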

@bernhardmgruber
Contributor

The only data race I could introduce was by putting the cudaGridDependencySynchronize after reading the first data, with PDL enabled for that kernel. Then the kernel could start reading while the previous kernel is still writing.
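That racy placement would look roughly like this (hypothetical kernel, launched with the PDL attribute enabled):

```cuda
__global__ void racy_secondary(const int* prev_output, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = prev_output[i];          // RACE: the previous grid may still be writing
    cudaGridDependencySynchronize(); // too late: the read already happened
    out[i] = v;
}
```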

@gevtushenko
Collaborator Author

cudaGridDependencySynchronize does not wait for cudaTriggerProgrammaticLaunchCompletion, but for the end of the previous kernel.

My bad, I completely missed that. Thank you for elaborating!

@github-project-automation github-project-automation bot moved this from Todo to Done in CCCL Dec 12, 2024