
[BUG]: Potential race in merge sort due to PDL #3131

Closed
1 task done
gevtushenko opened this issue Dec 11, 2024 · 3 comments
Assignees
Labels
bug Something isn't working right.

Comments

@gevtushenko
Collaborator

gevtushenko commented Dec 11, 2024

Is this a duplicate?

Type of Bug

Silent Failure

Component

CUB

Describe the bug

#3114 introduced programmatic dependent launch (PDL) into device merge sort. I think it may cause data races. A dependent launch consists of two steps:

  1. the primary kernel invoking cudaTriggerProgrammaticLaunchCompletion(),
  2. and the secondary kernel invoking cudaGridDependencySynchronize()

This causes concurrent execution of these kernels:

[Image: execution timeline showing the primary and secondary kernels overlapping]
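For reference, the two-step handshake above can be sketched roughly as follows. This is a hypothetical primary/secondary kernel pair, not the actual CUB code; the intrinsics and the launch attribute are the CUDA runtime's PDL API (Hopper, sm_90+):

```cuda
#include <cuda_runtime.h>

__global__ void primary(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;  // produce data for the secondary kernel

    // Step 1: allow the dependent (secondary) grid to start launching.
    // Work placed after this call may overlap with the secondary kernel.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void secondary(const int* in, int* out)
{
    // Step 2: block until the entire primary grid has completed.
    cudaGridDependencySynchronize();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2;  // the primary grid's writes are now visible
}

void launch_pair(int* a, int* b, cudaStream_t stream)
{
    primary<<<64, 256, 0, stream>>>(a);

    // The secondary kernel must opt in to PDL via a launch attribute;
    // a plain <<<...>>> launch would serialize with the primary as usual.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim  = dim3(64);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, secondary, static_cast<const int*>(a), b);
}
```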

As written, merge sort now has the following structure:

  1. BlockSort (primary for partition):
    1. Dependency sync (why is it needed?)
    2. Load input keys (A)
    3. Trigger next launch
    4. Sort
    5. Store output keys (B)
  2. Partition (secondary for block sort / primary for merge):
    1. Dependency sync (waiting for block sort to trigger launch)
    2. Read output keys (B)
    3. Write partitions (P)
    4. No explicit triggering of the next launch, so happens implicitly on last block exit
  3. Merge kernel (secondary for previous partition / primary for next partition):
    1. Dependency sync (waiting for partition to finish)
    2. Read output keys (B, P)
    3. Trigger next launch
    4. Merge
    5. Write buffer keys (C)
  4. Partition
    1. Dependency sync
    2. Load buffer keys (C)
    3. Write partition (P)
    4. No explicit triggering of the next launch, so happens implicitly on last block exit
  5. ...

Since each pair of primary / secondary kernels is concurrent, we should have a data race between:

  • 2.ii and 1.v
  • 4.ii and 3.v
  • ...

We likely want to trigger dependent launch after we write the data, not before.

How to Reproduce

We likely have to find a workload with low enough occupancy that the last CTA of the primary kernel runs concurrently with the last CTA of the secondary kernel.

Expected behavior

No race in merge sort

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@bernhardmgruber
Contributor

  1. BlockSort (primary for partition):

    1. Dependency sync (why is it needed?)

So we can overlap BlockSort with a previous kernel (outside CUB). However, I just saw that I missed launching the kernel with the PDL flag.

  2. Partition (secondary for block sort / primary for merge):
    4. No explicit triggering of the next launch, so happens implicitly on last block exit

I tried adding one, but I always ended up crashing. Under compute-sanitizer, the bug disappeared. I discussed this with @ahendriksen, and it seems the workaround was to call __syncthreads() before triggering the next launch.
I documented this here:

// TODO(bgruber): if we put a call to cudaTriggerProgrammaticLaunchCompletion inside this kernel, the tests fail with

However, the entire kernel is divergent until it exits (because it branches based on whether the thread id is smaller than the problem size), so we could only trigger the next launch at the end of the kernel, which already happens implicitly. Therefore, such a call is missing.
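The workaround discussed above would look roughly like this. This is an illustrative sketch on a hypothetical kernel shaped like the partition kernel, not the actual CUB code:

```cuda
__global__ void partition_sketch(const int* keys, int* partitions, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // the kernel is divergent from here on
    {
        partitions[tid] = keys[tid];  // stand-in for the partition write
    }
    // Workaround: re-converge the block before signaling, so the trigger
    // is not issued from inside the divergent branch.
    __syncthreads();
    cudaTriggerProgrammaticLaunchCompletion();
}
```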

Since each pair of primary / secondary kernels is concurrent, we should have a data race between:

  • 2.ii and 1.v
  • 4.ii and 3.v
  • ...

We likely want to trigger dependent launch after we write the data, not before.

I think you may have fallen for the same misconception as I did, but @ahendriksen could help me out here:

At the end of a kernel, there is something called a "grid-ending membar". cudaGridDependencySynchronize waits for that membar to finish.

cudaGridDependencySynchronize does not wait for cudaTriggerProgrammaticLaunchCompletion, but for the end of the previous kernel. cudaTriggerProgrammaticLaunchCompletion allows the next kernel to ramp up, which then waits for the previous kernel to complete.
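In other words, a secondary kernel structured as below is safe, because the sync waits for the whole previous grid, not just for its trigger (hypothetical kernel, sketching the semantics described above):

```cuda
__global__ void secondary_sketch(const int* prev_output, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Preamble: work that does not touch the previous grid's output may
    // run here, overlapping with the tail of the previous kernel.

    // Waits for the previous grid's "grid-ending membar", i.e. for the
    // whole previous kernel to complete -- not merely for its
    // cudaTriggerProgrammaticLaunchCompletion() call.
    cudaGridDependencySynchronize();

    out[i] = prev_output[i];  // safe: all of the previous grid's writes are visible
}
```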

@bernhardmgruber
Contributor

The only data race I could introduce was by putting the cudaGridDependencySynchronize after reading the first data, with PDL enabled for that kernel. Then the kernel could start reading while the previous kernel is still writing.
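That racy placement would look roughly like this (hypothetical kernel, launched with the PDL attribute enabled):

```cuda
__global__ void racy_secondary(const int* prev_output, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = prev_output[i];          // RACE: the previous grid may still be writing
    cudaGridDependencySynchronize(); // too late: the read already happened
    out[i] = v;
}
```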

@gevtushenko
Collaborator Author

cudaGridDependencySynchronize does not wait for cudaTriggerProgrammaticLaunchCompletion, but for the end of the previous kernel.

My bad, I completely missed that. Thank you for elaborating!

@github-project-automation github-project-automation bot moved this from Todo to Done in CCCL Dec 12, 2024