[FEA]: Introduce cuda.cooperative overloads not requiring temporary storage #2527

Closed
gevtushenko opened this issue Oct 9, 2024 · 0 comments · Fixed by #2528
Assignees: gevtushenko
Labels: feature request

gevtushenko commented Oct 9, 2024

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

The cuda.cooperative API currently has an issue: we do not specify the alignment of the temporary storage. This leads to bugs like the following:

import numba
import numba.cuda
import cuda.cooperative.experimental as cudax  # import path assumed for the cuda.cooperative package

cx_int_sum = cudax.block.sum(numba.int32, 256)
int_sum_storage_bytes = cx_int_sum.temp_storage_bytes
cx_float_sum = cudax.block.sum(dtype=numba.float64, threads_in_block=256)
float_sum_storage_bytes = cx_float_sum.temp_storage_bytes

cx_files = cx_float_sum.files + cx_int_sum.files


@numba.cuda.jit(
    "void(int32[:], float64[:])",
    link=cx_files
)
def kernel(args_in, args_out):
    # Both temporary-storage buffers are allocated as uint8 arrays, i.e. with
    # 1-byte alignment; nothing guarantees the second one is suitably aligned.
    int_sum_storage = numba.cuda.shared.array(shape=int_sum_storage_bytes, dtype=numba.uint8)
    float_sum_storage = numba.cuda.shared.array(shape=float_sum_storage_bytes, dtype=numba.uint8)
    tix = numba.cuda.threadIdx.x
    # do the int reduction first, then the float reduction
    val = cx_int_sum(int_sum_storage, args_in[tix])
    val2 = cx_float_sum(float_sum_storage, numba.float64(args_in[tix]))
    args_out[tix] = val + val2

Because both shared memory allocations are made at uint8 granularity, the second one is not properly aligned, leading to:

cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMisalignedAddress: misaligned address
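
For concreteness, a minimal sketch of the arithmetic behind the failure, using hypothetical sizes and assuming the two uint8 buffers end up packed back to back in shared memory (the real sizes and layout come from CUB and the compiler):

# Illustrative only: hypothetical sizes, assuming the two uint8 temp-storage
# buffers are laid out contiguously in shared memory.
int_sum_storage_bytes = 4     # hypothetical size of the int32 reduction's storage
required_alignment = 8        # hypothetical alignment the float64 reduction expects

float_sum_storage_offset = int_sum_storage_bytes  # second buffer starts right after the first
# 4 % 8 != 0 -> the float64 temp storage is misaligned, hence cudaErrorMisalignedAddress
print(float_sum_storage_offset % required_alignment != 0)  # True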

Describe the solution you'd like

The majority of kernels do not create temporary storage unions, so we could simplify the API by not requiring the user to pass temporary storage explicitly:

    block_reduce = cudax.block.reduce(numba.int32, threads_in_block, op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        block_output = block_reduce(input[cuda.threadIdx.x])
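
A slightly fuller sketch of how the storage-free overload might be used end to end, assuming it lands with the call shape shown above (the single-argument device-side call is hypothetical until #2528; the constructor, .files, and launch setup mirror the repro):

import numpy as np
import numba
from numba import cuda
import cuda.cooperative.experimental as cudax  # import path assumed, as in the repro

threads_in_block = 256
# Constructor as used in the repro above; only the device-side call changes.
block_sum = cudax.block.sum(numba.int32, threads_in_block)

# Any linker setup required for link= (e.g. pynvjitlink patching) is assumed
# to be in place, exactly as it would be for the repro.
@cuda.jit(link=block_sum.files)
def kernel(data, result):
    # Proposed overload: no user-managed shared temporary storage.
    block_output = block_sum(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        result[0] = block_output  # CUB block reductions return the total on thread 0

data = cuda.to_device(np.ones(threads_in_block, dtype=np.int32))
result = cuda.device_array(1, dtype=np.int32)
kernel[1, threads_in_block](data, result)
print(result.copy_to_host()[0])  # expected: 256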

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Oct 9, 2024
@gevtushenko gevtushenko self-assigned this Oct 9, 2024