[FEA]: Introduce cuda.cooperative overloads not requiring temporary storage #2527

Closed
gevtushenko opened this issue Oct 9, 2024 · 0 comments · Fixed by #2528
Assignees: gevtushenko
Labels: feature request

gevtushenko commented Oct 9, 2024

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

The cuda.cooperative API currently has an issue: we do not specify the alignment of the temporary storage. This leads to bugs like the following:

import numba
import numba.cuda
import cuda.cooperative.experimental as cudax  # import path assumed for the cuda.cooperative package

cx_int_sum = cudax.block.sum(numba.int32, 256)
int_sum_storage_bytes = cx_int_sum.temp_storage_bytes
cx_float_sum = cudax.block.sum(dtype=numba.float64, threads_in_block=256)
float_sum_storage_bytes = cx_float_sum.temp_storage_bytes

cx_files = cx_float_sum.files + cx_int_sum.files


@numba.cuda.jit(
    "void(int32[:], float64[:])",
    link=cx_files
)
def kernel(args_in, args_out):
    # Both temporary-storage buffers are allocated as uint8 arrays, i.e. with
    # 1-byte alignment; nothing guarantees the second one is suitably aligned.
    int_sum_storage = numba.cuda.shared.array(shape=int_sum_storage_bytes, dtype=numba.uint8)
    float_sum_storage = numba.cuda.shared.array(shape=float_sum_storage_bytes, dtype=numba.uint8)
    tix = numba.cuda.threadIdx.x
    # do the int reduction first, then the float reduction
    val = cx_int_sum(int_sum_storage, args_in[tix])
    val2 = cx_float_sum(float_sum_storage, numba.float64(args_in[tix]))
    args_out[tix] = val + val2

Because both shared memory allocations are made at uint8 granularity, the second one is not properly aligned, leading to:

cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMisalignedAddress: misaligned address
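
For concreteness, a minimal sketch of the arithmetic behind the failure, using hypothetical sizes and assuming the two uint8 buffers end up packed back to back in shared memory (the real sizes and layout come from CUB and the compiler):

# Illustrative only: hypothetical sizes, assuming the two uint8 temp-storage
# buffers are laid out contiguously in shared memory.
int_sum_storage_bytes = 4     # hypothetical size of the int32 reduction's storage
required_alignment = 8        # hypothetical alignment the float64 reduction expects

float_sum_storage_offset = int_sum_storage_bytes  # second buffer starts right after the first
# 4 % 8 != 0 -> the float64 temp storage is misaligned, hence cudaErrorMisalignedAddress
print(float_sum_storage_offset % required_alignment != 0)  # True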

Describe the solution you'd like

The majority of kernels do not create temporary storage unions, so we could simplify the API by not requiring the user to pass temporary storage explicitly:

    block_reduce = cudax.block.reduce(numba.int32, threads_in_block, op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        block_output = block_reduce(input[cuda.threadIdx.x])
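
A slightly fuller sketch of how the storage-free overload might be used end to end, assuming it lands with the call shape shown above (the single-argument device-side call is hypothetical until #2528; the constructor, .files, and launch setup mirror the repro):

import numpy as np
import numba
from numba import cuda
import cuda.cooperative.experimental as cudax  # import path assumed, as in the repro

threads_in_block = 256
# Constructor as used in the repro above; only the device-side call changes.
block_sum = cudax.block.sum(numba.int32, threads_in_block)

# Any linker setup required for link= (e.g. pynvjitlink patching) is assumed
# to be in place, exactly as it would be for the repro.
@cuda.jit(link=block_sum.files)
def kernel(data, result):
    # Proposed overload: no user-managed shared temporary storage.
    block_output = block_sum(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        result[0] = block_output  # CUB block reductions return the total on thread 0

data = cuda.to_device(np.ones(threads_in_block, dtype=np.int32))
result = cuda.device_array(1, dtype=np.int32)
kernel[1, threads_in_block](data, result)
print(result.copy_to_host()[0])  # expected: 256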

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Oct 9, 2024
@gevtushenko gevtushenko self-assigned this Oct 9, 2024