You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
I propose modifying the MaxSmOccupancy function's inline specification from inline to force_inline (utilizing the _CCCL_FORCEINLINE) here. This change is crucial for enhancing the compatibility and functionality of projects that employ multiple shared libraries incorporating the Thrust/CUB libraries.
When utilizing Thrust/CUB across several shared libraries, it's possible to encounter a cudaErrorInvalidDeviceFunction error. This issue arises if the compiler fails to correctly inline the MaxSmOccupancy function. Our project structure comprises multiple libraries (e.g., libA.so and libB.so) that utilize Thrust/CUB and are linked together. We discovered that MaxSmOccupancy is implemented within libA.so. However, when invoking Thrust functions in libB.so, the kernel pointer (kernel_ptr), which is a Thrust device function within libB.so, is passed to and queried by MaxSmOccupancy in libA.so. This operation is problematic within CUDA, which triggers cudaErrorInvalidDeviceFunction because passing the function pointer of a device function between libraries is restricted by the CUDA Runtime API, given that CUlibrary structures of CUDA Driver are opaque and managed by CUDA Runtime.
To address this issue and prevent the cudaErrorInvalidDeviceFunction, it's imperative to ensure that MaxSmOccupancy is forcefully inlined. This adjustment ensures that the cudaOccupancyMaxActiveBlocksPerMultiprocessor function is invoked within the same library that calls the Thrust function, thereby circumventing the aforementioned error.
Thanks for the excellent write up! We've dealt with countless insidious issues that originate from the interplay between symbol visibility across shared libraries and how kernel registration works in the CUDA Runtime. It's been a nasty problem that we've hoped was finally put to rest. You can read about the saga here #443.
I think you may have just identified an area that we missed 🙁.
Similar to how in #443 we had to decorate the thrust::cuda_cub::launcher::triple_chevron kernel launch function with _LIBCUDACXX_HIDDEN (which is ultimately just __attribute__((visibility(hidden)))) to avoid symbol collisions across shared objects, it would seem we need to do the same thing with cub::MaxSmOccupancy. Using forceinline as you suggest would probably work too, but the symbol visibility annotation is the more targeted solution to the real root of the problem.
I'll need @gevtushenko to confirm that this is indeed the right fix and then we'll try and take care of that ASAP.
@eee4017 thank you for reporting the issue! I agree with your analysis. Every function taking kernel pointers should be hidden. I think it goes beyond SM occupancy calculator and triple chevron launcher. In the CUB dispatch layer, we also have this issue that was likely masked by force inlining. Some places (segmented sort) missed force inline annotation, potentially leading to linkage issue. I've filed #1391 that hides all functions taking kernel pointers. Please, take a look if it addresses the issue for you.
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
I propose modifying the
MaxSmOccupancy
function'sinline
specification frominline
toforce_inline
(utilizing the_CCCL_FORCEINLINE
) here. This change is crucial for enhancing the compatibility and functionality of projects that employ multiple shared libraries incorporating the Thrust/CUB libraries.When utilizing Thrust/CUB across several shared libraries, it's possible to encounter a
cudaErrorInvalidDeviceFunction
error. This issue arises if the compiler fails to correctly inline theMaxSmOccupancy
function. Our project structure comprises multiple libraries (e.g.,libA.so
andlibB.so
) that utilize Thrust/CUB and are linked together. We discovered thatMaxSmOccupancy
is implemented withinlibA.so
. However, when invoking Thrust functions inlibB.so
, the kernel pointer (kernel_ptr
), which is a Thrust device function withinlibB.so
, is passed to and queried byMaxSmOccupancy
inlibA.so
. This operation is problematic within CUDA, which triggerscudaErrorInvalidDeviceFunction
because passing the function pointer of a device function between libraries is restricted by the CUDA Runtime API, given thatCUlibrary
structures of CUDA Driver are opaque and managed by CUDA Runtime.To address this issue and prevent the
cudaErrorInvalidDeviceFunction
, it's imperative to ensure thatMaxSmOccupancy
is forcefully inlined. This adjustment ensures that thecudaOccupancyMaxActiveBlocksPerMultiprocessor
function is invoked within the same library that calls the Thrust function, thereby circumventing the aforementioned error.Describe the solution you'd like
Change
inline
to_CCCL_FORCEINLINE
hereDescribe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: