Fix compilation with FP16_QK_REDUCTION enabled. #962
As described in #806 and #936, setting the CMake build flag `FLASHINFER_GEN_USE_FP16_QK_REDUCTIONS` to "true" causes a build failure because `cuda_fp16.h` does not support a `constexpr` cast from the `__half` type to `float`. Note that this is not just a CMake/C++ configuration issue: it is also triggered in the flashinfer JIT compilation path, as reported in #915. A minimal reduction of the failure is sketched below.
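For context, a hypothetical minimal reduction of the failure (not taken from the PR; it only assumes that the `__half`-to-`float` conversion in `cuda_fp16.h` is not declared `constexpr`, which is the root cause described above):

```cpp
#include <cuda_fp16.h>  // __half, __float2half

// The __half -> float conversion path in cuda_fp16.h is a runtime
// intrinsic, not constexpr, so forcing it into a constant expression
// is rejected by the compiler:
constexpr float bad = static_cast<float>(__float2half(1.0f));
// error: expression must have a constant value / not a constant expression
```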
The PR fixes #806 and #936 by adding a modified version of the FP16 header from the FP16 library that supports `constexpr` versions of the conversion functions. To make the conversion functions `constexpr`, I am using `std::bit_cast`, which is the reason for bumping the required C++ standard to C++20; see the sketch after this paragraph.
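To illustrate the approach (a minimal sketch, not the PR's actual code: the real change is a modified copy of the FP16 library header, and `half_bits_to_float` is an illustrative name): `std::bit_cast` is the C++20, `constexpr`-friendly replacement for the `union`/`memcpy` type punning such conversions traditionally use, which lets the whole binary16-to-binary32 conversion run at compile time.

```cpp
#include <bit>      // std::bit_cast, constexpr since C++20
#include <cstdint>

// Convert raw IEEE binary16 bits to float entirely at compile time.
// Union-based punning is forbidden in constant expressions; std::bit_cast
// performs the same reinterpretation and is constexpr.
constexpr float half_bits_to_float(std::uint16_t h) {
  const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
  const std::uint32_t exp  = (h >> 10) & 0x1Fu;  // 5-bit exponent field
  const std::uint32_t man  = h & 0x3FFu;         // 10-bit mantissa field

  if (exp == 0x1Fu) {  // Inf / NaN: all-ones float exponent, widened mantissa
    return std::bit_cast<float>(sign | 0x7F800000u | (man << 13));
  }
  if (exp == 0u) {     // zero / subnormal: value is mantissa * 2^-24
    const float mag = static_cast<float>(man) * 0x1.0p-24f;
    return sign ? -mag : mag;
  }
  // Normal number: rebias exponent from 15 (binary16) to 127 (binary32)
  return std::bit_cast<float>(sign | ((exp + 112u) << 23) | (man << 13));
}

// Evaluated in a constant expression, which is exactly what the
// cuda_fp16.h conversion operator cannot do:
static_assert(half_bits_to_float(0x3C00u) == 1.0f);   // 1.0 in binary16
static_assert(half_bits_to_float(0xC000u) == -2.0f);  // -2.0 in binary16
```

In the actual header the input bits would come from the `__half` representation (presumably via `__half_raw::x`), so the kernels' compile-time reductions can use the conversion directly.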
With these changes I am able to build the C++ API with `FLASHINFER_GEN_USE_FP16_QK_REDUCTIONS` both ON and OFF.

Fixes #936
Fixes #806