- misc: addressing the package renaming issues by @yzh119 in flashinfer-ai#770
- feat: support deepseek prefill attention shape by @yzh119 in flashinfer-ai#765
- refactor: change the structure of attention updater by @yzh119 in flashinfer-ai#772
- hotfix: follow up of #772 by @yzh119 in flashinfer-ai#773
- bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in flashinfer-ai#774
- bugfix: fix the JIT warmup arguments in unittests by @yzh119 in flashinfer-ai#775
- ci: change whl folder to flashinfer-python by @abcdabcd987 in flashinfer-ai#779
- perf: refactor fa2 prefill template by @yzh119 in flashinfer-ai#776
- feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in flashinfer-ai#778
- bugfix: fix batch prefill attention kernel unittests by @yzh119 in flashinfer-ai#781
- misc: remove head dimension 64 from AOT by @yzh119 in flashinfer-ai#782
- misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in flashinfer-ai#783
- bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in flashinfer-ai#785
- refactor: make `group_size` a part of params by @yzh119 in flashinfer-ai#786
- bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in flashinfer-ai#787
- fix rope logic in mla decoding by @zhyncs in flashinfer-ai#793
- Fix arguments of `plan` for split QK/VO head dims by @abmfy in flashinfer-ai#795
- test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in flashinfer-ai#797
- bugfix: fix aot build not compatible with cmake command by @tsu-bin in flashinfer-ai#796
- Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in flashinfer-ai#798
- feat: support f32 attention output in FA2 template by @yzh119 in flashinfer-ai#799
- feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in flashinfer-ai#801
- bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in flashinfer-ai#803
- perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in flashinfer-ai#804
- bugfix: mla page-attention kernel for different page sizes by @yzh119 in flashinfer-ai#810
- doc: add documentation to new MLA interface by @yzh119 in flashinfer-ai#811
- feat: unlocking MLA for A100 by @yzh119 in flashinfer-ai#812
- feat: cudagraph-compatible MLA API by @yzh119 in flashinfer-ai#813
- feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in flashinfer-ai#814
- misc: fix sphinx by @abcdabcd987 in flashinfer-ai#815
- bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in flashinfer-ai#816
- doc: improve mla related documentation by @yzh119 in flashinfer-ai#818
- @abmfy made their first contribution in flashinfer-ai#795
- ci: fix the update_whl_index script to recognize version numbers with "post" and add torch2.5 by @yzh119 in flashinfer-ai#694
- bugfix: casting int array to int32 for rope input arguments by @yzh119 in flashinfer-ai#697
- bugfix: only use sm90 group gemm when torch cuda >= 12.3 by @yzh119 in flashinfer-ai#699
- misc: remove release-please workflow by @yzh119 in flashinfer-ai#705
- Customizable SM90 prefill kernels. by @hyhieu in flashinfer-ai#704
- hotfix: revert torch.library register by @yzh119 in flashinfer-ai#709
- Improve compatibility with pytorch 2.5 by @zifeitong in flashinfer-ai#711
- misc: add bibtex reference by @yzh119 in flashinfer-ai#712
- sampling: simplify min-p sampling by @yzh119 in flashinfer-ai#713
- perf: fix the iteration bound of SWA in FA2 prefill template by @yzh119 in flashinfer-ai#714
- bugfix: fix min-p AOT compilation in #713 by @yzh119 in flashinfer-ai#717
- Triton implementation of `silu_and_mul` by @nandor in flashinfer-ai#716
- bugfix: FusedAddRMSNorm kernels might require more than 48KB shared memory when d is large. by @bobboli in flashinfer-ai#718
- bugfix: Choose sm90 kernels only for Hopper GPUs. by @bobboli in flashinfer-ai#719
- Finer-grained control over fp16/fp8 builds by @nandor in flashinfer-ai#722
- Align KV chunk size binary search with actual KV chunk splitting. by @timzsu in flashinfer-ai#728
- ci: rename python package name to `flashinfer-python` by @yzh119 in flashinfer-ai#729
- Add a note about int32/int64 datatypes to the `kv_layout` tutorial by @fergusfinn in flashinfer-ai#737
- fix return type of cuBLAS by @zhyncs in flashinfer-ai#749
- [Refactor] Unify JIT/Customization/AOT mode by @yzh119 in flashinfer-ai#748
- Move allocations out of torch ops by @nandor in flashinfer-ai#740
- [Lint] Fix some linting issues and provide automatic format check script by @LeiWang1999 in flashinfer-ai#743
- Filter out unsupported head dim for sm90 by @abcdabcd987 in flashinfer-ai#751
- bugfix: various AOT issues by @abcdabcd987 in flashinfer-ai#752
- [bugfix] Fix cpp tests/benchmarks by @yzh119 in flashinfer-ai#753
- fix pin memory device by @youkaichao in flashinfer-ai#755
- Add dev container for easier development by @ByronHsu in flashinfer-ai#680
- hotfix: bugfix to #756 by @yzh119 in flashinfer-ai#757
- Change `apply_rope_with_cos_sin_cache` to accept `cos_sin_cache` by @ByronHsu in flashinfer-ai#754
- fix: match statement not supported in Python 3.8 by @xslingcn in flashinfer-ai#759
- bugfix: use actual sm count for num_sm90_ctas by @LLLLKKKK in flashinfer-ai#762
- bugfix: Fix block-sparse attention API by @yzh119 in flashinfer-ai#767
- Version bump: v0.2.0.post2 by @yzh119 in flashinfer-ai#768
- @hyhieu made their first contribution in flashinfer-ai#704
- @zifeitong made their first contribution in flashinfer-ai#711
- @bobboli made their first contribution in flashinfer-ai#718
- @timzsu made their first contribution in flashinfer-ai#728
- @fergusfinn made their first contribution in flashinfer-ai#737
- @LeiWang1999 made their first contribution in flashinfer-ai#743
- @youkaichao made their first contribution in flashinfer-ai#755
- @LLLLKKKK made their first contribution in flashinfer-ai#762
0.2.0.post1 (2024-12-22)
- bug fix on determine_attention_backend condition (#688) (bcf7a3e)
- accelerate plan speed of fa3 template (#690) (db8f04d)
0.2.0 (2024-12-17)
FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
- add `rotary_dim` argument to rope APIs for partial apply rope (#599) (eb9bc71)
- add a `use_softmax` field in variant class (#533) (d81af97)
- add an option `non_blocking` to plan function (#622) (560af6f)
- add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
- add group size 3 to GQA decode dispatch (#558) (6227562)
- add JIT compilation support for FA3 templates (#672) (d4e8d79)
- allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
- CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
- fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
- improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
- JIT compilation (#507) (3613a5b)
- modify group-gemm stage number (#497) (52dab1d)
- non-contiguous query with paged kv cache (#553) (89f2c4a)
- pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
- simplify prefill JIT compilation (#605) (fe4f898)
- specify gemm backend (#648) (0cc1a51)
- support cached cos/sin in rope APIs (#585) (83e541d)
- support huggingface transformer style rope interface (#568) (4f40420)
- support sm90 cutlass group gemm (#509) (794bdda)
- torch custom_op fix for rope (#569) (3e104bc)
- torch custom_op support: norm (#552) (f6e0010)
- torch.compile and custom_op support (#554) (9bf916f)
- warmup for jit kernel tests (#629) (8f5f349)
- AOT compiler flags on non-sm90 (#522) (0aa4726)
- batch decode kernel redundant store output to gmem (#505) (90e42a7)
- compatible with torch 2.2 (#478) (ac41d1b)
- flashinfer-ai#452 (b53a46f)
- remove redundant load (#495) (2de16b0)
- update bmm fp8 test (#487) (45eac04)
- accelerate JIT compilation speed (#618) (eaf73fd)
- Dense and sparse customizable flashattention-3 template (#667) (51236c9)
- fix prefill kernel performance degradation (step 1) (#602) (595cf60)
- fix the performance issue of `append_paged_kv_cache` (#588) (e15f7c9)
- improve parallelism in RoPE with pos_ids (#609) (ff05155)
- improve plan performance by using non-blocking memcpy (#547) (41ebe6d)
- reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
- reduce total_num_tiles_q by one (#644) (553ace5)
- remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
- speedup jit compilation of prefill attention kernels (#632) (a059586)
- use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
0.1.6 (2024-08-27)
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).
Starting from 0.1.6, the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` APIs.
`forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API. `begin_forward` is renamed to `plan`, which is consistent with the naming convention of the nvmath API. `end_forward` is deprecated and has no effect after this PR.
There are some slight differences between the old `forward` and the new `run` API:
- All extra arguments such as `causal` and `logits_soft_cap` are provided to the `plan` (previously `begin_forward`) API and cached until the next `plan` call; only the query and KV-Cache tensors need to be provided to the `run` API.
The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.
Check #466 for more details.
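As a rough illustration of the new workflow, here is a minimal sketch using the batch prefill wrapper with a paged KV-Cache. The wrapper choice, tensor shapes, and page bookkeeping below are illustrative assumptions rather than part of these release notes; consult the API reference for the exact argument names of your installed version.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 8, 8, 128, 16
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

# One request with 16 query tokens attending to 2 fully-filled KV pages.
qo_indptr = torch.tensor([0, 16], dtype=torch.int32, device="cuda")
kv_indptr = torch.tensor([0, 2], dtype=torch.int32, device="cuda")
kv_indices = torch.tensor([0, 1], dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([page_size], dtype=torch.int32, device="cuda")

# Extra arguments (e.g. causal) go into plan() and stay cached until the
# next plan() call.
wrapper.plan(
    qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    causal=True,
)

# run() only takes the query and the paged KV-Cache tensors.
q = torch.randn(16, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(2, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)  # (16, num_qo_heads, head_dim)
```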
Starting from 0.1.6, we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.
See the documentation and tutorial for API usage and a layout explanation.
The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.
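For concreteness, a minimal two-level sketch (one shared-prefix level plus one per-request level) might look like the following. The batch size, page layout, and index arrays are illustrative assumptions, and the example uses the `plan`/`run` naming described above.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 8, 8, 128, 16
batch_size = 4  # 4 decode requests sharing a 2-page prefix, 1 unique page each
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.MultiLevelCascadeAttentionWrapper(2, workspace, "NHD")

# Level 0: all 4 query tokens attend to the shared prefix (pages 0-1).
# Level 1: each query token attends to its own unique page (pages 2-5).
qo_indptr_arr = [
    torch.tensor([0, batch_size], dtype=torch.int32, device="cuda"),
    torch.arange(batch_size + 1, dtype=torch.int32, device="cuda"),
]
kv_indptr_arr = [
    torch.tensor([0, 2], dtype=torch.int32, device="cuda"),
    torch.arange(batch_size + 1, dtype=torch.int32, device="cuda"),
]
kv_indices_arr = [
    torch.tensor([0, 1], dtype=torch.int32, device="cuda"),
    torch.arange(2, 2 + batch_size, dtype=torch.int32, device="cuda"),
]
kv_last_page_len_arr = [
    torch.tensor([page_size], dtype=torch.int32, device="cuda"),
    torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda"),
]

wrapper.plan(
    qo_indptr_arr, kv_indptr_arr, kv_indices_arr, kv_last_page_len_arr,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)

# All levels index into one unified paged KV-Cache (6 pages in total).
q = torch.randn(batch_size, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
kv_cache = torch.randn(6, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)
```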
- sm75 support (#448, #449)
- add `MultiLevelCascadeAttentionWrapper` API (#462) (1e37989)
- add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
- support bmm fp8 (#469) (f1c0b68)
- refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` (#466)
- slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
- slight optimization on fragment layout swizzle (#458) (7c397cb)
- use persistent kernel for merging attention states (#459) (be6bf5b)
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for the API change suggestion, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.
0.1.5 (2024-08-13)
- resolve cu121 weird compile issue (#446) (5f0159e)
- Fix PagedPrefill python api and some typos (#441) (3fff008)
- fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)
We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.
0.1.4 (2024-08-09)
- append attention kernels for fp8 kv-cache (#420) (906c2f5)
- support min_p sampling (#422) (d52f2da)
- deterministic sampling (#417) (0dd801d)
- more sampling operator options (#431) (68df9c4)
- support fused add rmsnorm (#419) (b781513)
- support fused silu mul (#427) (ea0ba9a)
- fix dispatch fp16 type when enable fp8 (#430) (daa5566)
- improve numerical stability of sampling kernels (#429) (898d8ea)
We thank the community for their contributions and feedback: @comaniac, @esmeetu, @LiuXiaoxuanPKU, @peng1999, @xslingcn, @Yard1, @zhyncs.
0.1.3 (2024-07-31)
- bugfix: Fix cudagraph mode of BatchPrefillWithRaggedKVCacheWrapper (#412) (9907bc)
- fix cu118 cub usage for sampling kernels (#410) (58d359)
- enhance allocator error info and add shape check for prefill begin forward functions (#413) (5e36c5)
0.1.2 (2024-07-29)
- add llama 3.1 style rope (#401) (4c89dec)
- non-inplace rope operators (#405) (74ffba1)
- sliding window attention (#406) (28cffd3)
- support non-contiguous (packed) input for prefill kernels (#404) (68c3719)
0.1.1 (2024-07-20)
- fix the invalid kernel configuration for architectures with small shared memory size (#385) (cdac57)
0.1.0 (2024-07-17)
- Add mask to `merge_state_in_place` (#372) (e14fa81)
- expose pytorch api for block sparse attention (#375) (4bba6fa)
- Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)
0.0.9 (2024-07-12)
- fix decode kernels output for empty kv cache (#363) (ac72b1)
- check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)
- accelerate alibi (#365) (4f0a9f9)
- accelerate gqa performance (#356) (e56ddad)
- Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)
We thank @Yard1, @Ying1123 and @zhyncs for their contributions.
0.0.8 (2024-07-03)
- fix prefill/append kernel behavior for empty kv-cache (#353) (7adc8c)
- fix decode attention kernel with logits cap (#350) (f5f7a2)
0.0.7 (2024-06-28)
`batch_decode_with_padded_kv_cache` was removed; we encourage users to use `BatchDecodeWithPagedKVCacheWrapper` instead. (#343)
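A minimal sketch of the paged-decode replacement follows; it uses the `plan`/`run` naming that later replaced `begin_forward`/`forward` (see the 0.1.6 notes above), and the shapes and page bookkeeping are illustrative assumptions.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Two decode requests holding 2 and 1 KV pages; the last page of each request
# is partially filled (5 and 9 valid entries).
kv_indptr = torch.tensor([0, 2, 3], dtype=torch.int32, device="cuda")
kv_indices = torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([5, 9], dtype=torch.int32, device="cuda")

wrapper.plan(kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size)

q = torch.randn(2, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(3, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)  # one output token per request
```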
- fix the `forward_return_lse` function in `BatchPrefillWithRaggedKVCache` class (#337)
- fix the scheduler behavior of large page size (#333)
- change minimal `kv_chunk_size` back to 128 (#329) (f237f5f)
- more options for kv tile size (#336) (bf2a6c7)
0.0.6 (2024-06-21)
Fix some bugs in v0.0.5 that might lead to crashes and unstable performance.
0.0.5 (2024-06-20)
- Support any GQA group size for tensor-cores kernels.
- Support any page size for tensor-cores kernels.
- Support CUDA-Graph for prefill/decode APIs.
- Add an option to accelerate decode kernels with Tensor Cores.
- Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
- Support logits cap in Grok-1 models.
- Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
- PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)
We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.
- support any GQA group size for tensor-cores kernels (#301) (c111ca)
- support any page size for tensor-cores kernels (#306) (82fd8c)
- add `use_tensor_cores` option to decode kernels to accelerate GQA (#317) (3b50dd5)
- add group gemm operators (#282) (e08ba42)
- initial support of distributed operators (#289) (03553da)
- initial support of logits hook (#298) (ab1e2ad)
- Separate Q and KV dtypes for decode (#286) (5602659)
- support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
- support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
- support custom attention mask in prefill/append attention kernels (#266) (7304282)
- fused speculative sampling kernels (#259) (cea2bb)
- expose sampling APIs in pytorch (#238) (092902)
- initial cuda graph support (#256) (7e9cc7f)
- split kv-cache for prefill/append kernels (#310) (f0bb0a3)
- use packed bit array for attention mask (#308) (3d43dc9)
0.0.4 (2024-05-01)
- pytorch 2.3 support
- gpu sampling kernels (top-p, top-k)
- more gqa group sizes
- add mma instructions for fp8 (#179) (d305798)
- mma rowsum for fp8 (#180) (5af935c)
- support any num_heads for get_alibi_slope (#200) (b217a6f)
0.0.3 (2024-03-08)
- adding `sm_scale` field for all attention APIs (#145) (85d4018)
- enable `head_dim=256` for attention kernels (#132) (0372acc)
- pytorch api of fp8 kv-cache (#156) (66ee066)
- support ALiBi (#146) (383518b)
- bugfix to pr 135 (#136) (3d55c71)
- fix bugs introduced in #132 (#135) (9b7b0b9)
- fix FindThrust.cmake (#161) (30fa584)