What's Changed
- ci: add option to skip nvbench build by @guocuimi in #390
- ci: build devel image with cuda 12.8 for blackwell by @guocuimi in #391
- kernel: added query packing support for attention by @guocuimi in #392
- refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
- kernel: added triton aot compiler by @guocuimi in #394
- kernel: generate smaller kernel instantiations by @guocuimi in #395
- kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
- upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
- kernel: added simple MLA kernel by @guocuimi in #396
- kernel: added pipeline support for mla by @guocuimi in #399
- kernel: added ping-pong rmem support for MLA by @guocuimi in #400
- kernel: revert experimental TiledMMA separation change by @guocuimi in #401
- kernel: put query always in registers for mha by @guocuimi in #402
- kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
- kernel: revert mla ping-pong rmem change by @guocuimi in #404
- kernel: refactor mask logic to avoid using hard-coded stride by @guocuimi in #405
- kernel: added causal mask for MLA kernel by @guocuimi in #406
- kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100kb smem by @guocuimi in #407
- kernel: fix mask bugs for MLA by @guocuimi in #408
- kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
- kernel: added stage support for MLA kernel by @guocuimi in #410
- misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
- kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
- kernel: optimize mask loop for MLA kernel by @guocuimi in #414
- kernel: added paged kv support for MLA kernel by @guocuimi in #415
- kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
- kernel: use FastDivmod in attention kernels by @guocuimi in #417
Full Changelog: v0.2.3...v0.2.4