-
"ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference". Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. arXiv, October 2024.
-
"FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion". Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu. arXiv, June 2024.
-
"EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree". Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, Yuan Xie. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, April 2024.
-
"Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level". Ali Hassani, Wen-Mei Hwu, Humphrey Shi. arXiv, March 2024.
-
"A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library". Ganesh Bikshandi, Jay Shah. arXiv, December 2023.
-
"Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS". Xuanteng Huang, Xianwei Zhang, Panfei Yang, Nong Xiao. Journal of Applied Sciences, December 2023.
-
"A Speed Odyssey for Deployable Quantization of LLMs". Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie. arXiv, November 2023.
-
"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning". Tri Dao. Technical Report, July 2023.
-
"MegaBlocks: Efficient Sparse Training with Mixture-of-Experts". Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia. Proceedings of the Sixth Machine Learning and Systems, May 2023.
-
"ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs". Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu. Proceedings of the 37th IEEE International Parallel & Distributed Processing Symposium (Best Paper), May 2023.
-
"A Framework for Fine-Grained Synchronization of Dependent GPU Kernels". Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, Olli Saarikivi. Computing Research Repository, May 2023.
-
"Graphene: An IR for Optimized Tensor Computations on GPUs". Bastian Hagedorn, Bin Fan, Hanfeng Chen, Cris Cecka, Michael Garland, Vinod Grover. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, March 2023.
-
"Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search". Clemens JS Schaefer, Elfie Guo, Caitlin Stanton, Xiaofan Zhang, Tom Jablin, Navid Lambert-Shirzad, Jian Li, Chiachen Chou, Siddharth Joshi, Yu Emma Wang. arXiv, February 2023.
-
"Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism". Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, Yuan Xie. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023.
-
"Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU". Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. arXiv, January 2023.
-
"GPU Load Balancing". Muhammad Osama. Doctoral dissertation, University of California, Davis, December 2022.
-
"Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production". Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla. Proceedings of the Third Workshop on Simple and Efficient Natural Language Processing, December 2022.
-
"Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance". Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, Yibo Zhu. Proceedings of the 5th MLSys Conference, August 2022.
-
"Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance". Hiroyuki Ootomo, Rio Yokota. International Journal of High Performance Computing, March 2022.
-
"Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads". Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, Olli Saarikivi. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, February 2022.
-
"Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs". Jack Kosaian, K. V. Rashmi. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2021.
-
"Real-time Neural Radiance Caching for Path Tracing". Thomas Muller, Fabrice Rousselle, Jan Novak, Alex Keller. ACM Trans. Graph., August 2021.
-
"Scalable Knowledge Graph Analytics at 136 Petaflop/s". Ramakrishnan Kannan, Piyush Sao, Hao Lu, Drahomira Herrmannova, Vijay Thakkar, Robert Patton, Richard Vuduc, Thomas Potok. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020.
-
"Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity". Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020.
-
"Strassen's Algorithm Reloaded on GPUs". Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn. ACM Transactions on Mathematical Software, March 2020.