oneDNN v3.7 release notes #2481
# Performance Optimizations
## Intel Architecture Processors
* Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
* Improved performance of the int8 RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
* Improved performance of convolution and matmul primitives on processors with Intel AMX support.
* Improved performance of fp8 matmul primitives with bf16 and fp16 bias data types on processors with Intel AMX instruction set support.
* Improved performance of the int8 matmul primitive with fp16 output data type.
* Improved performance of the int8 depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
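The relaxed accumulation mode mentioned above is requested through primitive attributes. A minimal sketch of how a softmax primitive could opt into it follows; the engine index, tensor shape, and format tag are illustrative assumptions, not taken from the release notes, and f16 support depends on the target hardware:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Hypothetical 2D f16 tensor; shape and format are placeholders.
    memory::desc src_md({16, 1024}, memory::data_type::f16,
            memory::format_tag::ab);

    // Allow reduced-precision accumulation instead of the default
    // strict (f32) mode.
    primitive_attr attr;
    attr.set_accumulation_mode(accumulation_mode::relaxed);

    auto pd = softmax_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::softmax_accurate,
            src_md, src_md, /*axis=*/1, attr);
    auto softmax = softmax_forward(pd);
    return 0;
}
```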
## Intel Graphics Products
* Introduced initial optimizations for GPUs based on the Xe3 architecture.
* Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
* Improved performance of the following subgraphs with Graph API:
  * Scaled Dot-Product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
  * Scaled Dot-Product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
## AArch64-based Processors
# Functionality
* Introduced support for the `select` algorithm in the binary primitive. The functionality is optimized for Intel CPUs.
* Enabled support for the matmul primitive with grouped quantization of weights along the N dimension.
* Graph API: added new [`Select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html), and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations.
* Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
* Introduced support for grouped scales and zero points in the reorder primitive.
* Enabled support for 4D weight scales in the matmul primitive.
* Graph API: added support for quantized and non-quantized Gated MLP patterns.
* Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as the `e8m0` scales data type in matmul and reorder.
* [experimental] Extended microkernel API:
  * Introduced int4 quantization support.
  * Introduced fpmath mode API.
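The new `select` algorithm makes the binary primitive ternary: a condition tensor chooses elementwise between two sources. The sketch below assumes a three-source `binary::primitive_desc` constructor and an s8 condition tensor; both details, along with the shapes, are assumptions for illustration rather than statements from these notes:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Placeholder shapes; all tensors share one layout for simplicity.
    memory::desc src0_md({8, 128}, memory::data_type::f32,
            memory::format_tag::ab);
    memory::desc src1_md({8, 128}, memory::data_type::f32,
            memory::format_tag::ab);
    // Condition input; s8 is an assumed supported data type.
    memory::desc cond_md({8, 128}, memory::data_type::s8,
            memory::format_tag::ab);
    memory::desc dst_md({8, 128}, memory::data_type::f32,
            memory::format_tag::ab);

    // Elementwise: dst[i] = cond[i] ? src0[i] : src1[i]
    auto pd = binary::primitive_desc(eng, algorithm::binary_select,
            src0_md, src1_md, cond_md, dst_md);
    auto select = binary(pd);
    return 0;
}
```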
# Usability
* With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
* Improved verbose diagnostics to better identify issues during dispatching and during primitive and kernel creation for CPU and OpenCL-based GPU primitive implementations.
* Improved verbose diagnostics for Intel GPU driver compatibility issues.
* Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
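The improved diagnostics above are surfaced through the existing `ONEDNN_VERBOSE` environment variable. A minimal sketch, assuming an application binary `app` linked against oneDNN:

```sh
# Print dispatch diagnostics explaining why implementations were skipped.
ONEDNN_VERBOSE=dispatch ./app

# Maximum verbosity, including creation and execution information.
ONEDNN_VERBOSE=all ./app
```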
* Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP.
# Validation
* Extended benchdnn with support and validation for fp8 matmul patterns.
* Extended benchdnn with support for tensor tags in RNN primitive validation.
* Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
* Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
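Data type rewriting would be exercised from the benchdnn command line. The invocation below is a hypothetical sketch: the `--dt` knob and the JSON file name are assumptions, not flags confirmed by these notes, so consult the benchdnn graph driver documentation for the exact syntax:

```sh
# Rewrite the data types recorded in a graph test case to bf16
# before running it (pattern.json is a placeholder file name).
./benchdnn --graph --dt=bf16 --case=pattern.json
```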
# Deprecated Functionality
# Breaking Changes
* Updated minimal supported CMake version to 3.13 (was 2.8.12).
* Updated minimal supported GCC version to 8.0 (was 4.8).
* Updated minimal supported Clang version to 11.0 (was 3.0).
* Removed support for SYCL standard versions older than SYCL 2020.
# Thanks to these Contributors
This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.
[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/MAINTAINERS.md