oneDNN v3.7 release notes #2481
# Performance Optimizations

## Intel Architecture Processors

* Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX support (formerly Sapphire Rapids and Granite Rapids).
* Improved performance of `fp8` matmul primitives with `bf16` and `fp16` bias data types on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
* Improved performance of `int8` RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
* Improved performance of `int8` depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
* Improved `fp16` and `bf16` softmax performance with relaxed [accumulation mode].
* Improved performance of `int8` matmul primitive with `fp16` output data type.
* Improved performance of the following subgraphs with Graph API:
  * [Gated Multi-Layer Perceptron (Gated MLP)].

[accumulation mode]: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode

## Intel Graphics Products

* Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
* Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
* Improved performance of convolution with source zero points by pre-packing compensation.
* Improved performance of backward-by-data convolution with strides for large filters.
* Improved performance of the following subgraphs with Graph API:
  * Scaled Dot-Product Attention (SDPA) with [implicit causal mask].
  * SDPA with [`int8` or `int4` compressed key and value].
  * Gated MLP.

[implicit causal mask]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa
[`int8` or `int4` compressed key and value]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv
[Gated Multi-Layer Perceptron (Gated MLP)]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_gated_mlp.html#doxid-dev-guide-graph-gated-mlp
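For readers unfamiliar with the pattern: a causal mask restricts each query position to attend only to key positions at or before it, and "implicit" means the library derives the mask from tensor dimensions instead of reading a user-supplied mask tensor. A minimal sketch of the masking semantics (plain C++ for illustration, not oneDNN API code):

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Sketch of causal-mask semantics, not oneDNN code: position i may only
// attend to positions j <= i, so future scores are set to -inf and vanish
// after softmax (exp(-inf) == 0).
std::vector<std::vector<float>> apply_causal_mask(
        std::vector<std::vector<float>> scores) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < scores.size(); ++i)
        for (std::size_t j = i + 1; j < scores[i].size(); ++j)
            scores[i][j] = neg_inf;
    return scores;
}
```

Because the mask is a pure function of position, fusing it into the SDPA kernel avoids materializing and reading an explicit mask tensor.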
## AArch64-based Processors

# Functionality

## Common

* Introduced support for the `select` algorithm in the binary primitive. The functionality is optimized for Intel CPUs.
* Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs.
* Introduced initial support for the 4-bit floating-point data types `f4_e2m1` and `f4_e3m0`, as well as the `e8m0` scales data type, in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
* Introduced [`Select`], [`GenIndex`], and [`GreaterEqual`] operations in Graph API.

[`Select`]: https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html
[`GenIndex`]: https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html
[`GreaterEqual`]: https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html
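To make "grouped scales and zero-points" concrete: each group of consecutive weights along the quantized dimension shares one scale and one zero-point, a granularity between per-tensor and per-channel quantization. A minimal dequantization sketch (plain C++ with illustrative names, not the oneDNN API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of grouped dequantization, not oneDNN code: every `group_size`
// consecutive quantized weights share one scale and one zero-point.
std::vector<float> dequantize_grouped(const std::vector<int8_t> &w_q,
        const std::vector<float> &scales, const std::vector<int32_t> &zps,
        std::size_t group_size) {
    std::vector<float> w(w_q.size());
    for (std::size_t i = 0; i < w_q.size(); ++i) {
        const std::size_t g = i / group_size; // index of this weight's group
        w[i] = (static_cast<float>(w_q[i]) - zps[g]) * scales[g];
    }
    return w;
}
```

Smaller groups track local weight statistics more closely, which is why grouped quantization is popular for low-bit LLM weights.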
## Intel Architecture Processors

* Introduced support for `fp32` matmul with `fp16` and `bf16` weights.

## Intel Graphics Products

* Introduced stochastic rounding support for convolution, matmul, and reorder based on the Philox counter-based random number generator.
* Introduced support for strided memory formats in convolution.
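As background on the stochastic rounding feature: instead of always rounding to the nearest representable value, the result rounds up or down with probability proportional to proximity, so the rounding error is zero in expectation, which helps low-precision training. A toy sketch of the idea (plain C++, not oneDNN code; `std::mt19937` stands in for the Philox counter-based generator):

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Toy sketch of stochastic rounding, not oneDNN code: round x to one of the
// two neighboring grid points, picking the upper one with probability equal
// to x's fractional position in the interval. std::mt19937 is a stand-in for
// the Philox counter-based generator oneDNN actually uses.
float stochastic_round(float x, float step, std::mt19937 &rng) {
    const float lo = std::floor(x / step) * step;
    const float frac = (x - lo) / step; // in [0, 1): chance of rounding up
    std::uniform_real_distribution<float> u(0.f, 1.f);
    return u(rng) < frac ? lo + step : lo;
}
```

A counter-based generator like Philox suits GPUs because each element's random bits are computed independently from a (seed, counter) pair, with no sequential generator state to share across work-items.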
# Usability

## Common

* With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
* Added Graph API examples for [Gated MLP] and [`int4` Gated MLP] patterns.

[Gated MLP]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/examples/graph/gated_mlp.cpp
[`int4` Gated MLP]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/examples/graph/gated_mlp_int4.cpp
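The Gated MLP pattern those examples exercise computes `out = (act(x * W_gate) ⊙ (x * W_up)) * W_down`, where `⊙` is an elementwise product and `act` is typically SiLU/swish. A naive reference sketch of the math (plain C++, not the Graph API, which fuses this subgraph into optimized kernels):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<float>>;

// Naive dense matmul, only to keep the sketch self-contained.
Mat matmul(const Mat &a, const Mat &b) {
    Mat c(a.size(), std::vector<float>(b[0].size(), 0.f));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)
            for (std::size_t j = 0; j < b[0].size(); ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Gated MLP reference: out = (SiLU(x*W_gate) ⊙ (x*W_up)) * W_down.
// This sketch only shows the math the fused pattern implements.
Mat gated_mlp(const Mat &x, const Mat &w_gate, const Mat &w_up,
        const Mat &w_down) {
    Mat g = matmul(x, w_gate), u = matmul(x, w_up);
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = 0; j < g[i].size(); ++j) {
            const float v = g[i][j];
            g[i][j] = v / (1.f + std::exp(-v)) * u[i][j]; // SiLU(v) * u
        }
    return matmul(g, w_down);
}
```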
## Intel Architecture Processors

* Improved verbose diagnostics to better identify issues during dispatching, primitive, and kernel creation for Intel CPU and Intel GPU implementations.
* Enabled frame pointer support on Intel64 platforms to improve integration with profilers.

## Intel Graphics Products

* Improved verbose diagnostics for Intel GPU driver compatibility issues.
* Improved support for large tensors in convolution, matmul, and reduction primitives on Intel GPUs.
* Reduced scratchpad usage for NCHW convolution on Intel GPUs.

# Validation

* Extended benchdnn with support and validation for `fp8` matmul patterns.
* Extended benchdnn with support for tensor tags in RNN primitive validation.
* Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
* Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.

||
# Deprecated Functionality | ||
* Experimental [Graph Compiler] is deprecated and will be removed in future releases. | ||
|
||
[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_graph_compiler.html | ||
|
||
# Breaking Changes

* Updated the minimum supported CMake version to 3.13 (was 2.8.12).
* Updated the minimum supported GCC version to 8.0 (was 4.8).
* Updated the minimum supported Clang version to 11.0 (was 3.0).
* Removed support for SYCL standard versions older than SYCL 2020.
* Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. The previous behavior can be enabled with relaxed [accumulation mode].
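To illustrate why enforced `fp32` accumulation matters, here is a toy sketch using `float`/`double` as stand-ins for `f16`/`f32` (portable C++ lacks a native half type): every addition into a narrow accumulator rounds the partial sum, and the error compounds with the number of terms. The relaxed accumulation mode trades this accuracy back for speed.

```cpp
#include <cassert>
#include <cmath>

// Toy illustration only: float/double stand in for f16/f32 accumulators.
// Each += in the narrow type rounds the partial sum, so the error grows
// with the number of terms; the wide accumulator stays accurate.
float narrow_sum(int n, float term) {
    float acc = 0.f; // stand-in for an f16 accumulator
    for (int i = 0; i < n; ++i) acc += term;
    return acc;
}

double wide_sum(int n, float term) {
    double acc = 0.0; // stand-in for an f32 accumulator
    for (int i = 0; i < n; ++i) acc += term;
    return acc;
}
```

The effect is most pronounced for long reduction dimensions, which is exactly the matmul and inner product case the change targets.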

# Thanks to our Contributors

This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/MAINTAINERS.md