[BYOC] Add pass to merge composite functions to offload large subgraphs #372
@tvm.script.ir_module
class MergeCompilerRegionsExample:
    @R.function
This is an interesting example from https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830
add and relu: blue nodes
gelu: red nodes
BuildCompositeGroups(IRModule mod, support::Arena* arena)
    : mod_(mod), arena_(arena), default_group_(CompositeGroup{nullptr, kDefaultTarget}) {}

GroupMap Run(Function func) {
Definitely a nitpick, but I think `Run` is a potentially misleading name, since this is not running the function. It is more in line with most style guidelines to make class names nouns; if you use `CompositeGroupBuilder` for this class, then you could have the more descriptive `Build` for the method name.
Renamed to `CompositeGroupBuilder`, but I want to keep `Run` since I feel `CompositeGroupBuilder::Build(...)` is a bit verbose and for the other class in this file, `CompositeInliner`, `CompositeInliner::Inline(function)` is worse. I'd say "run" is pretty standard terminology, e.g. `runOnFunction`.
It's not a big deal :)
Thank you for the great contribution.
Overall, it looks good to me. (And thank you for catching the interesting bug!)
Please see my comments below.
93b706b to b451df7
d379c84 to 47f9fde
47f9fde to 71354c9
Thanks for reflecting my comments!
…hs (tlc-pack#372)
* Add FuseCompositeFunctions pass
* clean up and add doc / comments
* cpplint
* fix trt build
* FuseCompositeFunctions -> MergeCompositeFunctions
* rename
* add algo explanation
* Drop requirement on group being always a root in OperatorFusor
…subgraphs (#14062)

This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large a subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegions pass in Relay BYOC, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830.

Original PR tlc-pack/relax#372
Substantial improvement by @yelite tlc-pack/relax#411
Related fix PR by @yelite tlc-pack/relax#406

Co-authored-by: Lite Ye <[email protected]>
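To make the merging idea concrete, here is a minimal Python sketch of greedy group merging on a plain DAG. It is not the pass's actual implementation: `merge_composite_groups`, `depends_on`, and the dictionary-based graph encoding are hypothetical names chosen for illustration, and the sketch only lets a node join one existing group rather than fusing several groups at once. The key constraint it illustrates is that a node may join a producer's group only if doing so cannot create a cyclic dependency between groups.

```python
from collections import defaultdict


def depends_on(direct, src, dst):
    """Return True if group `src` transitively consumes an output of group `dst`."""
    stack, seen = [src], set()
    while stack:
        g = stack.pop()
        if g == dst:
            return True
        if g in seen:
            continue
        seen.add(g)
        stack.extend(direct[g])
    return False


def merge_composite_groups(order, inputs, target):
    """Greedy grouping sketch (hypothetical helper, not a TVM API).

    order:  node names in topological order
    inputs: node -> list of producer nodes
    target: node -> backend name, or None for nodes left on the host
    Returns a dict mapping node -> integer group id.
    """
    group_of = {}               # node -> group id
    group_target = {}           # group id -> backend name or None
    direct = defaultdict(set)   # group id -> groups it directly depends on
    next_id = 0

    for node in order:
        # Unique producer groups, in first-seen order.
        parent_groups = []
        for p in inputs.get(node, []):
            g = group_of[p]
            if g not in parent_groups:
                parent_groups.append(g)

        chosen = None
        if target[node] is not None:
            for g in parent_groups:
                if group_target[g] != target[node]:
                    continue
                others = [h for h in parent_groups if h != g]
                # Joining g is safe only if no other producer group already
                # depends on g; otherwise g -> h -> g would form a cycle.
                if not any(depends_on(direct, h, g) for h in others):
                    chosen = g
                    break

        if chosen is None:
            chosen = next_id
            next_id += 1
            group_target[chosen] = target[node]

        group_of[node] = chosen
        direct[chosen].update(h for h in parent_groups if h != chosen)

    return group_of


if __name__ == "__main__":
    # Diamond: a -> b, a -> c (host), and both b and c feed d.
    order = ["a", "b", "c", "d"]
    inputs = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    target = {"a": "tensorrt", "b": "tensorrt", "c": None, "d": "tensorrt"}
    print(merge_composite_groups(order, inputs, target))
```

In the diamond example, `a` and `b` end up in one group, but `d` cannot join them: its other input `c` runs on the host and already depends on that group, so merging would make the group depend on itself through `c`.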
* [microTVM] Fix tvmc tutorial (#14076) This PR applies appropriate changes to make sure the CI fails if micro_tvmc.sh tutorial fails. This issue was captured in #14074. This PR also makes changes to avoid this breakage in bash script tutorials in future. In addition, this PR fixes the bug in running TVMC tutorial which happened due to renaming zephyr_board to board. * [MetaSchedule] Introduce Async Pipeline in MultiLevelTiling (#14009) This PR introduces async pipeline in the current TVM's MultiLevelTiling Rules. This PR is based on apache/tvm#13966, which is already merged. This is because some conv2d workload will use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statement. 1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotate async copy for mlt in supported arch (>= sm80). In CUDA Core, this PR has a perf boost of around 1T GFLOP/s in most Conv2d test cases and 1T ~ 2T in most GEMM test cases. All generated codes, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark. Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` in conv2d NHWC workload, with a RTX 3080 GPU. 
**Notice: given the stochastic nature of evolutionary search, performance might become worse if this PR is enabled.**

Workload: Conv2d NHWC

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|

Workload: GEMM NN

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|

* [TVMScript] Use op attribute to control whether to print dtype in TVMScript (#14111) This PR adds an op attribute `TScriptDtypePrintLocation`, and modifies the dtype printing logic of the builtin op to check this attribute, so that user-defined operators can use it to specify how their dtype arguments are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.
* [Fix][TVMScript] Fix index of metadata in printed script (#14130) Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced more than once in an IRModule, each reference will have a different index into the metadata array. For example, this code
```
str_imm = T.StringImm("aaa\nbbb\n")

@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm

Module.show()
```
where `str_imm` is referenced three times, will generate such output:
```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```
Each reference has a different metadata index.
This PR fixes this problem by detecting duplicate item in `IRDocsifierNode::AddMetadata`. * [Pytorch] frontend full_impl fix (#14122) Minor fix in pytorch frontend to compile gpt2 model, reproduce script. torch_version = 1.13.1 transformers_version = 4.26.1 ``` from transformers import GPT2LMHeadModel import torch import tvm from tvm import relay inp = torch.ones((1, 128)).to(torch.int64) input_shapes = [("input_ids", ((1, 128), "int64"))] model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False) trace_model = torch.jit.trace(model, inp, strict=False) outputs = trace_model(inp) mod, params = relay.frontend.from_pytorch(trace_model, input_shapes) with tvm.transform.PassContext(opt_level=3): lib = relay.build(mod, target='llvm', params=params) runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0))) runtime.set_input("input_ids", inp.numpy()) runtime.run() out = runtime.get_output(0).numpy() print(out) print('Done...') ``` Before the fix, the error message ``` Traceback (most recent call last): File "gpt2_compile.py", line 13, in <module> mod, params = relay.frontend.from_pytorch(trace_model, input_shapes) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators relay_out = relay_op( File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full return self.full_impl(data, fill_value, dtype) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl fill_value = _expr.const(fill_value, dtype=dtype) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const raise ValueError("value has to be scalar or NDArray") ValueError: value has to be scalar or NDArray ``` because `fill_value` is ``` %0 = cast(64, 
dtype="float32"); power(%0, 0.5f) ``` * [DOCKER] Configurable NDK version support (#14000) Let the Android NDK version configurable as a command line argument * [Fix][TIR] SampleCategorical apply-to-schedule (#14133) This PR is another way to fix the issue described in #14118. Since we do not have a standard for json file on the format of float numbers (for example, we cannot require a json file producer to print the "integer" float numbers with at least one decimal), and the json parser is not responsible for determining if an integer in a json file should be parsed to a float or an int, the most convenient way of fixing the SampleCategorical issue will be allowing both FloatImms and IntImms as input, and converting all IntImms to FloatImms accordingly. This PR fixes the issue in this way. * [Arith] ConstIntBound was incorrectly assuming bounds were over int64… (#13918) [Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range This commit improved the floormod and floordiv conversion check to be simpler for the negative range and adds a test to cover all integer data types. * [CMSIS-NN] Reduction in code size of AOT test runner binary (#13815) * [CMSIS-NN] Reduction in code size of AOT test runner binary Co-authored-by: Ashutosh Parkhi <[email protected]> * [CMSIS-NN] Add a runtime error message (#13643) [CMSIS-NN] Add a runtime error message APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN errors caught in the backend. AOT test runner was improved to observe the contents of this global variable. A test was added to check for the last set error as part of this commit. * [CRT]Cleanup unused macros in crt_config.h.template (#14125) This PR removes old macros in crt_config.h.template. * [Fix][Relay] Fix axis transformation in squeeze shape function (#14135) * fix squeeze shape function issue and add testcase. 
* fix lint * [Unittest] merge test_cp_async_in_if_then_else into test_tir_transform_inject_ptx_async_copy (#14138) This PR merge two related unittests into one. * [Frontend][TFLite] Fix conv2d import bug (#14124) * Fix TFLite frontend bug and add test * lint * [ONNX][TORCH] Replace scatter op by scatter_elements (#14019) * remove scatter attr class * update pytorch: scatter was replaced by scatter_elements * remove scatter compute and strategy registration * remove scatter attrs registration * update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks * update oneflow front-end * update paddlepaddle front-end * update pytorch utils * remove front-end scatter definition * fix scatter strategy for rocm * small update * remove scatter definition in back-end * remove scatter strategy for cuda, gpu. transfer special case to scatter_elements * fix test * small fix * upstream scatter with torch description * last upstream of scatter in pytorch front-end * fix reduction attribute in cuda strategy * set scalar to test instead of tensor. update check for dynamic dim * skip scalar source check in tests for scatter due to issue on torch side * remove scatter op implementation from topi/cuda * remove scatter op implementation from topi. small clean code --------- Co-authored-by: Valery Chernov <[email protected]> * [TVMScript][Printer] Remove relax prefix for now (#14140) Remove relax prefix for now This PR cleans up relax prefix in printer for now. While these setups are useful and do not cause any technical debts in the codebase. We remove it given requests. They can be added back to unity branch and later as part of upstream * [microNPU] Sum legalization support (#13997) Supports legalizing a relay sum operation to an equivalent series of NPU operations. It supports case with int8 output type and channel axis. 
* [Fix][MetaSchedule] Fix redundant stages in async pipeline for mlt (#14143) This PR fixes redundant stages if visiting `InitializeWithTuneContext` multiple times. * [COMMUNITY] Cheng Wen -> Reviewer (#14153) Please join me @chengven027-intellif as a new Reviewer in TVM. Cheng has contributed to ONNX/PyTorch frontend and Relay passes, making TVM support more input models. - [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+) - [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+) * [Runtime] Fix high RAM usage when saving / loading paramters of big models (#14147) * add load_params_from_file * add save_params_to_file * avoid making another copy in save_params * black * add test * update doc * [Relay][Frontend] Span Filling PyTorch (#14050) * [Relay][Frontend] Span Filling PyTorch - Construct debug name of C graph instruction as the source name of span for pytorch model. - To get the reference of renamed nodes. Add a function to export the converted C graph after conversion. - Add structural_equal comparisons with and without set_span to the existing test cases. - Add span test cases for frequent conversions. - Add span test case for exporting model parameter. * [SpanFillingPyTorch] - Return TupleGetItem expr from TupleWrapper with the span of its Tuple. - Add None type symbol in set sapn for certain conversion. - Add current_op member varible to PyTorchOpConverter to track which op is converting for pytorch frontend. * [SpanFillingPyTorch] - Fix the error caused by the quantized params not found after renaming the debug name of C graph. 
--------- Co-authored-by: Joey Tsai <[email protected]> * [TRT][BYOC] allow strided_slice ops on selected dimensions (#14142) (#14144) * [ONNX][TOPI] Add `DFT` operator (#13999) * init convertor for DFT * init test for DFT * init DFT operator in Relay * update topi implementation for DFT * clean up * update ONNX frontend * support attribute * fix error: Expected Array[Tensor], but got Array[index 0: Array] * support inverse, onsided, dft_lenght * update tests for DFT * update TOPI test for DFT * add documentation * fix pylint * fix cpplint * fix cpplint * fix threshold for FP16 (ARM) * add CUDA compute * fix pylint * fix doc string * code review fixes for ONNX front-end * code review fixes for TOPI * rename: stft.py -> signal.py * pass input_shape and output_shape to verify_dft * [CRT][microTVM] Enable USMP by default for AoTExecutor + CRT runtime (#14107) This PR enables USMP by default when AoTExecutor and CRT runtime are selected. Check forum discussion about this change: https://discuss.tvm.apache.org/t/enable-usmp-by-default-in-aot-executor-with-runtime-crt/14406 As a result, the workspace memory in mlperftiny project type is removed since memory allocation is not required. If we keep this workspace, the model doesn't fit since some of the memory is allocated twice. * [Android] Fix using system libraries in Android apps (#14145) - Starting from API 31, using `uses-native-library` is required if we want to open system library: https://developer.android.com/about/versions/12/reference/compat-framework-changes#enforce_native_shared_library_dependencies We should specify OpenCL library in `user-native-library` in all applications where OpenCL backend might be used. - Updated README files and describe how to fix synchronization issues in Android Studio. * [microTVM]Enable TVMC micro with AoT Executor (#14077) This PR enables AoT Executor for tvmc micro compilation. 
* [bugfix] Fix the write buffer scope of `mma_store_impl` (#14174) fix * [Relay] Enhance EliminateCommonSubexpr to support Tuple argument (#14169) If an argument of a call is a Tuple, we should check its fields. Different tuples with the same fields should be treated as same inputs * [TIR] Fix typo in doc (#14178) * [microTVM] Use QNN schedules to give SOTA performance (#13752) In #13242, I rewrote microTVM's convolution schedules to give a major improvement in performance. While I demonstrated in tests that my changes worked, they could not be used with relay.build. This pull request expands the functionality of #13242 and adds new legalize and alter_op passes to take advantage of the quantized schedules. This dramatically improves performance on some models, dramatically cuts RAM usage, and removes the need for autotuning on microTVM. More specifically, for the vww model from MLPerf Tiny running on the nucleo_l4r5zi, this pull request: - Improves untuned performance from 1741 ms to 137 ms - a 6.8x improvement! - Improves tuned performance from 337 ms to 137 ms. - Sets a new state-of-the-art for MLPerf Tiny, beating Plumerai's previous 208 ms record - Reduces RAM consumption by 73 KB (a large amount on microcontrollers!) by eliminating intermediate buffers. - Reduces flash consumption for model weights by 5x - Slightly improves accuracy @mehrdadh has kindly tested these changes himself, and has confirmed my 137 ms figure. To enable the schedules that grant these performance improvements, this pull request: 1. Adds out_layout support to the regular and depthwise conv2d schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242. 2. Generalizes the schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242 to be more widely applicable. 3. Adds a layout alternation pass to ensure regular and depthwise conv2d schedules always get their desired input formats. 4. 
Adds a conv2d -> depthwise conv2d -> unpadded conv2d rewrite step to remove empty channels from conv2d operators. 5. Adds a conv2d -> average pool -> dense rewrite step to remove empty channels from conv2d operators. 6. Adds an alter_op pass to fold padding into a separate Relay operator. * Add v0.11.0 docs link to site (#14181) Update the version menu in TVM documentation to add a specific v0.11.0 release docs link. * [TIR] Allow TransformLayout with non-inversible index map (#14095) * [TIR] Allow TransformLayout with non-inversible index map TransformLayout requires the index map to have inverse map that can be calculated by the analyzer in order to check whether padding is added. However, such check doesn't always work for all cases because of limitation of the affine analysis that can only handle a set of supported patterns. In some cases, even if the index map doesn't introduce padding, the schedule primitive throws `TransformationIntroducesPaddingError` because it fails to calculate the inverse index map. It is safe to allow buffer being padded without providing pad_value because the original loop extent is not changed and the padded region is not accessed. This PR changes the behavior of `TransformLayout` to allow non-inversible index map. Previous discussion: https://discuss.tvm.apache.org/t/conflict-free-shared-memory-permutation-in-tensorir/13959/9 * add assume_injective_transform option * Apply suggestions from code review Co-authored-by: Siyuan Feng <[email protected]> --------- Co-authored-by: Siyuan Feng <[email protected]> * [TIR][Analysis] Implement IdentifyMemCpy analysis function (#13947) * [HotFix][MetaSchedule] Turn off database shash check (#14188) At this moment, the structural hash values of IR in TVM is platform dependent (e.g., the hash values of a String may differ on different platforms). 
In our recent practice, we found this an obstacle for us to apply one existing database on different platforms (say we tune an IRModule with MetaSchedule on Metal, and then apply the database on CUDA, etc.) To clear this obstacle, we decide to remove the shash value check. The purpose of that check is mainly to ensure safety, and thus turning it off will make no difference in terms of using MetaSchedule in most of the cases that we can imagine. Meanwhile, it is equally important that we need to make our structural hash platform independent. There are plans ongoing for this target. * [TOPI] Batch Norm Training Mode (#14190) Prior to this PR, TOPI batch_norm only supports inference. This PR adds training: bool flag and momentum: float argument to support training mode (update moving_mean / var and return), which aligns with torch.nn.functional.batch_norm. * [TOPI] Group normalization (#14193) As more and more ML models nowadays contain the group normalization computation, we find it beneficial to introduce this op to TOPI level. It will enable us to optimize the group normalization operation as a whole in a more convenient way. This PR introduces the group normalization op to TOPI. The group norm operation was introduced in https://arxiv.org/abs/1803.08494. The implementation uses tuple reduction, same as the implementation of layer norm. Implemented with tuple reduction, the corresponding generated TIR function can be optimized by cross-thread reduction or rfactor through MetaSchedule. Co-authored-by: Bohan Hou <[email protected]> * [Fix][TIR] LowerCrossThreadReduction with write-back predicate (#14199) Prior to this PR, the cross-thread reduction lowering pass does not add a store predicate to the write-back block. This is in consideration that for a certain write-back buffer position, all values being stored (by all the threads) in the write-back block are the same. 
Since all threads are writing the same value, we were assuming that not having a write-back block predicate is fine, because the result will not be wrong in any way. However, recently we noticed that some GPU backend compiler will capture this behavior (multiple threads writing a same position) as a race condition and thus throw compilation error. The compiler does not take the fact that all values being stored are the same, and insist on complaining. This means that we will still need the write-back block predicate to make things work. And this PR does this change. I have done integration tests locally to make sure that the generated kernels is right and produces the right results numerically. * [Unity] Relax VM (#13878) This PR implements a flexible register-based VM to execute relax programs with dynamic shape and control flow. Design: https://github.com/tlc-pack/relax/wiki/Relax-VM-Design. Co-Authored-by: Ziheng Jiang <[email protected]> Co-Authored-by: Ruihang Lai <[email protected]> Co-Authored-by: Sunghyun Park <[email protected]> Co-Authored-by: Junru Shao <[email protected]> Co-Authored-by: Prakalp Srivastava <[email protected]> Co-Authored-by: Yong Wu <[email protected]> Co-Authored-by: Steven S. Lyubomirsky <[email protected]> Co-Authored-by: Tianqi Chen <[email protected]> Co-Authored-by: Hongyi Jin <[email protected]> * [Unity] Relax expressions and types (#13901) * [Unity][IR] First-class StructInfo (#13907) * [Unity][IR] First-class StructInfo Relax tracks structural information (such as tensor shape) via `StructInfo` about the values in Relax. * Fix rust build --------- Co-authored-by: Junru Shao <[email protected]> * [Unity][CI] Unity specific jenkins setup (do not upstream to main) (#13910) This PR setup a unity specific jenkins with minimum jenkinsfile without sharding and disables most of the tests to reduce overall cost. We can add tests of unty branch by configuring the specific groovy file. 
* [Unity] Basic StructInfo Analysis and Expr construction (#13916) [Unity] Basic StructInfo Analysis and Expr construction. This PR adds struct info analysis and expr support. These are logics to construct the IR node and perform struct info related analysis. Testcases are added to cover the IR node construction and related struct info analysis checks. Co-authored-by: Tianqi Chen <[email protected]> Co-authored-by: Altan Haan <[email protected]> Co-authored-by: Andrew Liu <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Jiawei Liu <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Lesheng Jin <[email protected]> Co-authored-by: masahi <[email protected]> Co-authored-by: Prakalp Srivastava <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Steven S. <Lyubomirsky [email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Yixin Dong <[email protected]> Co-authored-by: Yong Wu <[email protected]> Co-authored-by: Ziheng Jiang <[email protected]> * [Unity] Relax BlockBuilder and ExprMutator (#13926) This PR adds BlockBuilder: the core data structure to construct Relax AST, and ExprMutator: performs AST mutation for implementing transformation passes. Co-Authored-by: Tianqi Chen <[email protected]> Co-Authored-by: Altan Haan <[email protected]> Co-Authored-by: Andrew Liu <[email protected]> Co-Authored-by: Hongyi Jin <[email protected]> Co-Authored-by: Jiawei Liu <[email protected]> Co-Authored-by: Junru Shao <[email protected]> Co-Authored-by: Lesheng Jin <[email protected]> Co-Authored-by: masahi <[email protected]> Co-Authored-by: Prakalp Srivastava <[email protected]> Co-Authored-by: Ruihang Lai <[email protected]> Co-Authored-by: Siyuan Feng <[email protected]> Co-Authored-by: Steven S. 
<Lyubomirsky [email protected]> Co-Authored-by: Sunghyun Park <[email protected]> Co-Authored-by: Yixin Dong <[email protected]> Co-Authored-by: Yong Wu <[email protected]> Co-Authored-by: Ziheng Jiang <[email protected]> * [Unity] Relax TVMScript Parser. (#13932) This PR adds the TVMScript parser/ir_builder support based on the blockbuilder. Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Tianqi Chen <[email protected]> Co-authored-by: Yuchen Jin <[email protected]> Co-authored-by: Steven S. Lyubomirsky <[email protected]> Co-authored-by: Yong Wu <[email protected]> * [Unity] Relax TVMScript Printer (#13944) This PR introduces Relax as a dialect supported by the TVMScript Printer. Some caveats: - Needs to rebase to mainline before merging. - Some tests are skiped because some operators are not upstreamed to the unity branch yet. Co-authored-by: Tianqi Chen <[email protected]> Co-authored-by: Yuchen Jin <[email protected]> Co-authored-by: Steven S. Lyubomirsky <[email protected]> Co-authored-by: Yong Wu <[email protected]> Co-authored-by: Prakalp Srivastava <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Siyuan Feng <[email protected]> * [Unity] Relax VM codegen (#13954) * [Unity] Relax VM shape lowering pass (#13956) This PR introduces Relax `FunctionPass` and `DataflowBlockPass` API, and the `VMShapeLower` pass to lower the shape expression in Relax to TIR functions and VM shape heap builtin functions. 
Co-Authored-by: Ziheng Jiang <[email protected]> Co-Authored-by: Lesheng Jin <[email protected]> Co-Authored-by: Altan Haan <[email protected]> Co-Authored-by: Junru Shao <[email protected]> Co-Authored-by: Prakalp Srivastava <[email protected]> Co-Authored-by: Ruihang Lai <[email protected]> Co-Authored-by: Siyuan Feng <[email protected]> Co-Authored-by: Steven S. <Lyubomirsky [email protected]> Co-Authored-by: Sunghyun Park <[email protected]> Co-Authored-by: Tianqi Chen <[email protected]> Co-Authored-by: Yong Wu <[email protected]> * [Unity] e2e Relax minimum build flow (#13961) This PR introduces the e2e Relax lowering flow (`relax.vm.build`). Tests for each pass in the flow are added. Co-Authored-by: Altan Haan <[email protected]> Co-Authored-by: Andrew Liu <[email protected]> Co-Authored-by: Hongyi Jin <[email protected]> Co-Authored-by: Jiawei Liu <[email protected]> Co-Authored-by: Junru Shao <[email protected]> Co-Authored-by: Prakalp Srivastava <[email protected]> Co-Authored-by: Ruihang Lai <[email protected]> Co-Authored-by: Siyuan Feng <[email protected]> Co-Authored-by: Steven S. <Lyubomirsky [email protected]> Co-Authored-by: Sunghyun Park <[email protected]> Co-Authored-by: Tianqi Chen <[email protected]> Co-Authored-by: Yong Wu <[email protected]> Co-Authored-by: Ziheng Jiang <[email protected]> * [Unity][TVMScript] Use explicit `R.shape` in TVMScript (#13979) As we've introduced `arg_sinfo` in CallNode, implicit shape constructor is not widely used in TVMScript. This PR removes the implicit shape since it may cause confusion between shape and tuple. * [Unity] Relax op: index (#13987) This PR is about the high-level tensor computation operators in Relax. This PR includes the tensor indexing operators. * [Unity] Relax op: datatype (#13986) * [Unity] Relax op: set (#13990) This PR is about the high-level tensor computation operators in Relax. This PR includes the set operators. 
Co-authored-by: Prakalp Srivastava <[email protected]>
* [Unity] Relax op: image (#13994) This PR is about the high-level tensor computation operators in Relax. This PR includes the image operators.
* [Unity] Relax op: arithmetic, comparison (#13983) This PR is about the high-level tensor computation operators in Relax. This PR includes the unary, binary and ternary arithmetic and comparison operators. Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Chaofan Lin <[email protected]>
* [Unity] Relax op: statistical (#13991) This PR is about the high-level tensor computation operators in Relax. This PR includes the statistical operators.
* [Unity] Relax op: neural networks (#13993) This PR is about the high-level tensor computation operators in Relax. This PR includes the neural network operators.
* [Unity] Relax op: creation (#13984) This PR is about the high-level tensor computation operators in Relax. This PR includes the tensor creation operators.
* [Unity] Relax op: linear algebra (#13988) This PR is about the high-level tensor computation operators in Relax. This PR includes the linear algebra operators. Co-authored-by: Siyuan Feng <[email protected]>
* [Unity] Relax op: search (#13992) This PR is about the high-level tensor computation operators in Relax. This PR includes the search operators.
* [Unity] Relax op: manipulation (#13989) This PR is about the high-level tensor computation operators in Relax. This PR includes the tensor manipulation operators. Co-authored-by: Prakalp Srivastava <[email protected]>
* [Unity] NestedMsg Support utility (#13995) This PR introduces NestedMsg to robustly handle nested-tuple analysis. Relax supports nested tuple structures in the IR. Nested tuple structure is important to support advanced groupings in cases such as gradient calculation and other scenarios. The possible presence of nested tuples means that we need to robustly handle analysis that contains nested tuple structures in a dataflow graph.
This PR introduces a NestedMsg<T> class that corresponds to a possibly nested message tuple for a given leaf message class T. We also introduce various helper functions to compose and decompose messages. Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Yixin Dong <[email protected]> Co-authored-by: Ruihang Lai <[email protected]>
* [Unity][Pass] Operator Fusion Passes (#14001) [Unity][Pass] Operator fusion passes This PR introduces three passes for operator fusion: 1. AnnotateTIROpPattern: analyzes the operator kind from PrimFunc. 2. FuseOps: fuses operators for Relax functions, which adds a new fused relax primitive function. 3. FuseTIR: fuses corresponding TIR PrimFuncs for the fused relax.
* [Unity][Pass] LambdaLift pass (#14012)
* [Unity][VM] Supporting "compiled" exec mode. (#14015) [VM] Supporting "compiled" exec mode. This PR adds support of "compiled" mode to the VM. The compiled mode translates the relax function into a TIR function and drives it through the TIR function. It is different from the micro AOT codegen, which generates TIR code that targets the micro C runtime environment and is useful for resource-limited settings with a smaller set of features. Both leverage the low-level TIR build that is also shared with TensorIR. The current implementation targets the full TVM (VM) runtime, which comes with PackedFunc, object, tuple, closure and all kinds of rich structure support. This also means that we can leverage the full runtime support to handle things like allocation, dynamic shape, easy plugins and python interaction, which are not available in a more limited runtime. The user directly uses the same API to load the generated code regardless of compiled mode or bytecode, and just needs to change one line ```python ex = relax.vm.build(mod, target, exec_mode="compiled") ``` The simplicity is thanks to the TVM runtime architecture that allows us to compose things together in objects.
The only difference is how the PackedFunc of high-level driving is being provided. In the case of bytecode it is normal interpretation and in the case of compiled mode it is TIR. It is a complete implementation Unit-testcases are added. All codegen build tests are updated to include two exec_modes and have passed locally. Co-authored-by: Junru Shao <[email protected]> * [Unity][Pass] BindParams pass, FoldConstant pass (#14016) This PR introduces FoldConstant/BindParam passes. Co-authored-by: Yong Wu <[email protected]> Co-Authored-by: Hongyi Jin <[email protected]> Co-Authored-by: Siyuan Feng <[email protected]> * [Unity][Pass][TuningAPI] Introduce TuningAPI and MetaSchedule pass (#14014) Add TuningAPI and MetaSchedule tuning pass * [Unity] Relay -> Relax translator (#14026) This PR implements a Relay to Relax translator, which allows us to import Relay workloads to Relax for benchmarking and development purposes (tests and examples are added). * [Unity][Pass] Normalize Pass (#14031) This PR implements relax `Normalize` Pass, which allows users to transform Relax IR to normal form, i.e., the expressions are normalized (no nesting and hence the AST is in ANF), and all `checked_type_` and `shape_` of expressions are available. (tests are added). Co-Authored-by: Yuchen Jin <[email protected]> Co-Authored-by: Ruihang Lai <[email protected]> Co-Authored-by: Siyuan Feng <[email protected]> Co-Authored-by: Tianqi Chen <[email protected]> * [Unity][BlockBuilder] CallTE convert PrimValue args (#14028) Prior to this PR, the `call_te` of BlockBuilder is not capable of converting PrimValue arguments and directly rejects PrimValues instead. This PR fixes this behavior with PrimValue conversion support and one regression test. Co-authored-by: Siyuan Feng <[email protected]> * [Unity][Pass] Wellformed Analysis (#14032) This PR implements relax wellformed analysis, which checks if the IRModule is well-formed. (tests and examples are added). 
Co-Authored-by: Ruihang Lai <[email protected]> Co-Authored-by: Siyuan Feng <[email protected]> Co-Authored-by: Tianqi Chen <[email protected]> Co-authored-by: Steven S. Lyubomirsky <[email protected]> Co-authored-by: Yong Wu <[email protected]> Co-Authored-by: Yuchen Jin <[email protected]> Co-Authored-by: Yixin Dong <[email protected]> Co-Authored-by: Chaofan Lin <[email protected]> Co-Authored-by: Prakalp Srivastava <[email protected]> Co-Authored-by: Junru Shao <[email protected]>
* [Unity][TVMScript] Move tir/relax import in script out of __init__.py (#14033) Prior to this PR, `python/tvm/script/__init__.py` imports both tir and relax submodules. This leads to the phenomenon that when people do ```python from tvm.script import tir as T ```, the relax submodule will be implicitly visited by `__init__.py` as well. Since TIR does not rely on Relax, it is good not to import both of them at the same time. (This can prevent cyclic imports sometimes.) This PR does this decoupling by introducing two files * `python/tvm/script/relax.py` * `python/tvm/script/tir.py` and removing the imports from `python/tvm/script/__init__.py` and `python/tvm/script/parser/__init__.py`. With this change, we force people to manually do `from tvm.script import tir` and `from tvm.script import relax` to use the TVMScript parser, which is exactly our conventional way.
* [Unity][Pass] Operator legalization (#14029) This PR is the operator legalization pass, which transforms high-level operator calls to `call_tir`s of corresponding low-level TIR PrimFuncs. - The legalization pass provides customizability, which enables people to pass in a customized legalization map to override the default legalization method. - The legalization supports symbolic shape. (At this moment only pooling does not support symbolic shape, as TOPI pooling does not support it. This needs to be fixed in followup PRs.)
Co-authored-by: Chaofan Lin <[email protected]> Co-authored-by: Yixin Dong <[email protected]> Co-authored-by: Siyuan Feng <[email protected]> * [Unity][Op] Add ShapeExpr Tests for Reshape Op (#14035) This PR specially checks the relax.reshape operator when the input is a ShapeExpr. * [Unity] Initial PyTorch Frontend (#14037) [Unity] Initial PyTorch Frontend This PR introduces initial pytorch frontend components of Relax, including - a FX translator that translates a Torch FX graph module to an TVM IRModule, - a Relax-backend of Torch Dynamo, which brings the mechanism to build PyTorch model using Relax compilation pipeline, - a pipeline prototype that contains the collection of pre-defined pipelines that optimizes and lower IRModule before passing to minimum build. Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Tianqi Chen <[email protected]> Co-authored-by: Siyuan Feng <[email protected]> * [Unity][Pass] Block-level static memory planning (#14038) This PR introduces the static memory planning pass on binding block level, as well as an analysis function that estimate the memory usage after the memory planning pass. It supports the following features: nested-tuples, reuse memory of the input of reshape ops, an estimator that returns total memory size needed to be allocated before and after memory planning, as well as the number of tensors / memory blocks to be allocated before and after memory planning. The estimation is static -- it does not consider control flows (such as “if” and cross-function calls). It simply accumulates the size of every alloc_tensor and alloc_storage. We will produce “`relax.memory.alloc_tensor/storage`” as the results produced by memory planning. 
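The static estimate described in the memory planning entry above can be pictured with a minimal sketch. It assumes a hypothetical flat list of allocation records (the `Alloc` class and the `estimate_bytes` helper are illustrative and not part of TVM) and simply sums the byte size of every allocation, ignoring control flow:

```python
from dataclasses import dataclass
from math import prod

# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"float32": 4, "float16": 2, "int32": 4, "int8": 1}


@dataclass
class Alloc:
    """Hypothetical record of one tensor or storage allocation."""

    shape: tuple
    dtype: str


def estimate_bytes(allocs):
    """Sum the size of every allocation; control flow is not considered."""
    return sum(prod(a.shape) * DTYPE_BYTES[a.dtype] for a in allocs)


# Before planning: three separate float32 tensors of 6 elements each.
before = [Alloc((2, 3), "float32"), Alloc((2, 3), "float32"), Alloc((6,), "float32")]
# After planning: suppose two of them can share one storage block.
after = [Alloc((6,), "float32"), Alloc((6,), "float32")]

saved = estimate_bytes(before) - estimate_bytes(after)
```

Comparing the two totals gives the kind of before/after figure the estimator reports; the sharing decision itself is made by the planning pass and is only assumed here.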
* [Unity] Disallow inline prim_func in relax IR (#14040) Disallow inline prim_func in relax IR
* [Unity] Update tests to adapt to latest TVMScript syntax (#14039) Given that some recent TVMScript syntax changes have been merged, some test files now contain deprecated uses of TVMScript syntax. This PR updates the test files to the latest TVMScript syntax so that running the tests does not trigger deprecation warnings. Co-authored-by: Tianqi Chen <[email protected]>
* [Unity] Relax dataflow pattern language (matching) (#14041) The dataflow pattern language for Relax (originally from https://github.com/tlc-pack/relax/pull/163). The implementation splits patterns into two parts: - Match an Expression: match an expression syntactically (MatchExprPattern, i.e., DFPatternMatcher); - Match a Graph: match a graph (across multiple VarBinding) topologically (MatchGraphPattern);
* [Unity] Statement rewriter for DataflowBlock (#14043) This PR implements a few APIs to quickly perform statement-level mutation: `add`/`remove_unused`/`remove_all_unused`/`replace_all_uses`. It also implements `remove_all_unused` to remove dead statements inside `DataflowBlock`.
* [Unity][Pass] FuseOps FuseTIR fixes (#14044) This PR fixes two bugs in FuseOps and FuseTIR: First, FuseOps only rewrote the "main" function of the IRModule. After the fix, FuseOps goes through each non-primitive Relax function. Test cases for both FuseOps and FuseTIR are added to ensure that both passes work with multiple Relax functions. Second, FuseOps and FuseTIR did not take "call_dps_packed" style "call_tir" into account. The previous behavior directly downcast the first argument of "call_tir" to GlobalVar, which is not right when the "call_tir" is in "call_dps_packed" style and the first argument is a PackedFunc. With this fix, FuseOps and FuseTIR skip such "call_tir"s. Tests for both CallTIR and CallOps are added accordingly.
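The idea of removing dead statements from a dataflow block, as in the statement rewriter entry above, can be sketched in a few lines. This is an illustration only: `prune_unused_bindings` and the `(var, reads)` tuple representation are hypothetical and do not mirror the actual Relax API.

```python
def prune_unused_bindings(bindings, outputs):
    """Drop bindings whose variable is never read by a live binding or output.

    `bindings` is a list of (var, [vars it reads]) in definition order.
    """
    live = set(outputs)
    kept = []
    # Walk backwards so every use of a variable is seen before its definition.
    for var, reads in reversed(bindings):
        if var in live:
            live.update(reads)
            kept.append((var, reads))
    kept.reverse()
    return kept


block = [
    ("a", ["x"]),
    ("b", ["a"]),
    ("dead", ["a"]),  # never read by anything that reaches the output
    ("c", ["b"]),
]
result = prune_unused_bindings(block, outputs=["c"])
```

A single backward pass is enough here because bindings in a dataflow block are in definition order, so chains of dead bindings are removed together.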
* [Unity][TVMScript] Overload `__neg__` for relax expr (#14045) This PR overloads `__neg__` given that `relax.negative` is now supported. Besides, it adds `test_op_misc.py` and brings tests for calling overloaded operators. * [Unity][VM] Add per-op profiling support (#14053) Adds per-op profiling support to Relax VM, in a way similar to how Relay VM is instrumented via the common profiling infra in the runtime. Profiling over RPC is supported. Example output: ``` Name Duration (us) Percent Device Count Argument Shapes conv2d1 705,779.00 51.22 hexagon0 1 float32[1, 64, 56, 56], float32[1, 64, 54, 54] conv2d 669,589.00 48.60 hexagon0 1 float32[1, 64, 56, 56], float32[1, 64, 56, 56] relu 683.00 0.05 hexagon0 1 float32[1, 64, 56, 56], float32[1, 64, 56, 56] relu1 679.00 0.05 hexagon0 1 float32[1, 64, 54, 54], float32[1, 64, 54, 54] vm.builtin.check_tensor_info 28.00 0.00 hexagon0 1 float32[1, 64, 56, 56] vm.builtin.match_shape 25.00 0.00 hexagon0 1 float32[1, 64, 56, 56] ---------- Sum 1,376,783.00 99.93 6 Total 0.00 cpu0 1 Total 1,377,809.00 hexagon0 1 Configuration ------------- Number of threads: 4 Executor: VM ``` The original PR: https://github.com/tlc-pack/relax/pull/422 * [Unity][BYOC] Add pattern-based partitioning pass (#14054) This adds a new pass, FuseOpsByPattern, which applies pattern matching to each function in the given module, and groups matched expressions into a new function. The end result is similar to FuseOps, but fusion is driven completely by the provided patterns. The implementation also reuses OperatorFusor used by FuseOps to create grouped functions from partitioned groups, further illustrating the similarity between the two passes. The new pass will serve the same role the MergeComposite pass plays in Relay BYOC - grouped functions are annotated with the "composite" attribute to denote what operations a given function consists of, and offloaded to external backends. 
But it can also be useful in non-BYOC settings, for example to support advanced fusion that the op-kind based one doesn't handle (fused MHA, conv2d / gemm + reduction fusion, etc). The original PR: https://github.com/tlc-pack/relax/pull/366
* [Unity] Relax op: collapse sum (#14059) This PR brings high-level operators `relax.collapse_sum_like` and `relax.collapse_sum_to`, which are useful when doing AD in Relax. To achieve this, it exposes the interface of `topi.collapse_sum`. Moreover, this PR also implements the legalization of these ops and adds corresponding tests.
* [Unity][Fix][Pass] Fix FuseOps for lack graph edges (#14058) This PR fixes a mistake of #14044. In #14044, in VisitLeaf of graph construction of FuseOps, we first check if the input node is Leaf and then check if it is Tuple. This is not right: as Tuple is not categorized as a leaf node, when the input node is a Tuple, the function returns early since the input is not a LeafNode, and the check for Tuple therefore never succeeds. It is quite interesting that our existing unit tests failed to catch this mistake. I added a regression test for this case, which ensures that the tuple is always visited.
* [Unity][Pass] Remove Unused Function (#14061) This PR implements a pass to clean up unused functions. Co-authored-by: masahi <[email protected]>
* [Unity][BYOC] Add pass to merge composite functions to offload large subgraphs (#14062) This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large a subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegion pass in Relay BYOC does, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830.
Original PR https://github.com/tlc-pack/relax/pull/372 Substantial improvement by @yelite https://github.com/tlc-pack/relax/pull/411 Related fix PR by @yelite https://github.com/tlc-pack/relax/pull/406 Co-authored-by: Lite Ye <[email protected]> * [Unity][Frontend] Annotate number of non-static input of FX function (#14067) * [Unity][Transform] Add LiftTransformParams pass (#14069) This PR added a pass `LiftTransformParams`. It allows to compile the end-to-end model without weights provided. The idea is annotate the input parameters that are weights, and identify and lift the transformations to weights, and compile it to a separate function `transform_params` that can be executed in runtime. Users can run `transform_params` with weights to get the weights for the optimized model as a prep step before the deployment. In this way, we perform the same optimizations and defer the weight transformations to the user side, while the overhead of the deferred weight transformation can be ignored as it only need to be run once. This pass is integrated with the default `vm.build`. It is optional and only necessary when the parameters are kept as inputs when importing the model from the frontend. * [Unity][BYOC][Pass] RunCodegen and TensorRT (#14078) This PR introduces the fundamental workflow for BYOC and integrate TensorRT as a demonstration. * [Unity][Pass] Canonicalize Bindings (#14079) It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions. This pass will take chains of definitions and replace references to later definitions to the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. Additionally, `MatchCast` bindings where the LHS and the RHS are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s. 
Example: ```python y = x z = y w = z o = w p = o ``` Will be replaced with ```python y = x z = x w = x o = x p = x ``` Original PR: https://github.com/tlc-pack/relax/pull/233 Co-authored-by: Steven S. Lyubomirsky <[email protected]> * [Unity] Add testcases for `expr_args_converter` (#14080) This is a missing test file when we added the `expr_args_converter`. This PR adds it back. * [Unity][BYOC] Add CUTLASS backend (#14081) Co-authored-by: Lite Ye <[email protected]> * [Unity][BYOC] Add DNNL backend (#14082) This PR adds dnnl backend to the unity flow. * [Unity][Op] `log_softmax` and `cross_entropy_with_logits` (#14083) This PR introduces two high-level operators log_softmax and cross_entropy_with_logits, which are important when we are calculating CrossEntropyLoss (in torch). Co-authored-by: Yixin Dong <[email protected]> * [Unity][Analysis] TIR pattern kind analysis for multi-buffer write block (#14075) This PR supports TIR pattern kind analysis for TIR blocks which write to multiple buffers, which is helpful for normalization operators like layernorm, groupnorm, etc. Prior to this PR, the analyzer does not support a blocks which write to multiple buffers. On seeing such a block, the analyzer simply sets the analysis result to "opaque". With this PR, on seeing a block which writes multiple buffers, the analyzer will check if all the BufferStores have the same indices. And it will only set the result to "opaque" when the BufferStores have different indices. By doing this, the analysis works for common cases where a block may write to multiple buffers, like layernorm or groupnorm. Besides the unit test for the analysis itself, this PR also adds a unit test for FuseOps pass, make sure that a "layernorm + relu" pattern can be fused together. * [Unity][Fix][Pass] FoldConstant with DCE in dataflow block (#14087) The current FoldConstant pass does not support removing unused bindings in the post-folding function. 
Therefore, for large real-world models, the built executable will be overlarge because of the redundant unused constants. This PR removes the redundant unused constant bindings in FoldConstant by using the analysis function "RemoveAllUnused". Note that "RemoveAllUnused" only works at dataflow block level. Therefore FoldConstant will not remove unused bindings outside of dataflow block as well. * [Unity] Refactor Relax Build JIT UX (#14088) This PR refactors relax build so it get exposed at the opt-level. We also introduces an explicit jit functionality to handle live loading of compiled artifacts from cutlass. We also move relax vm to runtime so it can be clearly isolated from the rest of the compiler stack. * [Unity][Relax] Set Shape Function to Be Host Function (#14090) Set shape function to be host func. * [Unity] Fix typo in the comment (#14096) * [Unity] Lower `shape_of` to a builtin (#14093) This PR lowers shape_of op to a Relax VM builtin, and changes a utility function to take StructInfo as input. Co-authored-by: Steven S. Lyubomirsky <[email protected]> * [Unity] Relax Recursive function (#14092) This PR adds TVMScript local recursive function support. It also update lambda lifting pass. Removed CalledGlobalVars, it was not used anymore. It also updates well-form pass to allow un-defined vars for recursive call * [Unity][Layout] Add layout transformation analysis for PrimFunc (#14066) * [Layout] Add layout transformation analysis for PrimFunc. This change adds a PrimFunc level analysis to suggest layout transformations to block and buffers in the PrimFunc based on the layout transformations to PrimFunc outputs. * Add support for multiple blocks such as split op. * Add negative tests and increase coverage. * fix warning message * fix lint * remove unused header * Address comments. 
Moved some utility functions to support/array.h improve doc * fix deprecation warn T.var("int64") to T.int64() * address comments
* [Unity] Remove attributes of relax.print, assert and unique (#14101) Remove the attributes of operators assert, print and unique. Use PrimValue as substitute. Co-authored-by: Steven S. Lyubomirsky <[email protected]> Co-authored-by: Prakalp Srivastava <[email protected]>
* [Unity][BYOC] Add relax backend pattern registry (#14106) * Add relax backend pattern registry * Add doc
* [Unity] Update tests again to adapt to latest TVMScript syntax (#14115) * finished * fix * rollback merge_composite_functions
* [Unity][Fix] Fix bug in MergeCompositeFunctions (#14117) Currently `MergeCompositeFunctions` modifies the map while iterating over it, which makes tests/python/relax/test_transform_merge_composite_functions.py fail. This PR fixes this bug.
* [Unity][BlockBuilder] Add `name_hint` argument for `emit` and `emit_output` (#14126) This PR adds the `name_hint` argument to the `emit` and `emit_output` APIs of the Relax BlockBuilder. The argument exists on the C++ side but is not exposed to the Python side (so users of the Python `bb.emit` get `name_hint` as `""` by default).
Co-authored-by: Yixin Dong <[email protected]>
* [Unity][WEB] Relax vm on web runtime (#14131) This PR brings initial relax vm support on web runtime
* [Unity] Add Global info (#14132)
* [Unity][BYOC] Add transposed matmul support to Relax CUTLASS BYOC (#14128) Add transposed matmul support for Relax CUTLASS
* [Unity][TVMScript] emit_te sugar (#14123) This PR adds R.emit_te meta-programming mechanism to emit a topi operator from TVMScript
* [Unity][BYOC] Assign group to unused bindings and ignore PrimFunc (#14139) * [Unity][BYOC] Assign group to unused bindings and ignore PrimFunc * Update fuse_ops.cc
* [Unity] Add callback to FuseOpsByPattern to check match result is accepted (#14109) * [Unity] Add callback to FuseOpsByPattern to check match result is accepted * add callnode to callback args * update pattern registry * fix
* [Unity][Legalize] Fix Scalar Constant Legalization (#14127) This PR fixes the loss of data type during legalization. Previously, if we used a constant scalar in operators like `multiply`, it was automatically converted to a Python scalar, which may lose its original data type. For example, `float16` may become Python `float` and be interpreted as `float32` later. This is now fixed by avoiding scalar value conversion. The conversion could be added back once we have better support for scalar prim value. Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Wuwei Lin <[email protected]>
* [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. (#14146) * [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. This uses the registered legalization function attached to the op to lower it to call_tir and uses the existing call_tir folding mechanism to fold it. This kind of op folding is only allowed within dataflow block as ops could have side-effects. Limitations: * This currently does not support folding ops that could lower to multiple call_tir bindings.
* Folding by evaluating ops is not always beneficial. We need a heuristic to check if it is useful. This is not implemented yet and folding is always allowed by evaluating expressions. * fix ci error * fix doc * fix bug * [Unity][Debugging] AST printer (#14152) This PR transfers over the AST printer from tlc-pack/relax. The AST printer is a debugging tool that prints out a Relax AST in a precise and human-readable format, which can be helpful for debugging the parser or various passes. Co-authored-by: Yuchen Jin <[email protected]> Co-authored-by: Lesheng Jin <[email protected]> Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Tianqi Chen <[email protected]> * [Unity][Pass] Support Symbolic Shape Deduction during BindParam (#14154) `BindParam` replace function params to constant nodes. However, it will drop the shape information of the params, considering the following case: ```python @R.function def main( x: R.Tensor(("batch", "m"), dtype="float32"), w0: R.Tensor(("n", "m"), dtype="float32"), b0: R.Tensor(("n",), dtype="float32"), w1: R.Tensor(("k", "n"), dtype="float32"), b1: R.Tensor(("k",), dtype="float32"), ) -> R.Tensor(("batch", "k"), dtype="float32"): batch = T.Var("batch", "int64") k = T.Var("k", "int64") m = T.Var("m", "int64") n = T.Var("n", "int64") with R.dataflow(): lv0 = R.call_tir("linear0", (x, w0, b0), out_sinfo=R.Tensor((batch, n), dtype="float32")) out = R.call_tir("linear1", (lv0, w1, b1), out_sinfo=R.Tensor((batch, k), dtype="float32")) R.output(out) return out ``` The current pass will simply drop the symbolic var `n`, `k` and cause undefined vars during build as ```python @R.function def main(x: R.Tensor((1, "m"), dtype="float32")) -> R.Tensor(dtype="float32", ndim=2): m = T.Var("m", "int64") n = T.Var("n", "int64") k = T.Var("k", "int64") with R.dataflow(): lv0 = R.call_tir("linear0", (x, metadata["relax.expr.Constant"][0], metadata["relax.expr.Constant"][1]), 
out_sinfo=R.Tensor((1, n), dtype="float32")) out = R.call_tir("linear1", (lv0, metadata["relax.expr.Constant"][2], metadata["relax.expr.Constant"][3]), out_sinfo=R.Tensor((1, k), dtype="float32")) R.output(out) return out ``` This PR updates the pass to bind the symbolic shape during binding. * [Unity][Analysis] Checking function return struct info in well-formed check (#14155) The current well-formed misses the check of function return struct info, which may mistakenly pass the check if there are undefined vars in the function return struct info. * [Unity][BYOC] Use Relax legalize + CPU build for reference in tests (#14162) * clean dnnl test * clean trt test * clean cutlass test * fix gelu legalize for fp16 * use memoize in dnnl and trt tests * [Unity] Add bind_constants option to FuseOpsByPattern (#14151) * [Unity] Add lift_constatns option to FuseOpsByPattern * lift_constants -> bind_constants * [Unity][Analysis] Analysis for detecting recursion in Relax (#14149) * DFS based attempt to detect mutual recursion * Use Johnson's circuit-detecting algorithm instead * Fix control flow test * Detect all recursion anyway * Add new test cases for simple recursion * Fix mistake in test case * Include missing dependencies * Remove trailing whitespace * Dependencies are simply references, not necessarily calls * More trailing whitespace * Newline at end of file * Fix spacing in docstring Co-authored-by: Siyuan Feng <[email protected]> --------- Co-authored-by: Siyuan Feng <[email protected]> * [Unity][BYOC] Add batch matmul support to Relax CUTLASS BYOC (#14166) * Add batch matmul support to Relax CUTLASS BYOC * Allow more dtypes * Fix tests * Revert how to get batch attr * [Unity][Op] Full support of Relax op `power` (#14171) This PR provides a full support of `R.power` including op registering, legalization, overloading `__power__` for Expr and torch fx frontend. 
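For the operator overloading mentioned in the `R.power` entry above: in Python's data model the `**` operator dispatches to `__pow__` (and to `__rpow__` when the left operand is a plain number). A minimal sketch with a hypothetical `Expr` class, which is a stand-in and not the Relax class, that builds a call node instead of computing a value:

```python
class Expr:
    """Hypothetical stand-in for an IR expression node."""

    def __init__(self, op, args=()):
        self.op = op
        self.args = tuple(args)

    def __pow__(self, other):
        # `self ** other` builds a "power" call node.
        return Expr("power", (self, _wrap(other)))

    def __rpow__(self, other):
        # `other ** self` when `other` is a plain Python number.
        return Expr("power", (_wrap(other), self))

    def __repr__(self):
        if not self.args:
            return str(self.op)
        return f"{self.op}({', '.join(map(repr, self.args))})"


def _wrap(value):
    """Turn Python scalars into constant nodes; pass expressions through."""
    return value if isinstance(value, Expr) else Expr(value)


x = Expr("x")
y = x ** 2
z = 2 ** x
```

Here `repr(y)` is `power(x, 2)` and `repr(z)` is `power(2, x)`, which shows why both the normal and the reflected method are needed for mixed scalar and expression operands.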
* [Unity][Analysis] Restore Python bindings for var analyses (#14180) Restore Python bindings for var analyses * [Unity][OP] Add an operator for fused multi head attention (#14150) * [Unity][OP] Add an operator for fused multi head attention This PR introduces the new relax operator `R.nn.attention` for fused multi head attention, and the support of fused multi head attention to relax cutlass BYOC. The input of the operator are query, key and value tensor, with `BSNH` layout, namely `[batch size, sequence length, number of heads, dimension of heads]`. And the output shares the same layout with all input tensor. * remove useless codes, remove attrs and add memoize * add more dispatches * nit and fix rebase * fix linter * add support for bias * fix lint * BNSS layout for bias * update doc * fix typo * support bias broadcast * [Unity][WEBGPU] Codegen improvements and WebRuntime (#14187) This PR makes various improvements web codegen in relax web runtime. Correct support of shift operators. Update relax vm to make most use of internal allocators. Update the webgpu API to the latest spec. * [Unity][Transform] LiftTransformParams handling multiple functions (#14192) Previously, the LiftTransformParams pass only works on function `"main"`. This is a bit restrictive as in our recent practice on stable diffusion, there are cases where multiple Relax functions inside an IRModule all need to be transformed. Therefore, this PR enhances the LiftTransformParams pass, so that it will now transform **all** functions **with attribute `num_input`**. For functions without this attribute, the pass will simply skip them. * [Unity][Op] Group normalization (#14194) * [TOPI] Group normalization As more and more ML models nowadays contain the group normalization computation, we find it beneficial to introduce this op to TOPI level. It will enable us to optimize the group normalization operation as a whole in a more convenient way. This PR introduces the group normalization op to TOPI. 
The group norm operation was introduced in https://arxiv.org/abs/1803.08494. The implementation uses tuple reduction, same as the implementation of layer norm. Implemented with tuple reduction, the corresponding generated TIR function can be optimized by cross-thread reduction or rfactor through MetaSchedule. Prior to this PR, the group normalization operations in frontend models are translated to a series of operations, which brings inconvenience when we want to optimize the group norm op as a whole. With the TOPI implementation of group norm being introduced by #14193, we can now use it to legalize the high-level group norm op and optimize it using cross-thread reduction or rfactor via MetaSchedule. Co-authored-by: Bohan Hou <[email protected]> * [Unity][Op] Argmax and argmin (#14195) This PR introduces full support to the argmax and argmin op to the unity branch, including the structure info inference, the legalization, and the translation from Torch FX. * [Unity][Op] Legalize `round`, `floor`, `ceil`, `sign` (#14198) This PR implements the legalization for four unary operators: * round, * floor, * ceil, * sign. Unit tests are provided accordingly. * [Unity][Frontend] FX translator supporting more ops (#14196) This PR improves the torch FX translator in the following perspectives: * support unary op `sigmoid` and `round`, * support in-place `fill`, `triu` and `tril`, * support `tensor`, `arange`, `empty`, * support `bmm` (batch matrix multiplication), * support `astype`, * support `chunk` and `squeeze`. This PR also fixes `Embedding`. Previously the translation assumes that the input to Embedding will only be 1-dimensional, and will throw exception when the input has more than one dimension (i.e., batched). This PR brings the support. 
* [Unity][Frontend] FX translator returning weights with `keep_params_as_input` (#14197) PR #14067 introduces the flag `keep_params_as_input` to the FX translator, in the purpose to handle to model weights outside of the translated Relax function. This PR takes a further step, by returning the model weights as NDArrays when the flag `keep_params_as_input` is true. With this PR, the translator now can return back the weights upon requested. Otherwise, after the import we will lose the model weights in the given PyTorch model. * [Unity][Fix] FX translating dtype (#14201) This PR fixes a bug of the current FX translator when dealing with dtype. Previously, the translator does not take the cases ```python dtype = x.getattr("dtype") ``` into consideration. In this case, the dtype will be a fx.Node object, while the translator assumes that the dtype is either a string or a torch native datatype (e.g., torch.float32). This PR fixes this by doing an environment table lookup before for all dtypes. * [Unity][TIR][Pass] ForceNarrowIndexToInt32 (#14203) [TIR][Pass] ForceNarrowIndexToInt32 This PR introduces a pass which forces every index expression in a PrimFunc to have dtype int32. Meanwhile, it also checks if all integer buffers in the PrimFunc have int32 dtype, and report error if some integer buffer has dtype other than int32. In terms of implementation, this pass leverages the IndexDataTypeNormalizer, with the target dtype being int32. This PR contains a few basic tests that come from `test_tir_transform_narrow_datatype.py`, and contains some negative tests as well. * [Unity][Frontend] FX translator support torch.baddbmm (#14202) This PR brings the support of translating `torch.baddbmm` into combination of operators (matmul, add, multiply). Unit tests are provided accordingly. This PR also fixes the kwarg fetching issue of `torch.interpolate`. 
* [CI] Point cpu ci to dep with onnx (#40) Point cpu ci to dep with onnx * [Unity] Introduce Default GPU Schedule Pass (#14182) * Implement default schedule. * Add test. * Add tests. * Fix linting. * Skip scheduled blocks. * Address issues. * Use target current. * Minor fixes. * Remove Mutator. * Move pas…
…subgraphs (#14062) This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegion pass in Relay BYOC does, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830. Original PR tlc-pack/relax#372 Substantial improvement by @yelite tlc-pack/relax#411 Related fix PR by @yelite tlc-pack/relax#406 Co-authored-by: Lite Ye <[email protected]>
A part of #364
This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large a subgraph as possible, for example TensorRT. It plays the same role as the `MergeCompilerRegion` pass does in Relay BYOC, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830. As you can imagine, the problem becomes tricky when branches diverge and merge.
[Figure: Before]
[Figure: After]
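To make the merge condition concrete, here is a minimal, self-contained sketch in plain Python. It is not the code in this PR: the function name `merge_groups`, the `(name, target, inputs)` node encoding, and the restriction that a node joins at most one parent group are simplifying assumptions for illustration. A node joins a parent's group only if both use the same backend and no other parent transitively depends on that group, since merging would otherwise create a cyclic dependency between groups.

```python
def merge_groups(nodes):
    """Assign a group id to every node.

    `nodes` is a list of (name, target, inputs) tuples in topological order.
    `target` is a backend name, or None for nodes that stay on the host.
    (Hypothetical encoding, chosen only for this sketch.)
    """
    group_of = {}   # node name -> group id
    target_of = {}  # group id -> backend name (or None)
    deps = {}       # group id -> set of group ids it transitively depends on

    for name, target, inputs in nodes:
        # Distinct parent groups, in first-seen order.
        parents = list(dict.fromkeys(group_of[i] for i in inputs))

        chosen = None
        if target is not None:
            for g in parents:
                if target_of[g] != target:
                    continue
                # Joining g adds edges g -> p for every other parent p.
                # If some p already depends on g, that would close a cycle.
                if any(g in deps[p] for p in parents if p != g):
                    continue
                chosen = g
                break

        if chosen is None:
            # No safe parent group: start a new group for this node.
            chosen = len(target_of)
            target_of[chosen] = target
            deps[chosen] = set()

        group_of[name] = chosen

        # Record the new dependencies of the chosen group ...
        for p in parents:
            if p != chosen:
                deps[chosen] |= {p} | deps[p]
        # ... and propagate them to every group that depends on it,
        # so that `deps` stays transitively closed.
        for g, d in deps.items():
            if chosen in d:
                d |= deps[chosen]

    return group_of


chain = [("a", "trt", []), ("b", "trt", ["a"]), ("c", "trt", ["b"])]
diamond = [("a", "trt", []), ("b", None, ["a"]), ("c", "trt", ["a", "b"])]
print(merge_groups(chain))    # {'a': 0, 'b': 0, 'c': 0}
print(merge_groups(diamond))  # {'a': 0, 'b': 1, 'c': 2}
```

In the diamond case, `c` cannot join `a`'s group because the host node `b` sits between them: the merged group would have to run both before and after `b`.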
An interesting aspect of my implementation is that this new pass reuses the same function-grouping mutator that `FuseOps` and `FuseOpsByPattern` use. The only difference between these passes is, again, the way subexpressions are partitioned into groups. Since the new pass is supposed to run after `FuseOpsByPattern`, we are essentially running one fusion pass on the output of another fusion pass (i.e., fusion of subgraphs, each of which is a fusion of ops). For now, the new pass is named `MergeCompositeFunctions`, and the function-grouping mutator (`OperatorFusor` in `fuse_ops.cc`) is made reusable from outside as the `MakeGroupedFunctions` function, but I welcome suggestions for better names for these functions.
A bug in `OperatorFusor` when a tuple-producing function is involved
This was found while I was working on the complicated example from https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830. Currently, bindings in `DataflowBlockNode` are processed in the original order, but this is incorrect in some cases; see the example below. Group B2 depends on group A1, which produces a tuple, so the new grouped function for A1 must be emitted before the one for B2. Depending on where the binding for the node in B2 is defined in the original order, the grouped function for B2 may be emitted before A1, consuming a variable in A1 that becomes invalid after it is remapped to the result of `TupleGetItem`.
This is fixed by processing bindings in the order of the topological sort of the group dependency relations. cc @Hzfengsy
cc @sunggg @psrivas2 @mbaret @gigiblender @mikepapadim
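The ordering fix described above can be sketched with the standard-library `graphlib` module (Python 3.9+). This is only an illustration of emitting groups in dependency order, not the C++ change in this PR; the group names follow the A1/B2 example, and `emit_order` and the `group_deps` dictionary are hypothetical.

```python
from graphlib import TopologicalSorter


def emit_order(group_deps):
    """Return groups so that every group comes after the groups it depends on.

    `group_deps` maps a group name to the set of groups whose outputs it
    consumes (hypothetical encoding for this sketch).
    """
    return list(TopologicalSorter(group_deps).static_order())


# B2 appears first in the original binding order but consumes the tuple
# produced by A1, so A1 has to be emitted first.
group_deps = {"B2": {"A1"}, "A1": set()}
order = emit_order(group_deps)
assert order.index("A1") < order.index("B2")
```

`TopologicalSorter` raises `graphlib.CycleError` if the group dependencies contain a cycle, which is a cheap sanity check that the grouping itself is valid.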