From cddbbf94749f0b49241aa566776ec13e3d67b3f1 Mon Sep 17 00:00:00 2001
From: Peter Bell
Date: Tue, 1 Feb 2022 17:11:02 +0000
Subject: [PATCH 1/2] RFC-0025 Improving incremental builds

---
 RFC-0025-improving-incremental-builds.md | 217 +++++++++++++++++++++++
 1 file changed, 217 insertions(+)
 create mode 100644 RFC-0025-improving-incremental-builds.md

diff --git a/RFC-0025-improving-incremental-builds.md b/RFC-0025-improving-incremental-builds.md
new file mode 100644
index 00000000..f5d59464
--- /dev/null
+++ b/RFC-0025-improving-incremental-builds.md
@@ -0,0 +1,217 @@

# Improving incremental builds
**Authors:**
@peterbell10

## **Summary**

This RFC proposes changes to ATen that will allow more granular header
dependencies, along with tooling to enforce their use across the
codebase. This should greatly improve incremental and cached build
performance.

## **Motivation**

In ATen, certain generated header files include the definitions of
every operator in PyTorch. Consequently, any new operator, or update
to any of ATen's over 1000 existing operators in
`native_functions.yaml`, requires a full rebuild of ATen and torch.
These problematic headers include `Tensor.h`, `Functions.h`, and
`NativeFunctions.h`, all of which are widely used and currently
unavoidable.

For example, adding a new operator that isn't referenced anywhere
results in 1000+ files being rebuilt, costing many hours of
(single-threaded) CPU time. After the proposed changes, you can expect
at most a small handful of files to be rebuilt, most of which should
contain code generated specifically for that new operator.

## **Proposed Implementation**

This RFC proposes two changes that break up these header dependencies
and avoid unnecessary header dependencies in ATen:

1. Add a base class to `Tensor` called `TensorBase` which has no
   method operators but is otherwise fully interoperable. This is
   particularly useful for kernel code, which often doesn't call any
   operators but still needs to pass `Tensor`s around.
2. Split the code-generated header files up on a per-operator basis,
   so only the necessary operators need to be included. These headers
   still depend on `Tensor.h`, and so still depend on method
   operators.

Additionally, two macros are introduced which can be defined to
enforce strict adoption of these more granular headers:

- `TORCH_ASSERT_NO_OPERATORS`, when defined, turns inclusion of any
  code generated from the operator list into a compiler error. In
  practice this limits you to using only `TensorBase` and
  `TensorIterator`.
- `TORCH_ASSERT_ONLY_METHOD_OPERATORS` allows `Tensor.h` to be
  included, but not the other umbrella headers. Instead, specific
  operators must be included directly as they are used.

These will be implemented by inserting preprocessor guards into the
relevant header files; for example, in `ATen/core/TensorBody.h`:
```cpp
#ifdef TORCH_ASSERT_NO_OPERATORS
#error This change adds a dependency on native_functions.yaml, \
  meaning the file will need to be re-compiled every time an operator \
  is changed or added. Consider if your change would be better placed in \
  another file, or if a more specific header might achieve the same goal. \
  See NOTE: [Tensor vs. TensorBase]
#endif
```

This helps first with detecting problematic includes, even when they
are buried in layers of transitive includes, and then with maintaining
header hygiene into the future, since developers get immediate
feedback whenever an undesirable header is added.
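The umbrella headers would carry analogous guards. As a rough sketch
of what such a guard could look like in a header such as
`ATen/Functions.h` (the condition and error text here are
illustrative, not verbatim; a real guard would also need to account
for builds without per-operator headers, discussed below):

```cpp
// Illustrative sketch of a guard for an umbrella header such as
// ATen/Functions.h; not the verbatim implementation.
#if defined(TORCH_ASSERT_NO_OPERATORS) || \
    defined(TORCH_ASSERT_ONLY_METHOD_OPERATORS)
#error This header includes code generated for every operator in \
  native_functions.yaml. Include the specific ATen/ops headers for the \
  operators you use instead.
#endif
```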
### TensorBase
`Tensor` serves two related but separate functions. It is both a
user-friendly handle to the underlying `TensorImpl` data structure,
and an interface for calling operators with the `variant: method`
property. `TensorBase` acts only as a handle, without any of the
code-generated methods, and lives in its own header
(`ATen/core/TensorBase.h`).

`Tensor` inherits publicly from `TensorBase`, so functions taking
`TensorBase&` arguments can be called with a `Tensor` without touching
the reference count. Similarly, an implicit move constructor
`Tensor(TensorBase &&)` allows a `Tensor` to be created from a
`TensorBase` so long as there is no reference count change. This way
the two types cleanly inter-operate without silently degrading
performance through reference count increments.

The primary use case for `TensorBase` is kernel code, which is often
the most expensive to compile yet rarely references operators. Many
kernels simply hand tensors off to `TensorIterator` or call
`data_ptr<>` and access the data directly. To enable this use case,
`TensorIterator.h` and the `cpu/Loops.h` and `cuda/Loops.cuh` headers
must be kept clean of any `Tensor` includes (forward declarations are
fine though).


### Per-Operator Headers

The contents of the code-generated "umbrella" headers will be split
into one header per operator, leaving the umbrella headers as nothing
more than a list of includes. The actual generated code will not be
changed, only moved into new headers in the `ATen/ops` folder.
`at::add`, for example, will generate 8 new headers:

```cpp
#include <ATen/ops/add.h>         // Split from Functions.h
#include <ATen/ops/add_meta.h>    // Split from NativeMetaFunctions.h
#include <ATen/ops/add_native.h>  // Split from NativeFunctions.h
#include <ATen/ops/add_ops.h>     // Split from Operators.h

// Plus static dispatch functions split from {DispatchKey}Functions.h
#include <ATen/ops/add_cpu_dispatch.h>
#include <ATen/ops/add_cuda_dispatch.h>
#include <ATen/ops/add_compositeexplicitautograd_dispatch.h>
#include <ATen/ops/add_compositeimplicitautograd_dispatch.h>
```

Note that `add` has 4 overloads, all of which are included in the same
header files. This way the caller doesn't need to know which overload
is relevant to their use case, only the base operator name.

Unfortunately, per-operator headers aren't compatible with Meta's
internal `buck` builds, because the codegen output list depends
dynamically on the contents of an input file (namely
`native_functions.yaml`). These headers can therefore only be used
when `AT_PER_OPERATOR_HEADERS` is defined, which is the default for
CMake and Bazel builds, and they must be included behind an `#ifdef`
guard, e.g.

```cpp
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#else
#include <ATen/ops/add.h>
#include <ATen/ops/sub.h>
#endif
```

### Enforcing adoption
These tools can only improve incremental builds if they are used
widely throughout the codebase and continue to be enforced as the code
evolves. So, adoption will be enforced through the
`TORCH_ASSERT_NO_OPERATORS` and `TORCH_ASSERT_ONLY_METHOD_OPERATORS`
macros.

First, `ATen` and `torch` will be migrated to the per-operator
headers, defining `TORCH_ASSERT_ONLY_METHOD_OPERATORS` in each file as
it is updated. Then, once most of the codebase is compliant, the build
system will be changed to define `TORCH_ASSERT_ONLY_METHOD_OPERATORS`
by default in all new files. On the rare occasions where umbrella
headers are strictly necessary, files can opt out by explicitly
`#undef`ing the macro, making the exception `grep`-able and clearly
visible in code review. One such example is
`autograd/generated/Functions.cpp`, which includes verbatim code from
`derivatives.yaml` and so doesn't know which operators are being
called. This should be extremely rare though.
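To make the migration concrete, the following is a minimal sketch of a
compliant source file; the function and operator choices are
illustrative, not actual PyTorch source:

```cpp
// Hypothetical migrated source file; the operator choice is illustrative.
// The macro must be defined before any ATen headers are included.
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>

#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#else
#include <ATen/ops/add.h>
#endif

namespace example {

// Uses the at::add free function, so only ATen/ops/add.h is needed.
at::Tensor add_one(const at::Tensor& self) {
  return at::add(self, 1);
}

} // namespace example
```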
The initial migration can be assisted by
[`include-what-you-use`](https://include-what-you-use.org/), a
clang-based tool that finds the necessary headers based on which
symbols are actually used in a file. For most cpp files, it is as
simple as copying the `ATen/ops` headers from the tool's output.

`TORCH_ASSERT_NO_OPERATORS` cannot be enforced as strictly, because
compliance often requires significantly restructuring the code.
However, where the cost-benefit is favorable, this macro should be
defined manually in the source file. Candidates include files that
already use `TensorIterator` exclusively and just need their includes
pruned, as well as more expensive-to-compile files where taking the
time to restructure them is worthwhile.

## **Metrics**

Some key metrics that should be improved by these changes:
- Time to rebuild after adding or editing a functional operator
- Time to rebuild after adding or editing a method operator
- `sccache` miss rate in open source CI

## **Drawbacks**
There is no denying that it's more convenient to just include
`<ATen/ATen.h>` at the top of the file and not have to think about it.
This will require additional effort on the part of developers to
maintain the list of operator includes.

## **Alternatives**
[pytorch/pytorch#61915](https://github.com/pytorch/pytorch/pull/61915)
proposed generating new operators into separate headers called
`*_Delta.h` (e.g. `Functions_Delta.h`) during local development. This
improves incremental builds locally with comparatively fewer changes
to the ATen codebase. However:

1. It only helps when adding operators, not when changing existing ones
2. It does nothing for clean builds using a cache (like in CI)
3. Incremental builds no longer reproduce a from-scratch build

[pytorch/pytorch#62258](https://github.com/pytorch/pytorch/issues/62258)
proposed removing operator header dependencies from CUDA kernels
specifically. This is aligned with the aims of `TensorBase`, but
leaves out several hundred non-kernel source files. Individually these
files take around 10 seconds each to compile, but together they
account for 2-3 hours of (serial) CPU time.


## **Prior Art**
Not aware of any relevant prior art.

## **How we teach this**
This change doesn't affect end-users.

For PyTorch developers, documentation should be written on how to
write code in compliance with the two enforcement macros, and linked
from the corresponding error messages.

## **Unresolved questions**
- Is it possible/desirable to enforce `TORCH_ASSERT_NO_OPERATORS`
  automatically? e.g. for all `.cu` files, or all files over a certain
  compile time. (See the kernel sketch after this list.)
- Can `include-what-you-use` completely automate operator includes?
  Tools exist for strictly managing _all_ includes, but that would be
  a significant change from the existing include style.
- Is it worth adopting `TensorBase` more widely than just kernels?
  e.g. if the operators themselves used `TensorBase`, then one would
  no longer have to include hundreds of method operators.
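As a reference point for the first question, here is a minimal sketch
of what a `TORCH_ASSERT_NO_OPERATORS`-compliant CPU kernel file could
look like. The kernel and its name are hypothetical, and a real kernel
would typically dispatch over dtypes with an `AT_DISPATCH_*` macro
rather than assume `double`:

```cpp
// Hypothetical TORCH_ASSERT_NO_OPERATORS-compliant kernel file.
#define TORCH_ASSERT_NO_OPERATORS
#include <ATen/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>

namespace at { namespace native {

// Scales every element by a constant using only TensorIterator
// machinery; no operator headers are required. Assumes the iterator
// was configured with dtype double.
void scale_kernel(TensorIteratorBase& iter, double scale) {
  cpu_kernel(iter, [scale](double a) -> double { return a * scale; });
}

}} // namespace at::native
```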
From 4d13fef3fd94ff891af58ddc7e1229967556703d Mon Sep 17 00:00:00 2001
From: Peter Bell
Date: Tue, 8 Feb 2022 22:49:59 +0000
Subject: [PATCH 2/2] Add incremental build measurements and more on documentation

---
 RFC-0025-improving-incremental-builds.md | 32 +++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/RFC-0025-improving-incremental-builds.md b/RFC-0025-improving-incremental-builds.md
index f5d59464..4c3924be 100644
--- a/RFC-0025-improving-incremental-builds.md
+++ b/RFC-0025-improving-incremental-builds.md
@@ -171,12 +171,31 @@ Some key metrics that should be improved by these changes:
 - Time to rebuild after adding or editing a method operator
 - `sccache` miss rate in open source CI
 
+For example, we can compare against a branch with the
+`ATen/native/cpu` and `ATen/native/cuda` folders migrated to
+per-operator headers, and some key files edited to use `TensorBase`.
+Despite only partial migration, the table below shows a ~2x speedup
+in incremental builds for functional operators and ~1.4x for method
+operators.
+
+| Variant  | Jobs (before) | Jobs (after) | Time (before) | Time (after) |
+|----------|:-------------:|:------------:|:-------------:|:-------------:|
+| Function | 1403          | 775          | 10 m 00 s     | 4 m 50 s      |
+| Method   | 1403          | 1184         | 10 m 00 s     | 7 m 15 s      |
+
+We can expect even further improvements once all of ATen and torch
+have adopted per-operator headers.
+
 ## **Drawbacks**
 There is no denying that it's more convenient to just include
 `<ATen/ATen.h>` at the top of the file and not have to think about it.
 This will require additional effort on the part of developers to
 maintain the list of operator includes.
 
+C++ users who include internal header files without including
+`<ATen/ATen.h>` may also find their compilation failing as unnecessary
+includes are trimmed from internal headers.
+
 ## **Alternatives**
 [pytorch/pytorch#61915](https://github.com/pytorch/pytorch/pull/61915)
 proposed generating new operators into separate headers called
@@ -202,9 +221,16 @@ Not aware of any relevant prior art.
 
 ## **How we teach this**
 This change doesn't affect end-users.
 
-For PyTorch developers, documentation should be written on how to
-write code in compliance with the two enforcement macros, and linked
-from the corresponding error messages.
+For PyTorch developers, a developer wiki entry should outline the
+motivation of this RFC and describe how to write code in compliance
+with the two enforcement macros:
+- How to include the new per-operator headers
+- Using the include-what-you-use tool
+- Structuring headers to minimize unnecessary dependencies
+- Modifying kernels to be `TORCH_ASSERT_NO_OPERATORS` compliant
+
+This wiki entry should also be linked from the enforcement macro error
+messages to increase visibility.
 
 ## **Unresolved questions**
 - Is it possible/desirable to enforce `TORCH_ASSERT_NO_OPERATORS`