RFC: affected-based incremental #8243

ahabhgk · 2024-10-28T11:11:26Z

Motivation

Faster rebuilds: reduce rebuild time from O(project) to O(changes), so that the cost of a rebuild scales with what changed rather than with the size of the project.

User Guide

We will introduce experiments.incremental so that users can opt in to incremental rebuilds. It will be stabilized in three phases:

1. Proof of concept: implement it for some stages.
2. More tests: enable it for ByteDance internal projects and stabilize it for some stages (👈 we are here).
3. Stabilize all stages and enable it by default.

Currently, since Rspack v1.0, incremental is enabled by default for the make and emitAssets stages. We will enable more stages by default as incremental becomes stable for them.

The detailed config is:

type Incremental = boolean | {
  make?: boolean;
  emitAssets?: boolean;
  inferAsyncModules?: boolean;
  providedExports?: boolean;
  dependenciesDiagnostics?: boolean;
  buildChunkGraph?: boolean;
  modulesHashes?: boolean;
  modulesCodegen?: boolean;
  modulesRuntimeRequirements?: boolean;
}

const config = {
  // ...
  // we will move it out from `experiments` once it is stable
  experiments: {
    // `true` -> enable for all stages
    // `false` -> disable for all stages
    // `undefined` -> default value; for now it only enables the make and emitAssets stages, and more stages will be enabled as they become stable
    incremental: true,
  }
}
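The three forms of the option can be read as a mapping from the config value to one flag per stage. The sketch below illustrates that reading; `resolveIncremental` and `STAGES` are hypothetical names, not part of Rspack's API, and treating a stage that is omitted from the object form as disabled is an assumption that this RFC does not specify.

```typescript
// Stage names copied from the `Incremental` type above.
const STAGES = [
  "make",
  "emitAssets",
  "inferAsyncModules",
  "providedExports",
  "dependenciesDiagnostics",
  "buildChunkGraph",
  "modulesHashes",
  "modulesCodegen",
  "modulesRuntimeRequirements",
] as const;

type Stage = (typeof STAGES)[number];
type IncrementalOption = boolean | Partial<Record<Stage, boolean>> | undefined;

// Hypothetical helper: expands the option into one boolean per stage.
function resolveIncremental(option: IncrementalOption): Record<Stage, boolean> {
  const resolved = {} as Record<Stage, boolean>;
  for (const stage of STAGES) {
    if (option === undefined) {
      // Default described above: only make and emitAssets are enabled.
      resolved[stage] = stage === "make" || stage === "emitAssets";
    } else if (typeof option === "boolean") {
      // `true` enables every stage, `false` disables every stage.
      resolved[stage] = option;
    } else {
      // Assumption: a stage omitted from the object form counts as disabled.
      resolved[stage] = option[stage] === true;
    }
  }
  return resolved;
}
```

For instance, `resolveIncremental({ make: true, providedExports: true })` would enable only those two stages under this assumption, while `resolveIncremental(undefined)` would enable only make and emitAssets.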

Detailed Design

Reads and Writes

Consider the bundling process as reading and writing data on the module graph and chunk graph, and distinguish between read tasks and write tasks. When a write task changes data, the read tasks that depend on that data need to be updated, and so do the read tasks that depend on those, until a task's input hits the cache and there is no need to keep bubbling up (cutoff). This is a very direct idea for implementing incremental builds. It requires abstracting the process into tasks: the process is no longer a set of plain functions, because the dependency relationships between tasks must be stored in a structure. The first run generates a task graph, and subsequent updates only affect dependent tasks.
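To make the idea concrete, here is a minimal sketch of such a task graph. It is not how Turbopack or Rspack implements it; `TaskGraph`, `add`, `runAll`, and `write` are hypothetical names, tasks are assumed to be added in dependency order, and results are compared with `===` to decide whether to keep bubbling up.

```typescript
type TaskId = string;
type Compute = (inputs: unknown[]) => unknown;

class TaskGraph {
  private tasks = new Map<TaskId, { deps: TaskId[]; compute: Compute }>();
  private results = new Map<TaskId, unknown>();
  log: TaskId[] = []; // ids of the tasks that actually executed

  // Assumption: tasks are added in dependency order (deps before dependents).
  add(id: TaskId, deps: TaskId[], compute: Compute): void {
    this.tasks.set(id, { deps, compute });
  }

  // Runs one task and reports whether its result differs from the stored one.
  private run(id: TaskId): boolean {
    const task = this.tasks.get(id)!;
    const next = task.compute(task.deps.map((dep) => this.results.get(dep)));
    const changed = !this.results.has(id) || this.results.get(id) !== next;
    this.results.set(id, next);
    this.log.push(id);
    return changed;
  }

  // First run: execute everything and build up the results.
  runAll(): void {
    this.tasks.forEach((_task, id) => {
      this.run(id);
    });
  }

  // A write: replace one task's computation, then re-run only the tasks
  // that read a result which actually changed.
  write(id: TaskId, compute: Compute): void {
    this.tasks.get(id)!.compute = compute;
    const changed = new Set<TaskId>();
    this.tasks.forEach((task, taskId) => {
      const mustRun = taskId === id || task.deps.some((dep) => changed.has(dep));
      if (mustRun && this.run(taskId)) changed.add(taskId);
    });
  }
}

const graph = new TaskGraph();
graph.add("source", [], () => "abc");
graph.add("trimmed", ["source"], ([s]) => (s as string).trim());
graph.add("length", ["trimmed"], ([t]) => (t as string).length);
graph.runAll();

graph.log = [];
graph.write("source", () => "  abc  ");
console.log(graph.log); // ["source", "trimmed"]
```

In this example the write changes `source`, so `trimmed` is re-run, but its result is unchanged, so `length` is never re-executed. That is the cutoff described above.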

Turbopack does this, which is why most of its functions need the #[turbo_tasks::function] macro, and why its overall architecture is pull-based (or query-based): the dependency relationships between tasks must be declared. The chunk rendering task and the module rendering tasks within the chunk have explicit dependency relationships, so a different rendering result for a module causes a re-render of the corresponding chunk. However, if a task is a global effect, turbo-tasks cannot rescue it; at most, the task can use the cache that turbo-tasks provides. Incremental builds therefore still need optimized algorithms that avoid global effects.

Rspack, however, does not plan to do this. Like webpack, Rspack is push-based (or pass-based): each stage advances one after another, and the chunk rendering task and the module rendering tasks within the chunk have no explicit dependency relationship. The module rendering results of an earlier stage are not visible to the chunk rendering in a later stage. Adopting the task-based approach would mean a significant architectural change and too many macros, and the incremental computation engine itself is extremely complex to build.

If Turbopack implements fully automatic incremental builds through the turbo-tasks incremental computation engine, our goal is a simpler, semi-automatic incremental build that is more aligned with the existing architecture, using the tasks in each hook as the granularity.

"As a bonus, here’s a post about incremental in Kotlin, which follows an interesting worse-as-better approach to the problem: https://blog.jetbrains.com/kotlin/2020/09/the-dark-secrets-of-fast-compilation-for-kotlin/"
--- https://ziggit.dev/t/how-zig-incremental-compilation-is-implemented-internally/3543

The Design

Distinguish between reading and writing data, record data writes as mutations, and, before executing the tasks in each hook, calculate the data affected by these mutations. Then re-execute tasks only for the affected data. The core of the implementation is identifying and updating the affected data at each stage. Accordingly, each stage's switch in the config controls whether to look for the affected data at that stage: if enabled, the stage searches for the affected data; if not, it uses all the data.
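A rough sketch of how one stage could consume the recorded mutations and honor its switch is shown below. All names (`Mutation`, `StageOptions`, `runStage`) are hypothetical and only illustrate the description above; the `findAffected` used in the example is a deliberately naive placeholder that returns just the rebuilt modules.

```typescript
type ModuleId = string;

// Mutation kinds named after the updates mentioned in this RFC.
type Mutation = {
  kind: "buildModule" | "setAsync" | "processModuleExports";
  module: ModuleId;
};

interface StageOptions {
  incremental: boolean; // the per-stage switch from the config
  allModules: ModuleId[];
  findAffected: (mutations: Mutation[]) => ModuleId[];
  task: (module: ModuleId) => Mutation; // does the work, returns the mutation to record
}

// If the switch is on, run the task only for the affected modules;
// otherwise fall back to running it for all modules.
function runStage(stage: StageOptions, mutations: Mutation[]): Mutation[] {
  const targets = stage.incremental
    ? stage.findAffected(mutations)
    : stage.allModules;
  return targets.map((module) => stage.task(module));
}

// Example stage: a stand-in for provided exports.
const providedExportsStage = (incremental: boolean): StageOptions => ({
  incremental,
  allModules: ["a", "b", "c", "d"],
  // Naive placeholder: only the rebuilt modules themselves are affected.
  findAffected: (mutations) =>
    mutations.filter((m) => m.kind === "buildModule").map((m) => m.module),
  task: (module) => ({ kind: "processModuleExports", module }),
});

// Mutations recorded by the make stage: only module c was rebuilt.
const fromMake: Mutation[] = [{ kind: "buildModule", module: "c" }];

console.log(runStage(providedExportsStage(true), fromMake).length); // 1
console.log(runStage(providedExportsStage(false), fromMake).length); // 4
```

The mutations returned by one stage would be appended to the log and become the input of the next stage, which is what connects the stages to each other.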

watcher
- Start the rebuild, identify changed files based on fileDependencies, contextDependencies, and missingDependencies.
make
- The changed files provided by the watcher are the affected data for the make stage. Re-add them to the queue as inputs to be updated.
- Record these updates (buildModule) as mutations.
infer async modules
- Based on the mutations recorded by the make stage (buildModule), calculate the modules that need to be updated.
- Record these updates (setAsync) as mutations.
- For example, given a -> b -> c -> d, if we add a top-level await in c, the affected modules that need to be updated are a, b, and c. Only d is unaffected, because adding a top-level await changes the generated code of every module on the path from that module up to its root module.
provided exports
- Based on the mutations recorded by the make stage (buildModule), calculate the modules that need to be updated.
- Record these updates (processModuleExports) as mutations.
- For example, given a -> b -> c -> d, if we change the exports of module d, the affected modules that need to be updated are c and d; if module c has a re-export, then b is affected as well.
module codegen
- Based on the mutations recorded by the earlier stages (buildModule, setAsync, and processModuleExports), calculate the modules that need to be updated.
- Record these updates (moduleCodegen) as mutations.
- For example, given a -> b -> c -> d -> e -> f, if we add a top-level await in c and change the exports of module f, we need to update the codegen results of modules a, b, c, e, and f: a, b, and c because of the mutations recorded by infer async modules, and e and f because of the mutations recorded by provided exports.
We omit the subsequent stages...
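The three examples above can be reproduced with a small sketch. This is an illustration under simplifying assumptions, not Rspack's implementation: the module graph is reduced to a map of direct importers plus a set of re-exporting modules, and `chain`, `affectedByAsync`, `affectedByExports`, and `affectedForCodegen` are hypothetical names.

```typescript
type ModuleId = string;

interface ModuleGraph {
  importers: Map<ModuleId, ModuleId[]>; // importers.get(m): modules importing m directly
  reexporting: Set<ModuleId>; // modules that re-export from a dependency
}

// Builds a linear graph such as a -> b -> c -> d.
function chain(ids: ModuleId[], reexporting: ModuleId[] = []): ModuleGraph {
  const importers = new Map<ModuleId, ModuleId[]>();
  ids.forEach((id, i) => importers.set(id, i === 0 ? [] : [ids[i - 1]]));
  return { importers, reexporting: new Set(reexporting) };
}

// infer async modules: the changed module and every module up to its roots.
function affectedByAsync(graph: ModuleGraph, changed: ModuleId[]): Set<ModuleId> {
  const affected = new Set<ModuleId>();
  const queue = changed.slice();
  while (queue.length > 0) {
    const current = queue.pop()!;
    if (affected.has(current)) continue;
    affected.add(current);
    for (const importer of graph.importers.get(current) ?? []) queue.push(importer);
  }
  return affected;
}

// provided exports: the changed module and its direct importers, continuing
// upward only through importers that re-export.
function affectedByExports(graph: ModuleGraph, changed: ModuleId[]): Set<ModuleId> {
  const affected = new Set<ModuleId>(changed);
  const queue = changed.slice();
  while (queue.length > 0) {
    const current = queue.pop()!;
    for (const importer of graph.importers.get(current) ?? []) {
      if (affected.has(importer)) continue;
      affected.add(importer);
      if (graph.reexporting.has(importer)) queue.push(importer);
    }
  }
  return affected;
}

// module codegen: union of what the earlier stages touched.
function affectedForCodegen(
  graph: ModuleGraph,
  built: ModuleId[],
  asyncChanged: ModuleId[],
  exportsChanged: ModuleId[],
): Set<ModuleId> {
  const affected = new Set<ModuleId>(built);
  affectedByAsync(graph, asyncChanged).forEach((m) => affected.add(m));
  affectedByExports(graph, exportsChanged).forEach((m) => affected.add(m));
  return affected;
}

const show = (set: Set<ModuleId>): string => Array.from(set).sort().join(",");

console.log(show(affectedByAsync(chain(["a", "b", "c", "d"]), ["c"]))); // a,b,c
console.log(show(affectedByExports(chain(["a", "b", "c", "d"]), ["d"]))); // c,d
console.log(show(affectedByExports(chain(["a", "b", "c", "d"], ["c"]), ["d"]))); // b,c,d
const six = chain(["a", "b", "c", "d", "e", "f"]);
console.log(show(affectedForCodegen(six, ["c", "f"], ["c"], ["f"]))); // a,b,c,e,f
```

Each stage has its own rule for how far a change propagates, which is why the affected set has to be computed per stage rather than once for the whole build.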

In the examples above, most of the modules need to be updated, but in real projects only a small number of modules do. This significantly reduces the computation during a rebuild and improves rebuild performance.

As you can see, affected-based incremental is not tied to the cache at all: it is only about finding the affected data and updating it, so it can be used with or without a cache.

Summary

If you have read "Build Systems à la Carte: Theory and Practice", you will notice that Rspack previously achieved 'minimality' through the cache, but did not achieve 'early cutoff'. The lack of early cutoff made the minimality insufficient, which is a common problem in many existing bundlers and one of the reasons why bundler rebuilds are slow. Affected-based incremental collects the mutations from each stage and connects the stages. Later stages therefore know what earlier stages did, which enables early cutoff of unrelated tasks in subsequent stages and makes Rspack rebuilds faster.

ahabhgk · 2024-10-28T11:12:18Z

Tracking issue: #8106
