Pre computed file facts as options #246

tarun-dugar · 2024-09-10T16:30:27Z

Motivation

Make oxc-resolver work faster by using pre-computed facts to avoid reading the file system.

Background

We have two types of pre-computed facts in a cache:

Package facts:

"package_name": {
  "path": String,
  "main_fields_and_values": {
    "main": String,
    ...
  },
  "export_fields_and_values": {
    "exports": String // JSON string
  },
}

List of files: HashSet
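
For readers who prefer types, the two kinds of facts above could be modelled roughly as follows. This is only a sketch: the names `PackageFacts`, `Facts` and `main_field` are hypothetical, and `PathBuf` as the element type of the file set is an assumption, since the description does not state it.

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Facts about one package, mirroring the shape described above.
/// (Hypothetical type; field types are assumptions.)
#[derive(Debug, Clone, Default)]
pub struct PackageFacts {
    /// Where the package lives (assumed to be its root directory).
    pub path: PathBuf,
    /// Main field name -> value, e.g. "main" -> "./lib/index.js".
    pub main_fields_and_values: HashMap<String, String>,
    /// Export field name -> raw JSON string of that field.
    pub export_fields_and_values: HashMap<String, String>,
}

/// Everything handed to the resolver up front.
#[derive(Debug, Clone, Default)]
pub struct Facts {
    /// Keyed by package name.
    pub packages: HashMap<String, PackageFacts>,
    /// Every file known to exist (element type is an assumption).
    pub files: HashSet<PathBuf>,
}

impl Facts {
    /// Look up a main field for a package without touching the file system.
    pub fn main_field(&self, package: &str, field: &str) -> Option<&str> {
        self.packages
            .get(package)?
            .main_fields_and_values
            .get(field)
            .map(String::as_str)
    }
}
```

A lookup such as `facts.main_field("foo", "main")` returns `None` when either the package or the field is missing, so a caller can fall back to reading package.json in that case.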

We calculate these facts by using oxc-parser and store them in a remote cache for our repo (which has ~200k files and growing). Anytime someone creates a branch and changes some files, we only recompute the facts for the changed files and reuse the other facts. Because of this process, computing facts is very cheap once we have a cache in place.

Using these facts, we speed up the resolution process by:

Avoiding read_to_string for package.json files by using values from package facts
Speeding up the is_file function by checking membership in the list of files
Avoiding is_dir calls
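
As a rough illustration of the is_file shortcut, here is a minimal sketch of a lookup layer that answers from the pre-computed file list when one is supplied and only asks the file system otherwise. `FileLookup`, `with_facts` and `from_disk` are hypothetical names for this example and are not part of oxc-resolver's API.

```rust
use std::collections::HashSet;
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical lookup layer: uses the pre-computed file list when present,
/// otherwise falls back to the real file system.
pub struct FileLookup {
    known_files: Option<HashSet<PathBuf>>,
}

impl FileLookup {
    /// Answer `is_file` purely from the supplied facts.
    pub fn with_facts(files: HashSet<PathBuf>) -> Self {
        Self { known_files: Some(files) }
    }

    /// No facts available: every query hits the disk.
    pub fn from_disk() -> Self {
        Self { known_files: None }
    }

    pub fn is_file(&self, path: &Path) -> bool {
        match &self.known_files {
            // A set membership test replaces the metadata syscall.
            Some(files) => files.contains(path),
            None => fs::metadata(path).map(|m| m.is_file()).unwrap_or(false),
        }
    }
}
```

The trade-off in this sketch is that a supplied file list is treated as the whole truth: a file missing from the facts is reported as absent even if it exists on disk.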

Benchmarks

Resolution for 430k .resolve calls for relative imports and package imports (packages that exist in package facts):

Using facts: ~2s
Without using facts: ~8s

Boshen · 2024-09-11T01:27:29Z

More context: #236 (comment)

Pasting here for quick read.

@Boshen, for reference, this is an implementation of the same Facts pattern used in Hacklang, used (in a more complex form) in turborepo, and behind how Haste operates at Meta.

The overall structure is as follows:

You get a (Path, SHA) index of the repo from an existing updated-on-write source (git, watchman)
You parse each file in raw form (e.g. imports stay as "../../foo.ts" and are explicitly not resolved) to get just the important bits of info
You re-use the previous cache for unchanged files (cached by Path|Sha keys), and parse any newly seen files
You write out the cache
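
The steps above can be sketched as a single function. This is a minimal illustration under assumptions, not the actual implementation: `Key`, `update_cache` and the `parse` callback are hypothetical names, and the per-file facts type `F` is left generic because the comment does not specify what is extracted.

```rust
use std::collections::HashMap;

/// Cache key: (path, content SHA), as in the index described above.
pub type Key = (String, String);

/// Build the cache for the current index, re-using entries from `previous`
/// whose (path, sha) key is unchanged and parsing everything else.
/// Returns the new cache and how many files had to be parsed.
pub fn update_cache<F, P>(
    previous: &HashMap<Key, F>,
    index: &[Key],
    mut parse: P,
) -> (HashMap<Key, F>, usize)
where
    F: Clone,
    P: FnMut(&str) -> F,
{
    let mut next = HashMap::with_capacity(index.len());
    let mut parsed = 0;
    for key in index {
        let facts = match previous.get(key) {
            // Same path and same SHA: the old facts are still valid.
            Some(cached) => cached.clone(),
            // New or modified file: extract its facts again.
            None => {
                parsed += 1;
                parse(&key.0)
            }
        };
        next.insert(key.clone(), facts);
    }
    (next, parsed)
}
```

Because the new cache is built only from keys present in the current index, entries for deleted files or stale SHAs drop out without any explicit invalidation step.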

Depending on what info you extract, this leaves us with ~50MB of serialisable state which can answer all the cross-file questions we have on our large repo.

Locally, devs rotate this cache (cache -> update -> cache). We also have our CI output it (just a dumb zstd S3 store of it, which packs to ~15MB), which allows our devs to start from their mergebase-with-master's cache when they switch branches or take large jumps across many commits (as described in watchman's SCMQuery docs).
The same state can be updated in a delta form in a watch mode / from a daemon, as we find that just reading the index is the slowest part.

For some numbers, with oxc_parser we can do this in 8 seconds (without a cache), 2 seconds (with a cache) and 200ms for a delta update.

Downstream steps (resolution, linting etc) can read that cache and do the work they need, already knowing the most important info for them, which is why we're now limited by resolution (which we do at run-time for each downstream step, because it is not cacheable).

We use (/ plan to use) this for:

Understanding the module graph
Quickly identifying code cycles
Extracting feature flags used in files
Extracting translatable messages
Extracting atomic CSS styles and building a single stylesheet
Assertions on uniqueness of certain things (event names, error message keys etc)

Boshen · 2024-09-11T02:43:33Z

@tarun-dugar

Question: I assume you are using oxc-resolver as a crate?

I understand the code and requirements. The next step is to decide whether to expose these as a trait / plugin API or behind a feature flag.

We'll make the decision together with Tom once we understand the broader picture.

tarun-dugar · 2024-09-11T04:39:47Z

@Boshen yes, we are using it as a crate. I would be keen to follow the conversation you folks have, if it's possible. FWIW there are even more optimisations possible using this approach:

After eliminating fs, a big chunk of time is spent in hashing when creating cached values. We already have a (sha, path) tuple and we could potentially reuse the sha as the hash instead of FxHasher.
We can compute a list of valid directories by traversing the repo and pass it as an option. The is_dir checks can use this information to skip fs again.
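
To make the directory idea concrete, here is one possible sketch. It assumes the directory set is derived from the existing file list rather than from a separate traversal of the repo, which is a simplification (directories containing no known files are missed). `directories_of` and `is_dir` are hypothetical helpers, not oxc-resolver functions.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Collect every ancestor directory of the known files.
/// Assumption: a directory only matters if it contains a known file.
pub fn directories_of(files: &HashSet<PathBuf>) -> HashSet<PathBuf> {
    let mut dirs = HashSet::new();
    for file in files {
        // `ancestors()` yields the path itself first, so skip it.
        for dir in file.ancestors().skip(1) {
            // Relative paths end in an empty ancestor; ignore it.
            if dir.as_os_str().is_empty() {
                break;
            }
            // If this directory was already recorded, so were its parents.
            if !dirs.insert(dir.to_path_buf()) {
                break;
            }
        }
    }
    dirs
}

/// Facts-backed replacement for an `is_dir` file system call.
pub fn is_dir(dirs: &HashSet<PathBuf>, path: &Path) -> bool {
    dirs.contains(path)
}
```

The early `break` keeps the work close to linear in the number of distinct directories, since shared parent directories are only walked once.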

For cases where we don't want to resolve external dependencies, the whole thing can work without even cloning the repo.

facts as options

5e81ded

tarun-dugar mentioned this pull request Sep 10, 2024

[Question] Supporting custom facts in oxc-resolver to make it work without needing fs #236

Open

Boshen self-requested a review September 10, 2024 16:51

Boshen self-assigned this Sep 10, 2024

Boshen marked this pull request as draft September 10, 2024 16:51
