Parallelize bytecode compilation ✨ #13247

Open · wants to merge 1 commit into main

Conversation

ichard26 (Member)

Towards #12712.

Note

I would appreciate additional testing of this feature to ensure it's decently robust.

To test this patch locally, you can install from my branch using this command:

 pip install https://github.com/ichard26/pip/archive/perf/parallel-compile.zip

If you're curious whether parallelization is being used, pass -vv and look for these logs:

Installing collected packages: pluggy, packaging, iniconfig, pytest
Detected CPU count: 16
Configured worker count: auto
Bytecode will be compiled using at most 8 workers

Bytecode compilation is slow. It's often one of the biggest contributors to the install step's sluggishness. For better or worse, we can't really enable --no-compile by default as it has the potential to render certain workflows permanently slower in a subtle way.[1]

To improve the situation, bytecode compilation can be parallelized across a pool of processes (or sub-interpreters on Python 3.14). I've observed a 1.1x to 3x improvement in install step times locally.[2]

This patch has been written to be relatively comprehensible, but for posterity, these are the high-level implementation notes:

  • We can't use compileall.compile_dir() because it spins up a new worker pool on every invocation. If it's used as a "drop-in" replacement for compileall.compile_file(), then the pool creation overhead will be paid for every package installed. This is bad and kills most of the gains. Redesigning the installation logic to compile everything at the end was rejected for being too invasive (a key goal was to avoid affecting the package installation order).

  • A bytecode compiler is created right before package installation starts and reused for all packages. Depending on platform and workload, either a serial (in-process) compiler or a parallel compiler will be used. They both have the same interface, accepting a batch of Python filepaths to compile (a rough sketch of such an interface is shown after this list).

  • This patch was designed to be as low-risk as reasonably possible. pip does not currently contain any parallelized code, so introducing any sort of parallelization poses a nontrivial risk. To minimize this risk, the only code parallelized is the bytecode compilation code itself (~10 LOC). In addition, the package install order is unaffected and pip will fall back to serial compilation if parallelization is unsupported.
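
For illustration, here is a minimal sketch of what such a shared interface could look like. The class and function names are hypothetical, not necessarily those used in the patch:

```python
import compileall
from concurrent import futures
from typing import Iterable, Protocol


class BytecodeCompiler(Protocol):
    def __call__(self, paths: Iterable[str]) -> None: ...


def _compile_one(path: str) -> bool:
    # Module-level so it can be pickled and sent to worker processes.
    return compileall.compile_file(path, force=True, quiet=True)


class SerialCompiler:
    """Compile a batch of files in-process, one at a time."""

    def __call__(self, paths: Iterable[str]) -> None:
        for path in paths:
            _compile_one(path)


class ParallelCompiler:
    """Compile a batch of files using a long-lived pool of workers."""

    def __init__(self, workers: int) -> None:
        # Created once before installation starts and reused for every package,
        # so the pool startup cost is paid only once.
        self._pool = futures.ProcessPoolExecutor(max_workers=workers)

    def __call__(self, paths: Iterable[str]) -> None:
        list(self._pool.map(_compile_one, paths))
```

The point is only that the installation code can call either object the same way; the actual patch keeps the parallelized portion down to a handful of lines.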

Important

I've tried my best to document my code and design this patch in such a way to keep it reviewable and maintainable. However, this is a large PR nonetheless, so I expect there are parts which are confusing or poorly designed. If you have concerns or questions, please speak up!

The criteria for parallelization are:

  1. There are at least 2 CPUs available. The process CPU count is used if available, otherwise the system CPU count. If there is only one CPU, serial compilation will always be used because even a parallel compiler with one worker will add extra overhead.

  2. The maximum number of workers is at least 2. This is controlled by the --install-jobs option.[3] It defaults to "auto" which uses the process/system CPU count.[4]

  3. There is "enough" code for parallelization to be "worth it". This criterion exists so pip won't waste (say) 100ms on spinning up a parallel compiler when compiling serially would only take 20ms.[5] The limit is set to 1 MB of Python code. This is admittedly rather crude, but it seems to work well enough, having been tested on a variety of systems. (A rough sketch of how these criteria combine is shown after this list.)
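
As a rough, illustrative sketch of how these three criteria could combine (the constants and helper names below are assumptions for the example, not pip's actual code):

```python
from __future__ import annotations

import os

MAX_WORKERS = 8                   # hard-coded ceiling (see footnote 4)
CODE_SIZE_THRESHOLD = 1_000_000   # ~1 MB of Python source (criterion 3)


def effective_cpu_count() -> int:
    # Prefer the CPUs available to this process, falling back to the system count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


def choose_worker_count(install_jobs: int | str, code_size: int) -> int:
    """Return 1 for serial compilation, otherwise the parallel worker count."""
    cpus = effective_cpu_count()
    if cpus < 2:                          # criterion 1
        return 1
    requested = cpus if install_jobs == "auto" else int(install_jobs)
    workers = min(requested, cpus, MAX_WORKERS)
    if workers < 2:                       # criterion 2
        return 1
    if code_size < CODE_SIZE_THRESHOLD:   # criterion 3
        return 1
    return workers
```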

Footnotes

  1. Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached. This is quite subtle, enough so that we can't really expect newbies to recognise and know how to address this (there is the PYTHONPYCACHEPREFIX envvar, but if you're advanced enough to use it, then you are also advanced enough to know when to use uv or pip's --no-compile).

  2. The 1.1x was on a painfully slow dual-core/HDD-equipped Windows install that was installing just setuptools. The 3x was observed on my main 8-core Ryzen 5800HS Windows machine while installing pip's own test dependencies.

  3. Yes, this is probably not the best name, but adding an option for just bytecode compilation seems silly. Anyway, this will give us room if we ever parallelize more parts of the install step.

  4. Up to a hard-coded limit of 8 to avoid resource exhaustion. This number was chosen arbitrarily, but is definitely high enough to net a major improvement.

  5. This is important because I don't want to slow down tiny installs (e.g., pip install six ... or our own test suite). Creating a new process is prohibitively expensive on Windows (and to a lesser degree on macOS) for various reasons, so parallelization can't simply be used all of the time.

@ichard26 ichard26 added the "type: performance" (Commands take too long to run) label on Feb 27, 2025
@ichard26 ichard26 (Member Author) left a comment

Here are some specific review notes. @pfmoore if you have time, I'd especially appreciate a review from you. I know this is a 400+ LOC change, but a good portion of it is the unit tests and comments.

success = compileall.compile_file(path, force=True, quiet=True)
if success:
pyc_path = pyc_output_path(path)
assert os.path.exists(pyc_path)

I've removed this assert as part of this PR as I don't think we've ever received a bug report about the .pyc file not existing. If it were to be missing, that sounds like a stdlib bug. I'd be happy to add it back if desired, though.

Comment on lines +63 to +66
pyc_path = importlib.util.cache_from_source(py_path) # type: ignore[arg-type]
return CompileResult(
str(py_path), pyc_path, success, stdout.getvalue() # type: ignore[arg-type]
)

These type ignores are due to incorrect type annotations in typeshed. I've filed PRs to fix them, but they haven't been incorporated in a mypy release yet.

Comment on lines +56 to +58
# compile_file() returns True silently even if the source file is nonexistent.
if not os.path.exists(py_path):
raise FileNotFoundError(f"Python file '{py_path!s}' does not exist")

This is a new check. While writing unit tests for this function, I discovered that compileall.compile_file() silently returns True even if the source Python file is nonexistent. This makes sense when you consider the compileall CLI, but it's not ideal for us.

I'm hoping that there isn't a pre-existing bug where pip attempts to compile a nonexistent file...
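
For reference, the behaviour described above can be reproduced in isolation with just the stdlib:

```python
import compileall

# As described above: compile_file() reports success even though the
# source file does not exist (nothing is actually compiled).
ok = compileall.compile_file("this_file_does_not_exist.py", quiet=True)
print(ok)  # True
```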

"""Compile a set of Python modules using a pool of workers."""

def __init__(self, workers: int) -> None:
from concurrent import futures

This import is done here so I can catch any import errors and fall back to serial compilation.
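
As an illustrative sketch of this import-and-fall-back pattern (the helper below is a simplified stand-in, not pip's actual implementation):

```python
from functools import partial
from typing import Callable, Iterable

import compileall

_compile = partial(compileall.compile_file, force=True, quiet=True)


def make_compiler(workers: int) -> Callable[[Iterable[str]], None]:
    def serial(paths: Iterable[str]) -> None:
        for path in paths:
            _compile(path)

    try:
        # Imported lazily so a broken multiprocessing setup (e.g. a platform
        # without a working sem_open) is caught here instead of crashing pip.
        from concurrent import futures
        import multiprocessing.synchronize  # noqa: F401
    except ImportError:
        return serial

    if workers < 2:
        return serial

    pool = futures.ProcessPoolExecutor(max_workers=workers)

    def parallel(paths: Iterable[str]) -> None:
        # Consume the iterator so any worker exceptions propagate here.
        list(pool.map(_compile, paths))

    return parallel
```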

Comment on lines +158 to +161
# The concurrent executors will spin up new workers on an "on-demand" basis,
# which helps to avoid wasting time on starting new workers that won't be
# used. (** This isn't true for the fork start method, but forking is
# fast enough that it doesn't really matter.)

This could be smarter as the stdlib obviously doesn't consider the overhead of creating a new worker, but this on-demand behaviour is good enough (and manually scaling would be way too complicated).

Comment on lines +27 to +29
needs_parallel_compiler = pytest.mark.skipif(
not parallel_supported, reason="ParallelCompiler is unavailable"
)

This could be me simply being paranoid, but I wanted the tests to continue to work even on systems where multiprocessing is broken, hence this marker.

Comment on lines +27 to +39
@contextmanager
def _patch_main_module_hack() -> Iterator[None]:
"""Temporarily replace __main__ to reduce the worker startup overhead.

concurrent.futures imports the main module while initializing new workers
so any global state is retained in the workers. Unfortunately, when pip
is run from a console script wrapper, the wrapper unconditionally imports
pip._internal.cli.main and everything else it requires. This is *slow*.

The compilation code does not depend on any global state, thus the costly
re-import of pip can be avoided by replacing __main__ with any random
module that does nothing.
"""

This is ugly, but this does reduce worker startup overhead significantly. I couldn't measure an improvement on Windows, but Linux and macOS both saw major drops in startup overhead. See also: #12712 (comment).
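
For readers unfamiliar with the trick, a rough sketch of such a swap might look like this; the stand-in module name is made up for illustration and is not necessarily what the patch uses:

```python
import sys
import types
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def patch_main_module_hack() -> Iterator[None]:
    original = sys.modules["__main__"]
    # An empty stand-in module: spawned workers then have no expensive
    # main module (e.g. pip's console script wrapper) to re-import at startup.
    sys.modules["__main__"] = types.ModuleType("__pip_dummy_main__")
    try:
        yield
    finally:
        sys.modules["__main__"] = original
```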

Comment on lines +40 to +42
def _does_python_size_surpass_threshold(
requirements: Iterable[InstallRequirement], threshold: int
) -> bool:

This receives and checks against the threshold instead of simply returning the total code size to avoid inspecting more zip files than necessary (to minimize the small but real time cost).
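
A rough sketch of that early-exit idea, with the InstallRequirement/zip inspection simplified to (filename, size) pairs for illustration:

```python
from typing import Iterable, Tuple


def python_size_surpasses_threshold(
    files: Iterable[Tuple[str, int]], threshold: int
) -> bool:
    """Return True as soon as the accumulated .py size crosses the threshold."""
    total = 0
    for name, size in files:
        if not name.endswith(".py"):
            continue
        total += size
        if total >= threshold:
            # Stop as soon as the answer is known instead of summing everything.
            return True
    return False
```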

ichard26 (Member Author) commented Feb 27, 2025

Ah, of course, I forgot to run the integration tests locally (oops). I've already spent way too much time on this today, so I'll take a look tomorrow. Never mind, this is a stupid and trivially easy-to-fix bug. I pushed a fix 🤞

@ichard26 ichard26 force-pushed the perf/parallel-compile branch 2 times, most recently from 14286fd to b87d3d3 on February 27, 2025 00:41
ichard26 (Member Author) commented:

I only added unit tests as this code will be exercised in a large portion of our end-to-end tests anyway. In addition, there are already a few functional tests for bytecode compilation.

@ichard26 ichard26 force-pushed the perf/parallel-compile branch from b87d3d3 to a587e4f on February 27, 2025 00:53
@ichard26 ichard26 force-pushed the perf/parallel-compile branch from a587e4f to 15ce2bf on February 27, 2025 04:15
Comment on lines +421 to 425
if modifying_pip:
# Parallelization will re-import pip when starting new workers
# during installation which is unsafe if pip is being modified.
options.install_jobs = 1

ichard26 (Member Author) commented Feb 27, 2025

Since the concurrent executors may spin up new workers midway through package installation, a newly installed (and incompatible) pip may be imported, which could result in a crash (as seen in CI). This is a shame as parallelization does improve self-install time by ~200ms locally, but alas, pip should not crash even if we're trying to install a 10-year-old pip.

Technically, we could avoid this by patching __main__ to point to a module that's completely outside of the pip namespace (it has to be outside because it's the import of pip.__init__ that's blowing up the tests), but I'm strongly opposed to adding a new top-level module such as __pip_empty.py to our distributions.

ichard26 (Member Author) commented:

To avoid the global impact, we could special-case pip and always use a serial compiler for it, but parallelization is generally unsafe when modifying pip, so this isn't a good idea either.

ichard26 (Member Author) commented:

Oh wait, this is even more problematic than I initially thought, because the workers could import a new pip that was maliciously overwritten. We had a security report similar to this for the self-version check. Hmm, I'll need to go back to the drawing board. Hopefully there is a way to avoid the extra imports on worker startup; otherwise, I'm going to have to use multiprocessing.Pool directly and pay the worker startup overhead all at once... :(

pfmoore (Member) commented Feb 27, 2025

Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached. This is quite subtle, enough so that we can't really expect newbies to recognise and know how to address this

This may be a dumb question, but why do we care about people installing to a readonly directory? They must have done something special in order to install there, so they can hardly be considered "newbies". The most obvious case I can think of is the admin of a HPC cluster installing a shared package, but I definitely wouldn't call a HPC cluster admin a "newbie"! Expecting such a user to change their process to add --compile to the install command shouldn't be too much of an ask.

As uv appear to have demonstrated, for most people, not compiling on install simply isn't an issue.

Maybe we could just deprecate automatic byte compilation? We could document[1] it as

As of pip X.Y, byte compilation will no longer be the default - from that point, use --compile if you want to continue byte compiling. Before then, if you want to try out the change, you can use --no-compile.

Unfortunately, I can't think of a good way to add a runtime warning (short of warning on all invocations that don't include either --compile or --no-compile), but again, I'd hope that someone in an environment where compiling or not mattered would be reading the pip release announcements, at a minimum...

@pfmoore if you have time, I'd especially appreciate a review from you.

I'll make some time to take a look, but I assume from this comment:

Hmm, I'll need to go back to the drawing board.

that I should hold off for a while... 🙂

Footnotes

  1. For added coverage, we could include this note in every release through the deprecation period.

edmorley (Contributor) commented Feb 27, 2025

Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached.

A perhaps more common scenario than read-only directories, is ephemeral storage - such as when running Python in a Docker container.

If the pip install in a user's Dockerfile doesn't compile bytecode, then every time a new container is started from the image, the bytecode will need to be recreated - i.e.: at every app boot/restart/scale up event (and duplicated across all containers for apps that horizontally scale). This is why the uv Docker docs mention enabling bytecode as an optimisation for production images:
https://docs.astral.sh/uv/guides/integration/docker/#compiling-bytecode

If bytecode compilation were disabled in pip by default:

  • It would be helpful to add a note to the pip docs about using --compile in Dockerfiles, similar to uv.
  • I'd like to see this parallel compilation PR land regardless, since it will still benefit those who need to use --compile.

Disabling bytecode generation by default might actually mean the implementation in this PR could be simplified, since the feature would now be opt-in and less disruptive? e.g. The heuristic for whether to use parallel compilation could perhaps be removed, since --compile would then be something run sparingly and typically only in system/production deploy scenarios (where the number of dependencies being installed is likely larger and so would more often benefit from concurrency, plus any potential 100ms overhead in the "few/small packages" case is something such users are less likely to care about).

notatallshaw (Member) commented Feb 27, 2025

This may be a dumb question, but why do we care about people installing to a readonly directory?

I think this is a reference to the motivation that was stated here: #12920 (comment)

pfmoore (Member) commented Feb 27, 2025

Yes, but I don't think I agree with that motivation[1] - the discussion was closed saying we'd reached consensus, but IMO that was a bit of an overstatement. Furthermore, whereas I was mostly neutral when the question was "default to --no-compile" vs "leave things as they are", the question is now "default to --no-compile" vs "add a bunch of complex code to do parallel compiles". Or to put it another way, this PR is based on the assumption that "do nothing" isn't a valid solution, so the conclusion from the other thread isn't applicable anyway...

I agree that we should be considering the experience for unsophisticated users, and not putting them in a situation where they get permanently bad performance because the language's "compile on first use" mechanism can't work. But how often does an unsophisticated user use pip install to install a package to a read-only directory anyway? On Windows, the standard ways of installing Python all give you a read-write site-packages. On Linux, we're back to the never-ending battle of "you shouldn't manage the system Python using pip".[2] And if you're using a Python installation provided for you by your system administrator, it's the admin's responsibility to set things up so that either you can install to a read-write directory, or everything you need is installed and precompiled.

So I still think the question is a valid one. Is the added complexity and maintenance cost of this code worthwhile in comparison to simply switching the default to --no-compile (with a suitable transition period and user education)? Particularly when --no-compile will be even faster than parallel compiles.

My view is that no, it probably isn't.[3]

To give some indicative figures, on my PC, pip install numpy --no-compile takes around 7 seconds with a warmed up cache. pip uninstall numpy -y takes around 4 seconds, implying that over half the time of the install is spent purely shuffling files around, which is unavoidable unless we start doing things like the symlinking to a cached unpacked wheel that uv does. In comparison, pip install numpy --compile takes roughly 17 seconds, well over double the time.[4]

Other maintainers may have different views. That's fine. But ideally we should have a discussion and come to a consensus. If we can't reach consensus, I don't know what the best way forward is - at the moment, it feels frustratingly like every "big" change is blocked because we can't agree, and this could easily end up in the same situation.

Footnotes

  1. Looking back, I don't understand why I didn't ask "how is the naive user installing into a read-only directory in the first place?" That's my mistake, as it seems to undermine the whole argument that --no-compile by default would impact inexperienced users...

  2. Yes, there's plenty of bad information on the internet discussing this. No matter how much we tell them not to, people do sudo pip install --break-system-packages. But again, that's not our issue. People taking bad advice need to understand the consequences. And while sudo pip install --break-system-packages --compile is still bad advice, it's a viable option for people who thought sudo pip install --break-system-packages was reasonable...

  3. I'll still review this change, @ichard26 - I don't want the "big picture" question to block that.

  4. As a side note, uv pip install numpy --compile takes around 2 seconds to byte-compile, as compared to our 10 seconds. Which suggests that optimising the compile step is worthwhile, and can give useful benefits, even if we don't compile by default.

notatallshaw (Member) commented:

Yes, but I don't think I agree with that motivation[1] - the discussion was closed saying we'd reached consensus, but IMO that was a bit of an overstatement

Fair enough, I've re-opened that issue. My view at the time was that I didn't want to create a protracted discussion, because in general changing a default requires there to be almost unanimous consent, or at least no vocal detractor, whereas leaving a default the same requires no action.

Can we move discussion of the default compile option to that issue, please? IMO this PR should be reviewed on the complexity it introduces vs. the improvements it makes (reduced compile time). Whether compiling is the default can be treated as an orthogonal choice, and having a lengthy discussion about it here makes it difficult to review this PR on its own merits.
