Parallelize bytecode compilation ✨ #13247
Conversation
Here are some specific review notes. @pfmoore if you have time, I'd especially appreciate a review from you. I know this is a 400+ LOC change, but a good portion of it is the unit tests and comments.
success = compileall.compile_file(path, force=True, quiet=True)
if success:
    pyc_path = pyc_output_path(path)
    assert os.path.exists(pyc_path)
I've removed this assert as part of this PR as I don't think we've ever received a bug report about the .pyc file not existing. If it were to be missing, that sounds like a stdlib bug. I'd be happy to add it back if desired, though.
pyc_path = importlib.util.cache_from_source(py_path)  # type: ignore[arg-type]
return CompileResult(
    str(py_path), pyc_path, success, stdout.getvalue()  # type: ignore[arg-type]
)
These type ignores are due to incorrect type annotations in typeshed. I've filed PRs to fix them, but they haven't been incorporated in a mypy release yet.
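For context, importlib.util.cache_from_source() maps a source path to its PEP 3147 __pycache__ location. A minimal illustration (the path is made up, and the interpreter tag in the result depends on the running Python):

```python
import importlib.util

# e.g. "src/pkg/mod.py" -> "src/pkg/__pycache__/mod.cpython-312.pyc" on CPython 3.12
print(importlib.util.cache_from_source("src/pkg/mod.py"))
```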
# compile_file() returns True silently even if the source file is nonexistent.
if not os.path.exists(py_path):
    raise FileNotFoundError(f"Python file '{py_path!s}' does not exist")
This is a new check. While writing unit tests for this function, I discovered that compileall.compile_file() silently returns True even if the source Python file is nonexistent. This makes sense when you consider the compileall CLI, but it's not ideal for us.
I'm hoping that there isn't a pre-existing bug where pip attempts to compile a nonexistent file...
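To illustrate why the guard matters, here is a small hedged sketch (the path is hypothetical) of the behaviour described above and the explicit check that compensates for it:

```python
import compileall
import os

py_path = "does/not/exist.py"  # hypothetical, deliberately nonexistent

# Per the observation above, compile_file() still reports success here
# because there is nothing for it to compile.
assert compileall.compile_file(py_path, force=True, quiet=2)

# Hence the explicit existence check before compiling:
if not os.path.exists(py_path):
    raise FileNotFoundError(f"Python file '{py_path!s}' does not exist")
```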
"""Compile a set of Python modules using a pool of workers.""" | ||
|
||
def __init__(self, workers: int) -> None: | ||
from concurrent import futures |
This import is done here so I can catch any import errors and fall back to serial compilation.
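As a rough sketch of that pattern (class and function names here are illustrative, not the PR's actual identifiers):

```python
class ParallelCompiler:
    def __init__(self, workers: int) -> None:
        # Deferred import: if multiprocessing support is missing or broken,
        # this raises ImportError instead of failing at module import time.
        from concurrent import futures

        self.workers = workers
        self.pool = futures.ProcessPoolExecutor(max_workers=workers)


def make_compiler(workers: int):
    try:
        return ParallelCompiler(workers)
    except ImportError:
        return None  # the caller would fall back to a serial, in-process compiler
```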
# The concurrent executors will spin up new workers on an "on-demand" basis,
# which helps to avoid wasting time on starting new workers that won't be
# used. (** This isn't true for the fork start method, but forking is
# fast enough that it doesn't really matter.)
This could be smarter as the stdlib obviously doesn't consider the overhead of creating a new worker, but this on-demand behaviour is good enough (and manually scaling would be way too complicated).
needs_parallel_compiler = pytest.mark.skipif(
    not parallel_supported, reason="ParallelCompiler is unavailable"
)
This could simply be me being paranoid, but I wanted the tests to continue to work even on systems where multiprocessing is broken, hence this marker.
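For context, the marker would then decorate individual tests, e.g. (the test below is illustrative only):

```python
@needs_parallel_compiler
def test_parallel_compile_smoke(tmp_path):
    # Skipped automatically on systems where ParallelCompiler can't be used.
    ...
```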
@contextmanager
def _patch_main_module_hack() -> Iterator[None]:
    """Temporarily replace __main__ to reduce the worker startup overhead.

    concurrent.futures imports the main module while initializing new workers
    so any global state is retained in the workers. Unfortunately, when pip
    is run from a console script wrapper, the wrapper unconditionally imports
    pip._internal.cli.main and everything else it requires. This is *slow*.

    The compilation code does not depend on any global state, thus the costly
    re-import of pip can be avoided by replacing __main__ with any random
    module that does nothing.
    """
This is ugly, but this does reduce worker startup overhead significantly. I couldn't measure an improvement on Windows, but Linux and macOS both saw major drops in startup overhead. See also: #12712 (comment).
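A minimal sketch of what this hack could look like, assuming the throwaway module is created with types.ModuleType (the dummy module name is my invention, not the PR's):

```python
import sys
import types
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def _patch_main_module_hack() -> Iterator[None]:
    # Swap __main__ for an empty throwaway module so workers spawned by
    # concurrent.futures don't re-import pip's heavy CLI entry point.
    original_main = sys.modules["__main__"]
    sys.modules["__main__"] = types.ModuleType("_pip_dummy_main")
    try:
        yield
    finally:
        # Always restore the real __main__, even if compilation fails.
        sys.modules["__main__"] = original_main
```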
def _does_python_size_surpass_threshold(
    requirements: Iterable[InstallRequirement], threshold: int
) -> bool:
This receives and checks against the threshold instead of simply returning the total code size to avoid inspecting more zip files than necessary (to minimize the small, but real time cost).
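A hedged sketch of the short-circuit idea (the real helper walks wheel contents; here I pass plain sizes to keep the example self-contained):

```python
from typing import Iterable


def _sizes_surpass_threshold(sizes: Iterable[int], threshold: int) -> bool:
    """Return True as soon as the running total exceeds ``threshold`` bytes,
    so remaining items (i.e. further zip files) never need to be inspected."""
    total = 0
    for size in sizes:
        total += size
        if total > threshold:
            return True
    return False


# With a 1 MB threshold, only the first two entries are ever consumed here.
assert _sizes_surpass_threshold(iter([600_000, 700_000, 900_000]), 1_000_000)
```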
Ah, of course, I forgot to run the integration tests locally (oops).
I only added unit tests as this code will be exercised in a large portion of our end-to-end tests anyway. In addition, there are already a few functional tests for bytecode compilation.
if modifying_pip:
    # Parallelization will re-import pip when starting new workers
    # during installation which is unsafe if pip is being modified.
    options.install_jobs = 1
Since the concurrent executors may spin up new workers midway through package installation, a newly installed (and incompatible) pip may be imported, which could result in a crash (as seen in CI). This is a shame as parallelization does improve self-install time by ~200ms locally, but alas, pip should not crash even if we're trying to install a 10-year-old pip.
Technically, we could avoid this by patching __main__ to point to a module that's completely outside of the pip namespace (it has to be outside because it's the import of pip.__init__ that's blowing up the tests), but I'm strongly opposed to adding a new top-level module such as __pip_empty.py to our distributions.
To avoid the global impact, we could special-case pip and use a serial compiler no matter what for pip, but parallelization is generally unsafe when modifying pip so this isn't a good idea either.
Oh wait, this is even more problematic than I initially thought because the workers could import a new pip that was maliciously overwritten. We had a security report similar to this for the self-version check. Hmm, I'll need to go back to the drawing board. Hopefully there is a way to avoid the extra imports on worker startup; otherwise, I'm going to have to use multiprocessing.Pool directly and pay the worker startup overhead all at once... :(
This may be a dumb question, but why do we care about people installing to a readonly directory? They must have done something special in order to install there, so they can hardly be considered "newbies". The most obvious case I can think of is the admin of a HPC cluster installing a shared package, but I definitely wouldn't call a HPC cluster admin a "newbie"! Expecting such a user to change their process to add

As uv appear to have demonstrated, for most people, not compiling on install simply isn't an issue. Maybe we could just deprecate automatic byte compilation? We could document it as

Unfortunately, I can't think of a good way to add a runtime warning (short of warning on all invocations that don't include either

I'll make some time to take a look, but I assume from this comment:

that I should hold off for a while... 🙂
A perhaps more common scenario than read-only directories is ephemeral storage - such as when running Python in a Docker container.

If bytecode compilation were disabled in pip by default:

Disabling bytecode generation by default might actually mean the implementation in this PR could be simplified, since the feature would now be opt-in and less disruptive? e.g. The heuristic for whether to use parallel compilation could perhaps be removed, since
I think this is a reference to the motivation that was stated here: #12920 (comment)
Yes, but I don't think I agree with that motivation - the discussion was closed saying we'd reached consensus, but IMO that was a bit of an overstatement. Furthermore, whereas I was mostly neutral when the question was "default to

I agree that we should be considering the experience for unsophisticated users, and not putting them in a situation where they get permanently bad performance because the language's "compile on first use" mechanism can't work. But how often does an unsophisticated user use

So I still think the question is a valid one. Is the added complexity and maintenance cost of this code worthwhile in comparison to simply switching the default to

My view is that no, it probably isn't. To give some indicative figures, on my PC,

Other maintainers may have different views. That's fine. But ideally we should have a discussion and come to a consensus. If we can't reach consensus, I don't know what the best way forward is - at the moment, it feels frustratingly like every "big" change is blocked because we can't agree, and this could easily end up in the same situation.
Fair enough, I've re-opened that issue. My view at the time was that I didn't want to create a protracted discussion, because in general changing a default requires there to be almost unanimous consent, or at least no vocal detractor, whereas leaving a default the same requires no action. Can we move discussion of the default compile option to that issue please. IMO this PR should be reviewed on the complexity it introduces vs. the improvements it makes (reduced compile time). Whether compiling is the default can be treated as an orthogonal choice, and having a lengthy discussion about it here makes it difficult to review this PR on its own merits.
Towards #12712.
Note
I would appreciate additional testing of this feature to ensure it's decently robust.
To test this patch locally, you can install from my branch using this command:
If you're curious whether parallelization is being used, pass -vv and look for these logs:

Bytecode compilation is slow. It's often one of the biggest contributors to the install step's sluggishness. For better or worse, we can't really enable --no-compile by default as it has the potential to render certain workflows permanently slower in a subtle way.[1]

To improve the situation, bytecode compilation can be parallelized across a pool of processes (or sub-interpreters on Python 3.14). I've observed a 1.1x to 3x improvement in install step times locally.[2]
This patch has been written to be relatively comprehensible, but for posterity, these are the high-level implementation notes:
- We can't use compileall.compile_dir() because it spins up a new worker pool on every invocation. If it's used as a "drop-in" replacement for compileall.compile_file(), then the pool creation overhead will be paid for every package installed. This is bad and kills most of the gains. Redesigning the installation logic to compile everything at the end was rejected for being too invasive (a key goal was to avoid affecting the package installation order).
- A bytecode compiler is created right before package installation starts and reused for all packages. Depending on platform and workload, either a serial (in-process) compiler or parallel compiler will be used. They both have the same interface, accepting a batch of Python filepaths to compile (a rough sketch of such an interface follows this list).
- This patch was designed to be as low-risk as reasonably possible. pip does not contain any parallelized code, thus introducing any sort of parallelization poses a nontrivial risk. To minimize this risk, the only code parallelized is the bytecode compilation code itself (~10 LOC). In addition, the package install order is unaffected and pip will fall back to serial compilation if parallelization is unsupported.
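To make the "same interface" point concrete, here is a rough sketch of what such a shared interface could look like (names and signatures are my assumptions, not the PR's actual code):

```python
from typing import Iterable, Protocol


class BytecodeCompiler(Protocol):
    """Both compilers accept a batch of Python file paths to compile."""

    def compile(self, paths: Iterable[str]) -> None: ...


class SerialCompiler:
    """In-process fallback used when parallelization isn't worthwhile."""

    def compile(self, paths: Iterable[str]) -> None:
        import compileall

        for path in paths:
            compileall.compile_file(path, force=True, quiet=True)
```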
Important
I've tried my best to document my code and design this patch in such a way to keep it reviewable and maintainable. However, this is a large PR nonetheless, so I expect there are parts which are confusing or poorly designed. If you have concerns or questions, please speak up!
The criteria for parallelization are:
1. There are at least 2 CPUs available. The process CPU count is used if available, otherwise the system CPU count. If there is only one CPU, serial compilation will always be used because even a parallel compiler with one worker will add extra overhead.
2. The maximum number of workers is at least 2. This is controlled by the --install-jobs option.[3] It defaults to "auto" which uses the process/system CPU count.[4]
3. There is "enough" code for parallelization to be "worth it". This criterion exists so pip won't waste (say) 100ms on spinning up a parallel compiler when compiling serially would only take 20ms.[5] The limit is set to 1 MB of Python code. This is admittedly rather crude, but it seems to work well enough, having tested on a variety of systems. (A sketch combining these checks follows this list.)
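Putting the three criteria together, the decision roughly boils down to the following sketch (constants, helper name, and the CPU-count probe are illustrative, not the PR's actual code):

```python
import os

WORKER_LIMIT = 8                 # hard-coded cap (see footnote 4)
CODE_SIZE_THRESHOLD = 1_000_000  # ~1 MB of Python code (see footnote 5)


def should_parallelize(install_jobs: "int | str", python_code_size: int) -> bool:
    # Criterion 1: at least two CPUs (process CPU count if available).
    if hasattr(os, "process_cpu_count"):  # Python 3.13+
        cpus = os.process_cpu_count() or 1
    else:
        cpus = os.cpu_count() or 1
    if cpus < 2:
        return False
    # Criterion 2: --install-jobs must allow at least two workers.
    if install_jobs == "auto":
        workers = min(cpus, WORKER_LIMIT)
    else:
        workers = int(install_jobs)
    if workers < 2:
        return False
    # Criterion 3: enough code that the worker startup cost pays for itself.
    return python_code_size > CODE_SIZE_THRESHOLD
```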
Footnotes

1. Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached. This is quite subtle, enough so that we can't really expect newbies to recognise and know how to address this (there is the PYTHONPYCACHEPREFIX envvar, but if you're advanced enough to use it, then you are also advanced enough to know when to use uv or pip's --no-compile).
2. The 1.1x was on a painfully slow dual-core/HDD-equipped Windows install installing simply setuptools. The 3x was observed on my main 8-core Ryzen 5800HS Windows machine while installing pip's own test dependencies.
3. Yes, this is probably not the best name, but adding an option for just bytecode compilation seems silly. Anyway, this will give us room if we ever parallelize more parts of the install step.
4. Up to a hard-coded limit of 8 to avoid resource exhaustion. This number was chosen arbitrarily, but is definitely high enough to net a major improvement.
5. This is important because I don't want to slow down tiny installs (e.g., pip install six ... or our own test suite). Creating a new process is prohibitively expensive on Windows (and to a lesser degree on macOS) for various reasons, so parallelization can't simply be used all of the time.