Parallelize bytecode compilation ✨ #13247

Open · wants to merge 1 commit into main

Conversation

ichard26 (Member)

Towards #12712.

Note

I would appreciate additional testing of this feature to ensure it's decently robust.

To test this patch locally, you can install from my branch using this command:

 pip install https://github.com/ichard26/pip/archive/perf/parallel-compile.zip

If you're curious whether parallelization is being used, pass -vv and look for these logs:

Installing collected packages: pluggy, packaging, iniconfig, pytest
Detected CPU count: 16
Configured worker count: auto
Bytecode will be compiled using at most 8 workers

Bytecode compilation is slow. It's often one of the biggest contributors to the install step's sluggishness. For better or worse, we can't really enable --no-compile by default as it has the potential to render certain workflows permanently slower in a subtle way.[1]

To improve the situation, bytecode compilation can be parallelized across a pool of processes (or sub-interpreters on Python 3.14). I've observed a 1.1x to 3x improvement in install step times locally.[2]

This patch has been written to be relatively comprehensible, but for posterity, these are the high-level implementation notes:

  • We can't use compileall.compile_dir() because it spins up a new worker pool on every invocation. If it's used as a "drop-in" replacement for compileall.compile_file(), then the pool creation overhead will be paid for every package installed. This is bad and kills most of the gains. Redesigning the installation logic to compile everything at the end was rejected for being too invasive (a key goal was to avoid affecting the package installation order).

  • A bytecode compiler is created right before package installation starts and reused for all packages. Depending on platform and workload, either a serial (in-process) compiler or a parallel compiler will be used. They both have the same interface, accepting a batch of Python filepaths to compile (a rough sketch of such an interface is shown after this list).

  • This patch was designed to be as low-risk as reasonably possible. pip does not currently contain any parallelized code, so introducing any sort of parallelization poses a nontrivial risk. To minimize this risk, the only code parallelized is the bytecode compilation code itself (~10 LOC). In addition, the package install order is unaffected and pip will fall back to serial compilation if parallelization is unsupported.
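
For illustration, here is a minimal sketch of what such a shared interface could look like. The class and function names are hypothetical, not necessarily those used in the patch:

```python
import compileall
from concurrent import futures
from typing import Iterable, Protocol


class BytecodeCompiler(Protocol):
    def __call__(self, paths: Iterable[str]) -> None: ...


def _compile_one(path: str) -> bool:
    # Module-level so it can be pickled and sent to worker processes.
    return compileall.compile_file(path, force=True, quiet=True)


class SerialCompiler:
    """Compile a batch of files in-process, one at a time."""

    def __call__(self, paths: Iterable[str]) -> None:
        for path in paths:
            _compile_one(path)


class ParallelCompiler:
    """Compile a batch of files using a long-lived pool of workers."""

    def __init__(self, workers: int) -> None:
        # Created once before installation starts and reused for every package,
        # so the pool startup cost is paid only once.
        self._pool = futures.ProcessPoolExecutor(max_workers=workers)

    def __call__(self, paths: Iterable[str]) -> None:
        list(self._pool.map(_compile_one, paths))
```

The point is only that the installation code can call either object the same way; the actual patch keeps the parallelized portion down to a handful of lines.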

Important

I've tried my best to document my code and design this patch in such a way to keep it reviewable and maintainable. However, this is a large PR nonetheless, so I expect there are parts which are confusing or poorly designed. If you have concerns or questions, please speak up!

The criteria for parallelization are:

  1. There are at least 2 CPUs available. The process CPU count is used if available, otherwise the system CPU count. If there is only one CPU, serial compilation will always be used because even a parallel compiler with one worker will add extra overhead.

  2. The maximum number of workers is at least 2. This is controlled by the --install-jobs option.[3] It defaults to "auto" which uses the process/system CPU count.[4]

  3. There is "enough" code for parallelization to be "worth it". This criterion exists so pip won't waste (say) 100ms on spinning up a parallel compiler when compiling serially would only take 20ms.[5] The limit is set to 1 MB of Python code. This is admittedly rather crude, but it seems to work well enough, having been tested on a variety of systems. (A rough sketch of how these criteria combine is shown after this list.)
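
As a rough, illustrative sketch of how these three criteria could combine (the constants and helper names below are assumptions for the example, not pip's actual code):

```python
from __future__ import annotations

import os

MAX_WORKERS = 8                   # hard-coded ceiling (see footnote 4)
CODE_SIZE_THRESHOLD = 1_000_000   # ~1 MB of Python source (criterion 3)


def effective_cpu_count() -> int:
    # Prefer the CPUs available to this process, falling back to the system count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


def choose_worker_count(install_jobs: int | str, code_size: int) -> int:
    """Return 1 for serial compilation, otherwise the parallel worker count."""
    cpus = effective_cpu_count()
    if cpus < 2:                          # criterion 1
        return 1
    requested = cpus if install_jobs == "auto" else int(install_jobs)
    workers = min(requested, cpus, MAX_WORKERS)
    if workers < 2:                       # criterion 2
        return 1
    if code_size < CODE_SIZE_THRESHOLD:   # criterion 3
        return 1
    return workers
```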

Footnotes

  1. Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached. This is quite subtle, enough so that we can't really expect newbies to recognise and know how to address this (there is the PYTHONPYCACHEPREFIX envvar, but if you're advanced enough to use it, then you are also advanced enough to know when to use uv or pip's --no-compile).

  2. The 1.1x was on a painfully slow dual-core/HDD-equipped Windows install that was installing just setuptools. The 3x was observed on my main 8-core Ryzen 5800HS Windows machine while installing pip's own test dependencies.

  3. Yes, this is probably not the best name, but adding an option for just bytecode compilation seems silly. Anyway, this will give us room if we ever parallelize more parts of the install step.

  4. Up to a hard-coded limit of 8 to avoid resource exhaustion. This number was chosen arbitrarily, but is definitely high enough to net a major improvement.

  5. This is important because I don't want to slow down tiny installs (e.g., pip install six ... or our own test suite). Creating a new process is prohibitively expensive on Windows (and to a lesser degree on macOS) for various reasons, so parallelization can't simply be used all of the time.

@ichard26 ichard26 added the "type: performance" (Commands take too long to run) label on Feb 27, 2025
@ichard26 ichard26 (Member Author) left a comment

Here are some specific review notes. @pfmoore if you have time, I'd especially appreciate a review from you. I know this is a 400+ LOC change, but a good portion of it is the unit tests and comments.

success = compileall.compile_file(path, force=True, quiet=True)
if success:
pyc_path = pyc_output_path(path)
assert os.path.exists(pyc_path)

I've removed this assert as part of this PR as I don't think we've ever received a bug report about the .pyc file not existing. If it were to be missing, that sounds like a stdlib bug. I'd be happy to add it back if desired, though.

Comment on lines +63 to +66
pyc_path = importlib.util.cache_from_source(py_path) # type: ignore[arg-type]
return CompileResult(
str(py_path), pyc_path, success, stdout.getvalue() # type: ignore[arg-type]
)

These type ignores are due to incorrect type annotations in typeshed. I've filed PRs to fix them, but they haven't been incorporated in a mypy release yet.

Comment on lines +56 to +58
# compile_file() returns True silently even if the source file is nonexistent.
if not os.path.exists(py_path):
raise FileNotFoundError(f"Python file '{py_path!s}' does not exist")

This is a new check. While writing unit tests for this function, I discovered that compileall.compile_file() silently returns True even if the source Python file is nonexistent. This makes sense when you consider the compileall CLI, but it's not ideal for us.

I'm hoping that there isn't a pre-existing bug where pip attempts to compile a nonexistent file...
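
For reference, the behaviour described above can be reproduced in isolation with just the stdlib:

```python
import compileall

# As described above: compile_file() reports success even though the
# source file does not exist (nothing is actually compiled).
ok = compileall.compile_file("this_file_does_not_exist.py", quiet=True)
print(ok)  # True
```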

"""Compile a set of Python modules using a pool of workers."""

def __init__(self, workers: int) -> None:
from concurrent import futures

This import is done here so I can catch any import errors and fall back to serial compilation.
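
As an illustrative sketch of this import-and-fall-back pattern (the helper below is a simplified stand-in, not pip's actual implementation):

```python
from functools import partial
from typing import Callable, Iterable

import compileall

_compile = partial(compileall.compile_file, force=True, quiet=True)


def make_compiler(workers: int) -> Callable[[Iterable[str]], None]:
    def serial(paths: Iterable[str]) -> None:
        for path in paths:
            _compile(path)

    try:
        # Imported lazily so a broken multiprocessing setup (e.g. a platform
        # without a working sem_open) is caught here instead of crashing pip.
        from concurrent import futures
        import multiprocessing.synchronize  # noqa: F401
    except ImportError:
        return serial

    if workers < 2:
        return serial

    pool = futures.ProcessPoolExecutor(max_workers=workers)

    def parallel(paths: Iterable[str]) -> None:
        # Consume the iterator so any worker exceptions propagate here.
        list(pool.map(_compile, paths))

    return parallel
```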

Comment on lines +158 to +161
# The concurrent executors will spin up new workers on an "on-demand" basis,
# which helps to avoid wasting time on starting new workers that won't be
# used. (** This isn't true for the fork start method, but forking is
# fast enough that it doesn't really matter.)

This could be smarter as the stdlib obviously doesn't consider the overhead of creating a new worker, but this on-demand behaviour is good enough (and manually scaling would be way too complicated).

Comment on lines +27 to +29
needs_parallel_compiler = pytest.mark.skipif(
not parallel_supported, reason="ParallelCompiler is unavailable"
)

This could be me simply being paranoid, but I wanted the tests to continue to work even on systems where multiprocessing is broken, hence this marker.

Comment on lines +27 to +39
@contextmanager
def _patch_main_module_hack() -> Iterator[None]:
"""Temporarily replace __main__ to reduce the worker startup overhead.

concurrent.futures imports the main module while initializing new workers
so any global state is retained in the workers. Unfortunately, when pip
is run from a console script wrapper, the wrapper unconditionally imports
pip._internal.cli.main and everything else it requires. This is *slow*.

The compilation code does not depend on any global state, thus the costly
re-import of pip can be avoided by replacing __main__ with any random
module that does nothing.
"""

This is ugly, but this does reduce worker startup overhead significantly. I couldn't measure an improvement on Windows, but Linux and macOS both saw major drops in startup overhead. See also: #12712 (comment).
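
For readers unfamiliar with the trick, a rough sketch of such a swap might look like this; the stand-in module name is made up for illustration and is not necessarily what the patch uses:

```python
import sys
import types
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def patch_main_module_hack() -> Iterator[None]:
    original = sys.modules["__main__"]
    # An empty stand-in module: spawned workers then have no expensive
    # main module (e.g. pip's console script wrapper) to re-import at startup.
    sys.modules["__main__"] = types.ModuleType("__pip_dummy_main__")
    try:
        yield
    finally:
        sys.modules["__main__"] = original
```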

Comment on lines +40 to +42
def _does_python_size_surpass_threshold(
requirements: Iterable[InstallRequirement], threshold: int
) -> bool:

This receives and checks against the threshold instead of simply returning the total code size to avoid inspecting more zip files than necessary (to minimize the small but real time cost).
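
A rough sketch of that early-exit idea, with the InstallRequirement/zip inspection simplified to (filename, size) pairs for illustration:

```python
from typing import Iterable, Tuple


def python_size_surpasses_threshold(
    files: Iterable[Tuple[str, int]], threshold: int
) -> bool:
    """Return True as soon as the accumulated .py size crosses the threshold."""
    total = 0
    for name, size in files:
        if not name.endswith(".py"):
            continue
        total += size
        if total >= threshold:
            # Stop as soon as the answer is known instead of summing everything.
            return True
    return False
```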

ichard26 (Member Author) commented Feb 27, 2025

Ah, of course, I forgot to run the integration tests locally (oops). I've already spent way too much time on this today, so I'll take a look tomorrow. Never mind, this is a stupid and trivially easy-to-fix bug. I pushed a fix 🤞

@ichard26 ichard26 force-pushed the perf/parallel-compile branch 2 times, most recently from 14286fd to b87d3d3 on February 27, 2025 00:41
ichard26 (Member Author) commented:

I only added unit tests as this code will be exercised in a large portion of our end-to-end tests anyway. In addition, there are already a few functional tests for bytecode compilation.

@ichard26 ichard26 force-pushed the perf/parallel-compile branch from b87d3d3 to a587e4f on February 27, 2025 00:53
@ichard26 ichard26 force-pushed the perf/parallel-compile branch from a587e4f to 15ce2bf on February 27, 2025 04:15
Comment on lines +421 to 425
if modifying_pip:
# Parallelization will re-import pip when starting new workers
# during installation which is unsafe if pip is being modified.
options.install_jobs = 1

ichard26 (Member Author) commented Feb 27, 2025

Since the concurrent executors may spin up new workers midway through package installation, a newly installed (and incompatible) pip may be imported, which could result in a crash (as seen in CI). This is a shame as parallelization does improve self-install time by ~200ms locally, but alas, pip should not crash even if we're trying to install a 10-year-old pip.

Technically, we could avoid this by patching __main__ to point to a module that's completely outside of the pip namespace (it has to be outside because it's the import of pip.__init__ that's blowing up the tests), but I'm strongly opposed to adding a new top-level module such as __pip_empty.py to our distributions.

ichard26 (Member Author) commented:

To avoid the global impact, we could special-case pip and always use a serial compiler for it, but parallelization is generally unsafe when modifying pip, so this isn't a good idea either.

ichard26 (Member Author) commented:

Oh wait, this is even more problematic than I initially thought, because the workers could import a new pip that was maliciously overwritten. We had a security report similar to this for the self-version check. Hmm, I'll need to go back to the drawing board. Hopefully there is a way to avoid the extra imports on worker startup; otherwise, I'm going to have to use multiprocessing.Pool directly and pay the worker startup overhead all at once... :(

pfmoore (Member) commented Feb 27, 2025

Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached. This is quite subtle, enough so that we can't really expect newbies to recognise and know how to address this

This may be a dumb question, but why do we care about people installing to a readonly directory? They must have done something special in order to install there, so they can hardly be considered "newbies". The most obvious case I can think of is the admin of a HPC cluster installing a shared package, but I definitely wouldn't call a HPC cluster admin a "newbie"! Expecting such a user to change their process to add --compile to the install command shouldn't be too much of an ask.

As uv appear to have demonstrated, for most people, not compiling on install simply isn't an issue.

Maybe we could just deprecate automatic byte compilation? We could document[1] it as

As of pip X.Y, byte compilation will no longer be the default - from that point, use --compile if you want to continue byte compiling. Before then, if you want to try out the change, you can use --no-compile.

Unfortunately, I can't think of a good way to add a runtime warning (short of warning on all invocations that don't include either --compile or --no-compile), but again, I'd hope that someone in an environment where compiling or not mattered would be reading the pip release announcements, at a minimum...

@pfmoore if you have time, I'd especially appreciate a review from you.

I'll make some time to take a look, but I assume from this comment:

Hmm, I'll need to go back to the drawing board.

that I should hold off for a while... 🙂

Footnotes

  1. For added coverage, we could include this note in every release through the deprecation period.

edmorley (Contributor) commented Feb 27, 2025

Basically, if the Python files are installed to a read-only directory, then importing those files will be permanently slower as the .pyc files will never be cached.

A perhaps more common scenario than read-only directories, is ephemeral storage - such as when running Python in a Docker container.

If the pip install in a user's Dockerfile doesn't compile bytecode, then every time a new container is started from the image, the bytecode will need to be recreated - i.e.: at every app boot/restart/scale up event (and duplicated across all containers for apps that horizontally scale). This is why the uv Docker docs mention enabling bytecode as an optimisation for production images:
https://docs.astral.sh/uv/guides/integration/docker/#compiling-bytecode

If bytecode compilation were disabled in pip by default:

  • It would be helpful to add a note to the pip docs about using --compile in Dockerfiles, similar to uv.
  • I'd like to see this parallel compilation PR land regardless, since it will still benefit those who need to use --compile.

Disabling bytecode generation by default might actually mean the implementation in this PR could be simplified, since the feature would now be opt-in and less disruptive? e.g. The heuristic for whether to use parallel compilation could perhaps be removed, since --compile would then be something run sparingly and typically only in system/production deploy scenarios (where the number of dependencies being installed is likely larger and so would more often benefit from concurrency, plus any potential 100ms overhead in the "few/small packages" case is something such users are less likely to care about).

notatallshaw (Member) commented Feb 27, 2025

This may be a dumb question, but why do we care about people installing to a readonly directory?

I think this is a reference to the motivation that was stated here: #12920 (comment)

pfmoore (Member) commented Feb 27, 2025

Yes, but I don't think I agree with that motivation[1] - the discussion was closed saying we'd reached consensus, but IMO that was a bit of an overstatement. Furthermore, whereas I was mostly neutral when the question was "default to --no-compile" vs "leave things as they are", the question is now "default to --no-compile" vs "add a bunch of complex code to do parallel compiles". Or to put it another way, this PR is based on the assumption that "do nothing" isn't a valid solution, so the conclusion from the other thread isn't applicable anyway...

I agree that we should be considering the experience for unsophisticated users, and not putting them in a situation where they get permanently bad performance because the language's "compile on first use" mechanism can't work. But how often does an unsophisticated user use pip install to install a package to a read-only directory anyway? On Windows, the standard ways of installing Python all give you a read-write site-packages. On Linux, we're back to the never-ending battle of "you shouldn't manage the system Python using pip".[2] And if you're using a Python installation provided for you by your system administrator, it's the admin's responsibility to set things up so that either you can install to a read-write directory, or everything you need is installed and precompiled.

So I still think the question is a valid one. Is the added complexity and maintenance cost of this code worthwhile in comparison to simply switching the default to --no-compile (with a suitable transition period and user education)? Particularly when --no-compile will be even faster than parallel compiles.

My view is that no, it probably isn't.[3]

To give some indicative figures, on my PC, pip install numpy --no-compile takes around 7 seconds with a warmed up cache. pip uninstall numpy -y takes around 4 seconds, implying that over half the time of the install is spent purely shuffling files around, which is unavoidable unless we start doing things like the symlinking to a cached unpacked wheel that uv does. In comparison, pip install numpy --compile takes roughly 17 seconds, well over double the time.[4]

Other maintainers may have different views. That's fine. But ideally we should have a discussion and come to a consensus. If we can't reach consensus, I don't know what the best way forward is - at the moment, it feels frustratingly like every "big" change is blocked because we can't agree, and this could easily end up in the same situation.

Footnotes

  1. Looking back, I don't understand why I didn't ask "how is the naive user installing into a read-only directory in the first place?" That's my mistake, as it seems to undermine the whole argument that --no-compile by default would impact inexperienced users...

  2. Yes, there's plenty of bad information on the internet discussing this. No matter how much we tell them not to, people do sudo pip install --break-system-packages. But again, that's not our issue. People taking bad advice need to understand the consequences. And while sudo pip install --break-system-packages --compile is still bad advice, it's a viable option for people who thought sudo pip install --break-system-packages was reasonable...

  3. I'll still review this change, @ichard26 - I don't want the "big picture" question to block that.

  4. As a side note, uv pip install numpy --compile takes around 2 seconds to byte-compile, as compared to our 10 seconds. Which suggests that optimising the compile step is worthwhile, and can give useful benefits, even if we don't compile by default.

notatallshaw (Member) commented:

Yes, but I don't think I agree with that motivation[1] - the discussion was closed saying we'd reached consensus, but IMO that was a bit of an overstatement

Fair enough, I've re-opened that issue. My view at the time was that I didn't want to create a protracted discussion, because in general changing a default requires there to be almost unanimous consent, or at least no vocal detractor, whereas leaving a default the same requires no action.

Can we move discussion of the default compile option to that issue, please? IMO this PR should be reviewed on the complexity it introduces vs. the improvements it makes (reduced compile time). Whether compiling is the default can be treated as an orthogonal choice, and having a lengthy discussion about it here makes it difficult to review this PR on its own merits.
