Parallel post transform and write_doc_serialized
- Introduce config.enable_parallel_post_transform
- html: merge images after parallel write_doc_serialized
- html: merge back search indexer
- Updated CHANGES and AUTHORS
ubmarco committed Nov 16, 2023
1 parent 3596590 commit 8f32239
Showing 12 changed files with 225 additions and 24 deletions.
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -68,6 +68,7 @@ Contributors
* Lars Hupfeldt Nielsen - OpenSSL FIPS mode md5 bug fix
* Łukasz Langa -- partial support for autodoc
* Marco Buttu -- doctest extension (pyversion option)
* Marco Heinemann -- multiprocessing improvements
* Martin Hans -- autodoc improvements
* Martin Larralde -- additional napoleon admonitions
* Martin Mahner -- nature theme
5 changes: 5 additions & 0 deletions CHANGES.rst
@@ -22,6 +22,11 @@ Features added

.. _`<search>`: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/search

* #10779 and #11448: Parallel execution of post-transformation and
``write_doc_serialized()`` as an experimental feature.
Speeds up builds featuring expensive post-transforms by a factor of at least 2.
Patch by Marco Heinemann.

Bugs fixed
----------

2 changes: 2 additions & 0 deletions doc/extdev/builderapi.rst
@@ -21,6 +21,7 @@ Builder API
.. autoattribute:: supported_remote_images
.. autoattribute:: supported_data_uri_images
.. autoattribute:: default_translator_class
.. autoattribute:: post_transform_merge_attr

These methods are predefined and will be called from the application:

@@ -37,6 +38,7 @@ Builder API
.. automethod:: get_target_uri
.. automethod:: prepare_writing
.. automethod:: write_doc
.. automethod:: merge_builder_post_transform
.. automethod:: finish

**Attributes**
72 changes: 72 additions & 0 deletions doc/usage/configuration.rst
@@ -787,6 +787,78 @@ General configuration

.. versionadded:: 5.1

.. confval:: enable_parallel_post_transform

Default is ``False``.
This experimental feature flag activates parallel post-transformation during
the parallel write phase. When enabled, the :event:`doctree-resolved` event and
the builder method ``write_doc_serialized`` also run in parallel.
Parallel post-transformation can greatly improve build time for extensions that
do heavy computation in that phase. Depending on machine core count and project
size, build time reductions by a factor of 2 to 4 (and sometimes more) have
been observed.
The feature flag has no effect if parallel writing is not used.
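
For example, the feature can be tried out by enabling it in a project's
``conf.py`` (a minimal sketch; parallel writing must also be in use,
e.g. via ``sphinx-build -j auto``):

.. code-block:: python

   # conf.py -- opt in to the experimental parallel post-transform feature
   enable_parallel_post_transform = True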

*Background*

By default, if parallel writing is active (that is, no extension inhibits it
via its :ref:`metadata <ext-metadata>`), the following logic applies:

.. code-block:: text

   For each chunk of docnames:
     main process: post-transform including doctree-resolved, encapsulated by
                   BuildEnvironment.get_and_resolve_doctree()
     main process: Builder.write_doc_serialized()
     sub process:  Builder.write_doc()

This means only the ``write_doc()`` function is executed in parallel. However,
each subprocess waits for the main process to prepare its chunk. This is a
serious bottleneck that practically inhibits parallel execution when extensions
perform CPU-intensive calculations during post-transformation.

Activating this feature flag changes the logic as follows:

.. code-block:: text

   For each chunk of docnames:
     sub process:  post-transform including doctree-resolved, encapsulated by
                   BuildEnvironment.get_and_resolve_doctree()
     sub process:  Builder.write_doc_serialized()
     sub process:  Builder.write_doc()
     sub process:  pickle and return certain Builder attributes
     main process: merge attributes back to main process builder

This effectively removes the main process bottleneck, as post-transformations
now run in parallel. The expected core logic per doctree of
``post-transform > write_doc_serialized > write_doc`` is still intact. The
approach can, however, lead to issues if extensions write to the environment
or the builder during the post-transformation phase or in
``write_doc_serialized`` and expect that information to be available after the
subprocess has ended. Each subprocess has a completely separate memory space,
which is lost when the process ends. For Sphinx builders as well as custom
builders, specific attributes can be returned to the main process.
See the note below for details.

.. note::
Be sure all active extensions support parallel post-transformation before
using this flag.

Extensions writing to :py:class:`sphinx.environment.BuildEnvironment` and
expecting the data to be available at a later build stage
(e.g. in :event:`build-finished`) are *not* supported.
For the builder object, a mechanism exists to return data to the main process:
the builder class may set the attribute
:py:attr:`sphinx.builders.Builder.post_transform_merge_attr` to define a
list of attributes to be returned to the main process after parallel
post-transformation and writing. This data is passed to the builder method
:py:meth:`sphinx.builders.Builder.merge_builder_post_transform`, which does the
actual merging. If this is not sufficient for any of the active extensions,
the feature flag cannot be used.
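
As an illustration, a custom builder opting into this mechanism might look
roughly like the following sketch (the builder name ``mybuilder`` and the
attribute ``collected_refs`` are hypothetical and not part of Sphinx):

.. code-block:: python

   from typing import Any

   from sphinx.builders import Builder


   class MyBuilder(Builder):
       name = 'mybuilder'
       # attributes listed here are pickled in each subprocess and handed
       # back to the main process after parallel post-transformation
       post_transform_merge_attr = ['collected_refs']

       def init(self) -> None:
           # data gathered during post-transform / write_doc_serialized
           self.collected_refs: dict[str, str] = {}

       def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
           # called once in the main process for each finished subprocess
           self.collected_refs.update(new_attrs['collected_refs'])

Attributes not listed in ``post_transform_merge_attr`` stay in the subprocess
and are lost when it exits.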

.. versionadded:: 7.3

.. note:: This configuration is still experimental.

.. _intl-options:

Options for internationalization
80 changes: 67 additions & 13 deletions sphinx/builders/__init__.py
@@ -75,6 +75,15 @@ class Builder:
supported_remote_images = False
#: The builder supports data URIs or not.
supported_data_uri_images = False
#: Builder attributes that should be returned from parallel
#: post-transformation, to be merged into the main builder in
#: :py:meth:`~sphinx.builders.Builder.merge_builder_post_transform`.
#: Attributes in the list must be pickleable.
#: Restricting the transfer to this subset keeps pickling and sending
#: data over pipes cheap, because only a few builder attributes are
#: commonly needed for merging into the main process builder instance.
post_transform_merge_attr: list[str] = []

def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
self.srcdir = app.srcdir
@@ -125,6 +134,24 @@ def init(self) -> None:
"""
pass

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
"""Give builders the option to merge any post-transform information
coming from a parallel sub-process back to the main process builder.
This can be useful for extensions that consume that information
in the build-finish phase.
The function is called once for each finished subprocess.
Builders that implement this function must also define
the class attribute
:py:attr:`~sphinx.builders.Builder.post_transform_merge_attr` as it defines
which builder attributes shall be returned to the main process for merging.
The default implementation does nothing.
:param new_attrs: the attributes from the parallel subprocess to be
updated in the main builder
"""
pass

def create_template_bridge(self) -> None:
"""Return the template bridge configured."""
if self.config.template_bridge:
@@ -564,7 +591,9 @@ def write(

if self.parallel_ok:
# number of subprocesses is parallel-1 because the main process
# is busy loading doctrees and doing write_doc_serialized()
# is busy loading and post-transforming doctrees and doing write_doc_serialized();
# if the global configuration enable_parallel_post_transform
# is active, the main process does nothing
self._write_parallel(sorted(docnames),
nproc=self.app.parallel - 1)
else:
@@ -581,11 +610,6 @@ def _write_serial(self, docnames: Sequence[str]) -> None:
self.write_doc(docname, doctree)

def _write_parallel(self, docnames: Sequence[str], nproc: int) -> None:
def write_process(docs: list[tuple[str, nodes.document]]) -> None:
self.app.phase = BuildPhase.WRITING
for docname, doctree in docs:
self.write_doc(docname, doctree)

# warm up caches/compile templates using the first document
firstname, docnames = docnames[0], docnames[1:]
self.app.phase = BuildPhase.RESOLVING
@@ -594,25 +618,55 @@ def write_process(docs: list[tuple[str, nodes.document]]) -> None:
self.write_doc_serialized(firstname, doctree)
self.write_doc(firstname, doctree)

def write_process(docs: list[tuple[str, nodes.document]]) -> bytes | None:
self.app.phase = BuildPhase.WRITING
if self.env.config.enable_parallel_post_transform:
# run post-transform, doctree-resolved and write_doc_serialized in parallel
for docname, _ in docs:
doctree = self.env.get_and_resolve_doctree(docname, self)
# write_doc_serialized is assumed to be safe for all Sphinx
# internal builders. Some builders merge information from post-transform
# and write_doc_serialized back to the main process using
# Builder.post_transform_merge_attr and
# Builder.merge_builder_post_transform
self.write_doc_serialized(docname, doctree)
self.write_doc(docname, doctree)
merge_attr = {
attr: getattr(self, attr, None)
for attr in self.post_transform_merge_attr
}
return pickle.dumps(merge_attr, pickle.HIGHEST_PROTOCOL)
for docname, doctree in docs:
# doctree has been post-transformed (incl. write_doc_serialized)
# in the main process, only write_doc is needed here
self.write_doc(docname, doctree)
return None

tasks = ParallelTasks(nproc)
chunks = make_chunks(docnames, nproc)

# create a status_iterator to step progressbar after writing a document
# (see: ``on_chunk_done()`` function)
# (see: ``merge_builder()`` function)
progress = status_iterator(chunks, __('writing output... '), "darkgreen",
len(chunks), self.app.verbosity)

def on_chunk_done(args: list[tuple[str, NoneType]], result: NoneType) -> None:
def merge_builder(args: list[tuple[str, NoneType]], new_attrs_pickle: bytes) -> None:
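# new_attrs_pickle holds the pickled builder attributes returned by
# write_process(); it is only unpickled when parallel post-transform is active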
if self.env.config.enable_parallel_post_transform:
new_attrs: dict[str, Any] = pickle.loads(new_attrs_pickle)
self.merge_builder_post_transform(new_attrs)
next(progress)

self.app.phase = BuildPhase.RESOLVING
for chunk in chunks:
arg = []
arg: list[tuple[str, nodes.document | None]] = []
for docname in chunk:
doctree = self.env.get_and_resolve_doctree(docname, self)
self.write_doc_serialized(docname, doctree)
arg.append((docname, doctree))
tasks.add_task(write_process, arg, on_chunk_done)
if not self.env.config.enable_parallel_post_transform:
doctree = self.env.get_and_resolve_doctree(docname, self)
self.write_doc_serialized(docname, doctree)
arg.append((docname, doctree))
else:
arg.append((docname, None))
tasks.add_task(write_process, arg, merge_builder)

# make sure all threads have finished
tasks.join()
12 changes: 12 additions & 0 deletions sphinx/builders/_epub_base.py
@@ -155,6 +155,7 @@ class EpubBuilder(StandaloneHTMLBuilder):
refuri_re = REFURI_RE
template_dir = ""
doctype = ""
post_transform_merge_attr = ['images']

def init(self) -> None:
super().init()
@@ -167,6 +168,17 @@ def init(self) -> None:
self.use_index = self.get_builder_config('use_index', 'epub')
self.refnodes: list[dict[str, Any]] = []

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
"""Merge images back to the main builder after parallel
post-transformation.
param new_attrs: the attributes from the parallel subprocess to be
udpated in the main builder (self)
"""
for filepath, filename in new_attrs['images'].items():
if filepath not in self.images:
self.images[filepath] = filename

def create_build_info(self) -> BuildInfo:
return BuildInfo(self.config, self.tags, ['html', 'epub'])

28 changes: 27 additions & 1 deletion sphinx/builders/html/__init__.py
@@ -35,7 +35,7 @@
from sphinx.errors import ConfigError, ThemeError
from sphinx.highlighting import PygmentsBridge
from sphinx.locale import _, __
from sphinx.search import js_index
from sphinx.search import IndexBuilder, js_index
from sphinx.theming import HTMLThemeFactory
from sphinx.util import isurl, logging
from sphinx.util.display import progress_message, status_iterator
@@ -181,6 +181,7 @@ class StandaloneHTMLBuilder(Builder):

imgpath: str = ''
domain_indices: list[DOMAIN_INDEX_TYPE] = []
post_transform_merge_attr = ['images', 'indexer']

def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
super().__init__(app, env)
@@ -206,6 +207,7 @@ def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
op = pub.setup_option_parser(output_encoding='unicode', traceback=True)
pub.settings = op.get_default_values()
self._publisher = pub
self.indexer: IndexBuilder | None = None

def init(self) -> None:
self.build_info = self.create_build_info()
@@ -233,6 +235,30 @@ def init(self) -> None:

self.use_index = self.get_builder_config('use_index', 'html')

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
"""Merge images and search indexer back to the main builder after parallel
post-transformation.
param new_attrs: the attributes from the parallel subprocess to be
udpated in the main builder (self)
"""
# handle indexer
if self.indexer is None:
lang = self.config.html_search_language or self.config.language
self.indexer = IndexBuilder(self.env, lang,
self.config.html_search_options,
self.config.html_search_scorer)
self.indexer._all_titles.update(new_attrs['indexer']._all_titles)
self.indexer._filenames.update(new_attrs['indexer']._filenames)
self.indexer._index_entries.update(new_attrs['indexer']._index_entries)
self.indexer._mapping.update(new_attrs['indexer']._mapping)
self.indexer._title_mapping.update(new_attrs['indexer']._title_mapping)
self.indexer._titles.update(new_attrs['indexer']._titles)
# handle images
for filepath, filename in new_attrs['images'].items():
if filepath not in self.images:
self.images[filepath] = filename

def create_build_info(self) -> BuildInfo:
return BuildInfo(self.config, self.tags, ['html'])

13 changes: 13 additions & 0 deletions sphinx/builders/linkcheck.py
@@ -60,12 +60,25 @@ class CheckExternalLinksBuilder(DummyBuilder):
epilog = __('Look for any errors in the above output or in '
'%(outdir)s/output.txt')

post_transform_merge_attr = ['hyperlinks']

def init(self) -> None:
self.broken_hyperlinks = 0
self.hyperlinks: dict[str, Hyperlink] = {}
# set a timeout for non-responding servers
socket.setdefaulttimeout(5.0)

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
"""Merge hyperlinks back to the main builder after parallel
post-transformation.
param new_attrs: the attributes from the parallel subprocess to be
udpated in the main builder (self)
"""
for hyperlink, value in new_attrs['hyperlinks'].items():
if hyperlink not in self.hyperlinks:
self.hyperlinks[hyperlink] = value

def finish(self) -> None:
checker = HyperlinkAvailabilityChecker(self.config)
logger.info('')
1 change: 1 addition & 0 deletions sphinx/config.py
@@ -153,6 +153,7 @@ class Config:
'builders': ['man', 'text']},
'env', []),
'option_emphasise_placeholders': (False, 'env', []),
'enable_parallel_post_transform': (False, 'html', []),
}

def __init__(self, config: dict[str, Any] | None = None,
21 changes: 12 additions & 9 deletions sphinx/environment/__init__.py
@@ -612,6 +612,16 @@ def get_doctree(self, docname: str) -> nodes.document:
def master_doctree(self) -> nodes.document:
return self.get_doctree(self.config.root_doc)

def get_doctree_write(self, docname: str) -> nodes.document:
"""Read the doctree from pickle for the write phase."""
try:
doctree = self._write_doc_doctree_cache.pop(docname)
doctree.settings.env = self
doctree.reporter = LoggingReporter(self.doc2path(docname))
except KeyError:
doctree = self.get_doctree(docname)
return doctree

def get_and_resolve_doctree(
self,
docname: str,
@@ -620,16 +630,9 @@ def get_and_resolve_doctree(
prune_toctrees: bool = True,
includehidden: bool = False,
) -> nodes.document:
"""Read the doctree from the pickle, resolve cross-references and
toctrees and return it.
"""
"""Get the doctree, resolve cross-references and toctrees and return it."""
if doctree is None:
try:
doctree = self._write_doc_doctree_cache.pop(docname)
doctree.settings.env = self
doctree.reporter = LoggingReporter(self.doc2path(docname))
except KeyError:
doctree = self.get_doctree(docname)
doctree = self.get_doctree_write(docname)

# resolve all pending cross-references
self.apply_post_transforms(doctree, docname)
