closes bpo-31650: PEP 552 (Deterministic pycs) implementation (#4575)
Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.
While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:
- The core changes to importlib to understand how to read, validate, and
regenerate hash-based pycs.
- Support for generating hash-based pycs in py_compile and compileall.
- Modifications to our siphash implementation to support passing a custom
key. We then expose it to importlib through _imp.
- Updates to all places in the interpreter, standard library, and tests that
manually generate or parse pyc files to grok the new format.
- Support in the interpreter command line code for long options like
--check-hash-based-pycs.
- Tests and documentation for all of the above.
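As a rough illustration of the py_compile support listed above, here is a minimal sketch that generates a checked hash-based pyc and inspects its header. The helper names (compile_hash_based, read_pyc_header) and the example.py file are hypothetical. The 16-byte header layout (magic, little-endian flags word, 8-byte hash) is taken from PEP 552, not from any public parsing API.

```python
import importlib.util
import os
import py_compile
import tempfile


def compile_hash_based(source_path, checked=True):
    """Compile source_path to a hash-based pyc and return the pyc path."""
    mode = (py_compile.PycInvalidationMode.CHECKED_HASH if checked
            else py_compile.PycInvalidationMode.UNCHECKED_HASH)
    return py_compile.compile(source_path, doraise=True,
                              invalidation_mode=mode)


def read_pyc_header(pyc_path):
    """Split the 16-byte pyc header into (magic, flags, payload).

    Assumed layout per PEP 552: 4-byte magic number, 4-byte little-endian
    flags word, then 8 bytes that hold the source hash for hash-based pycs.
    """
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    magic = header[:4]
    flags = int.from_bytes(header[4:8], "little")
    payload = header[8:16]
    return magic, flags, payload


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")  # hypothetical module
    with open(src, "w") as f:
        f.write("x = 1\n")

    pyc = compile_hash_based(src, checked=True)
    magic, flags, payload = read_pyc_header(pyc)

    with open(src, "rb") as f:
        source_bytes = f.read()

    # Bit 0 of flags marks a hash-based pyc; bit 1 asks for source checking.
    is_hash_based = bool(flags & 0b01)
    check_source = bool(flags & 0b10)
    hash_matches = payload == importlib.util.source_hash(source_bytes)
```

Passing checked=False should yield a pyc whose flags have only bit 0 set, which the import system then trusts without rehashing the source unless --check-hash-based-pycs says otherwise.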
diff --git a/Doc/glossary.rst b/Doc/glossary.rst
index ba4d300..e875e1f 100644
--- a/Doc/glossary.rst
+++ b/Doc/glossary.rst
@@ -458,6 +458,12 @@
is believed that overcoming this performance issue would make the
implementation much more complicated and therefore costlier to maintain.
+
+ hash-based pyc
+ A bytecode cache file that uses the hash rather than the last-modified
+ time of the corresponding source file to determine its validity. See
+ :ref:`pyc-invalidation`.
+
hashable
An object is *hashable* if it has a hash value which never changes during
its lifetime (it needs a :meth:`__hash__` method), and can be compared to
diff --git a/Doc/library/compileall.rst b/Doc/library/compileall.rst
index c1af02b..7b3963d 100644
--- a/Doc/library/compileall.rst
+++ b/Doc/library/compileall.rst
@@ -83,6 +83,16 @@
If ``0`` is used, then the result of :func:`os.cpu_count()`
will be used.
+.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash]
+
+ Control how the generated pycs will be invalidated at runtime. The default
+ setting, ``timestamp``, means that ``.pyc`` files with the source timestamp
+ and size embedded will be generated. The ``checked-hash`` and
+ ``unchecked-hash`` values cause hash-based pycs to be generated. Hash-based
+ pycs embed a hash of the source file contents rather than a timestamp. See
+ :ref:`pyc-invalidation` for more information on how Python validates bytecode
+ cache files at runtime.
+
.. versionchanged:: 3.2
Added the ``-i``, ``-b`` and ``-h`` options.
@@ -91,6 +101,9 @@
was changed to a multilevel value. ``-b`` will always produce a
byte-code file ending in ``.pyc``, never ``.pyo``.
+.. versionchanged:: 3.7
+ Added the ``--invalidation-mode`` parameter.
+
There is no command-line option to control the optimization level used by the
:func:`compile` function, because the Python interpreter itself already
@@ -99,7 +112,7 @@
Public functions
----------------
-.. function:: compile_dir(dir, maxlevels=10, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1)
+.. function:: compile_dir(dir, maxlevels=10, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)
Recursively descend the directory tree named by *dir*, compiling all :file:`.py`
files along the way. Return a true value if all the files compiled successfully,
@@ -140,6 +153,10 @@
then sequential compilation will be used as a fallback. If *workers* is
lower than ``0``, a :exc:`ValueError` will be raised.
+ *invalidation_mode* should be a member of the
+ :class:`py_compile.PycInvalidationMode` enum and controls how the generated
+ pycs are invalidated at runtime.
+
.. versionchanged:: 3.2
Added the *legacy* and *optimize* parameter.
@@ -156,7 +173,10 @@
.. versionchanged:: 3.6
Accepts a :term:`path-like object`.
-.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1)
+ .. versionchanged:: 3.7
+ The *invalidation_mode* parameter was added.
+
+.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)
Compile the file with path *fullname*. Return a true value if the file
compiled successfully, and a false value otherwise.
@@ -184,6 +204,10 @@
*optimize* specifies the optimization level for the compiler. It is passed to
the built-in :func:`compile` function.
+ *invalidation_mode* should be a member of the
+ :class:`py_compile.PycInvalidationMode` enum and controls how the generated
+ pycs are invalidated at runtime.
+
.. versionadded:: 3.2
.. versionchanged:: 3.5
@@ -193,7 +217,10 @@
The *legacy* parameter only writes out ``.pyc`` files, not ``.pyo`` files
no matter what the value of *optimize* is.
-.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1)
+ .. versionchanged:: 3.7
+ The *invalidation_mode* parameter was added.
+
+.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)
Byte-compile all the :file:`.py` files found along ``sys.path``. Return a
true value if all the files compiled successfully, and a false value otherwise.
@@ -213,6 +240,9 @@
The *legacy* parameter only writes out ``.pyc`` files, not ``.pyo`` files
no matter what the value of *optimize* is.
+ .. versionchanged:: 3.7
+ The *invalidation_mode* parameter was added.
+
To force a recompile of all the :file:`.py` files in the :file:`Lib/`
subdirectory and all its subdirectories::
diff --git a/Doc/library/importlib.rst b/Doc/library/importlib.rst
index 3d350e8..3cafb41 100644
--- a/Doc/library/importlib.rst
+++ b/Doc/library/importlib.rst
@@ -67,6 +67,9 @@
:pep:`489`
Multi-phase extension module initialization
+ :pep:`552`
+ Deterministic pycs
+
:pep:`3120`
Using UTF-8 as the Default Source Encoding
@@ -1327,6 +1330,14 @@
.. versionchanged:: 3.6
Accepts a :term:`path-like object`.
+.. function:: source_hash(source_bytes)
+
+ Return the hash of *source_bytes* as bytes. A hash-based ``.pyc`` file embeds
+ the :func:`source_hash` of the corresponding source file's contents in its
+ header.
+
+ .. versionadded:: 3.7
+
.. class:: LazyLoader(loader)
A class which postpones the execution of the loader of a module until the
diff --git a/Doc/library/py_compile.rst b/Doc/library/py_compile.rst
index 0af8fb1..a4f06de 100644
--- a/Doc/library/py_compile.rst
+++ b/Doc/library/py_compile.rst
@@ -27,7 +27,7 @@
Exception raised when an error occurs while attempting to compile the file.
-.. function:: compile(file, cfile=None, dfile=None, doraise=False, optimize=-1)
+.. function:: compile(file, cfile=None, dfile=None, doraise=False, optimize=-1, invalidation_mode=PycInvalidationMode.TIMESTAMP)
Compile a source file to byte-code and write out the byte-code cache file.
The source code is loaded from the file named *file*. The byte-code is
@@ -53,6 +53,10 @@
:func:`compile` function. The default of ``-1`` selects the optimization
level of the current interpreter.
+ *invalidation_mode* should be a member of the :class:`PycInvalidationMode`
+ enum and controls how the generated ``.pyc`` files are invalidated at
+ runtime.
+
.. versionchanged:: 3.2
Changed default value of *cfile* to be :PEP:`3147`-compliant. Previous
default was *file* + ``'c'`` (``'o'`` if optimization was enabled).
@@ -65,6 +69,41 @@
caveat that :exc:`FileExistsError` is raised if *cfile* is a symlink or
non-regular file.
+ .. versionchanged:: 3.7
+ The *invalidation_mode* parameter was added as specified in :pep:`552`.
+
+
+.. class:: PycInvalidationMode
+
+ An enumeration of possible methods the interpreter can use to determine
+ whether a bytecode file is up to date with a source file. The ``.pyc`` file
+ indicates the desired invalidation mode in its header. See
+ :ref:`pyc-invalidation` for more information on how Python invalidates
+ ``.pyc`` files at runtime.
+
+ .. versionadded:: 3.7
+
+ .. attribute:: TIMESTAMP
+
+ The ``.pyc`` file includes the timestamp and size of the source file,
+ which Python will compare against the metadata of the source file at
+ runtime to determine if the ``.pyc`` file needs to be regenerated.
+
+ .. attribute:: CHECKED_HASH
+
+ The ``.pyc`` file includes a hash of the source file content, which Python
+ will compare against the source at runtime to determine if the ``.pyc``
+ file needs to be regenerated.
+
+ .. attribute:: UNCHECKED_HASH
+
+ Like :attr:`CHECKED_HASH`, the ``.pyc`` file includes a hash of the source
+ file content. However, Python will at runtime assume the ``.pyc`` file is
+ up to date and not validate the ``.pyc`` against the source file at all.
+
+ This option is useful when the ``.pyc`` files are kept up to date by some
+ system external to Python like a build system.
+
.. function:: main(args=None)
diff --git a/Doc/reference/import.rst b/Doc/reference/import.rst
index 881e0ae..45d4172 100644
--- a/Doc/reference/import.rst
+++ b/Doc/reference/import.rst
@@ -675,6 +675,33 @@
:meth:`~importlib.abc.Loader.module_repr` method, if defined, before
trying either approach described above. However, the method is deprecated.
+.. _pyc-invalidation:
+
+Cached bytecode invalidation
+----------------------------
+
+Before Python loads cached bytecode from a ``.pyc`` file, it checks whether the
+cache is up-to-date with the source ``.py`` file. By default, Python does this
+by storing the source's last-modified timestamp and size in the cache file when
+writing it. At runtime, the import system then validates the cache file by
+checking the stored metadata in the cache file against the source's
+metadata.
+
+Python also supports "hash-based" cache files, which store a hash of the source
+file's contents rather than its metadata. There are two variants of hash-based
+``.pyc`` files: checked and unchecked. For checked hash-based ``.pyc`` files,
+Python validates the cache file by hashing the source file and comparing the
+resulting hash with the hash in the cache file. If a checked hash-based cache
+file is found to be invalid, Python regenerates it and writes a new checked
+hash-based cache file. For unchecked hash-based ``.pyc`` files, Python simply
+assumes the cache file is valid if it exists. The validation behavior of
+hash-based ``.pyc`` files may be overridden with the
+:option:`--check-hash-based-pycs` flag.
+
+.. versionchanged:: 3.7
+ Added hash-based ``.pyc`` files. Previously, Python only supported
+ timestamp-based invalidation of bytecode caches.
+
The Path Based Finder
=====================
diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst
index d110ae3..716bc82 100644
--- a/Doc/using/cmdline.rst
+++ b/Doc/using/cmdline.rst
@@ -210,6 +210,20 @@
import of source modules. See also :envvar:`PYTHONDONTWRITEBYTECODE`.
+.. cmdoption:: --check-hash-based-pycs default|always|never
+
+ Control the validation behavior of hash-based ``.pyc`` files. See
+ :ref:`pyc-invalidation`. When set to ``default``, checked and unchecked
+ hash-based bytecode cache files are validated according to their default
+ semantics. When set to ``always``, all hash-based ``.pyc`` files, whether
+ checked or unchecked, are validated against their corresponding source
+ file. When set to ``never``, hash-based ``.pyc`` files are not validated
+ against their corresponding source files.
+
+ The semantics of timestamp-based ``.pyc`` files are unaffected by this
+ option.
+
+
.. cmdoption:: -d
Turn on parser debugging output (for expert only, depending on compilation
diff --git a/Doc/whatsnew/3.7.rst b/Doc/whatsnew/3.7.rst
index 9363730..3487662 100644
--- a/Doc/whatsnew/3.7.rst
+++ b/Doc/whatsnew/3.7.rst
@@ -197,6 +197,33 @@
See :option:`-X` ``dev`` for the details.
+Hash-based pycs
+---------------
+
+Python has traditionally checked the up-to-dateness of bytecode cache files
+(i.e., ``.pyc`` files) by comparing the source metadata (last-modified timestamp
+and size) with source metadata saved in the cache file header when it was
+generated. While effective, this invalidation method has its drawbacks. When
+filesystem timestamps are too coarse, Python can miss source updates, leading to
+user confusion. Additionally, having a timestamp in the cache file is
+problematic for `build reproducibility <https://reproducible-builds.org/>`_ and
+content-based build systems.
+
+:pep:`552` extends the pyc format to allow the hash of the source file to be
+used for invalidation instead of the source timestamp. Such ``.pyc`` files are
+called "hash-based". By default, Python still uses timestamp-based invalidation
+and does not generate hash-based ``.pyc`` files at runtime. Hash-based ``.pyc``
+files may be generated with :mod:`py_compile` or :mod:`compileall`.
+
+Hash-based ``.pyc`` files come in two variants: checked and unchecked. Python
+validates checked hash-based ``.pyc`` files against the corresponding source
+files at runtime but doesn't do so for unchecked hash-based pycs. Unchecked
+hash-based ``.pyc`` files are a useful performance optimization for environments
+where a system external to Python (e.g., the build system) is responsible for
+keeping ``.pyc`` files up-to-date.
+
+See :ref:`pyc-invalidation` for more information.
+
Other Language Changes
======================