bpo-42236: Enhance init and encoding documentation (GH-23109)
Enhance the documentation of Python startup, the filesystem encoding
and error handler, and the locale encoding. Add a new "Python UTF-8
Mode" section.
* Add "locale encoding" and "filesystem encoding and error handler"
to the glossary
* Remove documentation from Include/cpython/initconfig.h: move it to
Doc/c-api/init_config.rst.
* Doc/c-api/init_config.rst:
* Document command line options and environment variables
* Document default values.
* Add a new "Python UTF-8 Mode" section in Doc/library/os.rst.
* Add warnings to Py_DecodeLocale() and Py_EncodeLocale() docs.
* Document how Python selects the filesystem encoding and error
handler at a single place: PyConfig.filesystem_encoding and
PyConfig.filesystem_errors.
* PyConfig: move the orig_argv member to the right place.
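
For reviewers, the values that the updated pages describe can be inspected
from Python itself. The following is only a minimal sketch, not part of the
change: ``describe_encodings`` is a hypothetical helper name, and it uses
only standard-library calls that the touched documentation already mentions.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings chosen at Python startup.

    Hypothetical helper for illustration; not part of CPython.
    """
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        # Passing False means this call does not invoke setlocale() itself.
        "locale_encoding": locale.getpreferredencoding(False),
        "utf8_mode": bool(sys.flags.utf8_mode),
        "dev_mode": bool(sys.flags.dev_mode),
    }


settings = describe_encodings()
for key, value in settings.items():
    print(f"{key}: {value}")

# os.fsdecode() and os.fsencode() apply the filesystem encoding and
# error handler. An ASCII-only name is used here because the filesystem
# encoding is documented to decode all bytes below 128.
name = os.fsdecode(b"example.txt")
roundtrip = os.fsencode(name)
```

The printed values depend on the platform, the locale, and on whether
``-X utf8`` or ``PYTHONUTF8`` was used, so no expected output is shown.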
diff --git a/Doc/library/devmode.rst b/Doc/library/devmode.rst
index d5a40cd..e6ed594 100644
--- a/Doc/library/devmode.rst
+++ b/Doc/library/devmode.rst
@@ -93,6 +93,9 @@
option from removing :keyword:`assert` statements nor from setting
:const:`__debug__` to ``False``.
+The Python Development Mode can only be enabled at Python startup. Its
+value can be read from :data:`sys.flags.dev_mode <sys.flags>`.
+
.. versionchanged:: 3.8
The :class:`io.IOBase` destructor now logs ``close()`` exceptions.
diff --git a/Doc/library/exceptions.rst b/Doc/library/exceptions.rst
index df2cda9..8fb25a5 100644
--- a/Doc/library/exceptions.rst
+++ b/Doc/library/exceptions.rst
@@ -313,8 +313,8 @@
.. versionchanged:: 3.4
The :attr:`filename` attribute is now the original file name passed to
the function, instead of the name encoded to or decoded from the
- filesystem encoding. Also, the *filename2* constructor argument and
- attribute was added.
+ :term:`filesystem encoding and error handler`. Also, the *filename2*
+ constructor argument and attribute were added.
.. exception:: OverflowError
diff --git a/Doc/library/locale.rst b/Doc/library/locale.rst
index 678148a..0a77be4 100644
--- a/Doc/library/locale.rst
+++ b/Doc/library/locale.rst
@@ -315,21 +315,25 @@
.. function:: getpreferredencoding(do_setlocale=True)
- Return the encoding used for text data, according to user preferences. User
- preferences are expressed differently on different systems, and might not be
- available programmatically on some systems, so this function only returns a
- guess.
+ Return the :term:`locale encoding` used for text data, according to user
+ preferences. User preferences are expressed differently on different
+ systems, and might not be available programmatically on some systems, so
+ this function only returns a guess.
- On some systems, it is necessary to invoke :func:`setlocale` to obtain the user
- preferences, so this function is not thread-safe. If invoking setlocale is not
- necessary or desired, *do_setlocale* should be set to ``False``.
+ On some systems, it is necessary to invoke :func:`setlocale` to obtain the
+ user preferences, so this function is not thread-safe. If invoking setlocale
+ is not necessary or desired, *do_setlocale* should be set to ``False``.
- On Android or in the UTF-8 mode (:option:`-X` ``utf8`` option), always
- return ``'UTF-8'``, the locale and the *do_setlocale* argument are ignored.
+ On Android or if the :ref:`Python UTF-8 Mode <utf8-mode>` is enabled, always
+ return ``'UTF-8'``; the :term:`locale encoding` and the *do_setlocale*
+ argument are ignored.
+
+ The :ref:`Python preinitialization <c-preinit>` configures the LC_CTYPE
+ locale. See also the :term:`filesystem encoding and error handler`.
.. versionchanged:: 3.7
- The function now always returns ``UTF-8`` on Android or if the UTF-8 mode
- is enabled.
+ The function now always returns ``UTF-8`` on Android or if the
+ :ref:`Python UTF-8 Mode <utf8-mode>` is enabled.
.. function:: normalize(localename)
diff --git a/Doc/library/os.rst b/Doc/library/os.rst
index 718d981..f9f35b3 100644
--- a/Doc/library/os.rst
+++ b/Doc/library/os.rst
@@ -68,8 +68,13 @@
In Python, file names, command line arguments, and environment variables are
represented using the string type. On some systems, decoding these strings to
and from bytes is necessary before passing them to the operating system. Python
-uses the file system encoding to perform this conversion (see
-:func:`sys.getfilesystemencoding`).
+uses the :term:`filesystem encoding and error handler` to perform this
+conversion (see :func:`sys.getfilesystemencoding`).
+
+The :term:`filesystem encoding and error handler` are configured at Python
+startup by the :c:func:`PyConfig_Read` function: see
+:c:member:`~PyConfig.filesystem_encoding` and
+:c:member:`~PyConfig.filesystem_errors` members of :c:type:`PyConfig`.
.. versionchanged:: 3.1
On some systems, conversion using the file system encoding may fail. In this
@@ -79,9 +84,70 @@
original byte on encoding.
-The file system encoding must guarantee to successfully decode all bytes
-below 128. If the file system encoding fails to provide this guarantee, API
-functions may raise UnicodeErrors.
+The :term:`file system encoding <filesystem encoding and error handler>` must
+guarantee to successfully decode all bytes below 128. If the file system
+encoding fails to provide this guarantee, API functions can raise
+:exc:`UnicodeError`.
+
+See also the :term:`locale encoding`.
+
+
+.. _utf8-mode:
+
+Python UTF-8 Mode
+-----------------
+
+.. versionadded:: 3.7
+ See :pep:`540` for more details.
+
+The Python UTF-8 Mode ignores the :term:`locale encoding` and forces the usage
+of the UTF-8 encoding:
+
+* Use UTF-8 as the :term:`filesystem encoding <filesystem encoding and error
+ handler>`.
+* :func:`sys.getfilesystemencoding()` returns ``'UTF-8'``.
+* :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the *do_setlocale*
+ argument has no effect).
+* :data:`sys.stdin`, :data:`sys.stdout`, and :data:`sys.stderr` all use
+ UTF-8 as their text encoding, with the ``surrogateescape``
+ :ref:`error handler <error-handlers>` being enabled for :data:`sys.stdin`
+ and :data:`sys.stdout` (:data:`sys.stderr` continues to use
+ ``backslashreplace`` as it does in the default locale-aware mode)
+
+Note that the standard stream settings in UTF-8 mode can be overridden by
+:envvar:`PYTHONIOENCODING` (just as they can be in the default locale-aware
+mode).
+
+As a consequence of the changes in those lower level APIs, other higher
+level APIs also exhibit different default behaviours:
+
+* Command line arguments, environment variables and filenames are decoded
+ to text using the UTF-8 encoding.
+* :func:`os.fsdecode()` and :func:`os.fsencode()` use the UTF-8 encoding.
+* :func:`open()`, :func:`io.open()`, and :func:`codecs.open()` use the UTF-8
+ encoding by default. However, they still use the strict error handler by
+ default so that attempting to open a binary file in text mode is likely
+ to raise an exception rather than producing nonsense data.
+
+The :ref:`Python UTF-8 Mode <utf8-mode>` is enabled if the LC_CTYPE locale is
+``C`` or ``POSIX`` at Python startup (see the :c:func:`PyConfig_Read`
+function).
+
+It can be enabled or disabled using the :option:`-X utf8 <-X>` command line
+option and the :envvar:`PYTHONUTF8` environment variable.
+
+If the :envvar:`PYTHONUTF8` environment variable is not set at all, then the
+interpreter defaults to using the current locale settings, *unless* the current
+locale is identified as a legacy ASCII-based locale (as described for
+:envvar:`PYTHONCOERCECLOCALE`), and locale coercion is either disabled or
+fails. In such legacy locales, the interpreter will default to enabling UTF-8
+mode unless explicitly instructed not to do so.
+
+The Python UTF-8 Mode can only be enabled at Python startup. Its value
+can be read from :data:`sys.flags.utf8_mode <sys.flags>`.
+
+See also the :ref:`UTF-8 mode on Windows <win-utf8-mode>`
+and the :term:`filesystem encoding and error handler`.
.. _os-procinfo:
@@ -165,9 +231,9 @@
.. function:: fsencode(filename)
- Encode :term:`path-like <path-like object>` *filename* to the filesystem
- encoding with ``'surrogateescape'`` error handler, or ``'strict'`` on
- Windows; return :class:`bytes` unchanged.
+ Encode :term:`path-like <path-like object>` *filename* to the
+ :term:`filesystem encoding and error handler`; return :class:`bytes`
+ unchanged.
:func:`fsdecode` is the reverse function.
@@ -181,8 +247,8 @@
.. function:: fsdecode(filename)
Decode the :term:`path-like <path-like object>` *filename* from the
- filesystem encoding with ``'surrogateescape'`` error handler, or ``'strict'``
- on Windows; return :class:`str` unchanged.
+ :term:`filesystem encoding and error handler`; return :class:`str`
+ unchanged.
:func:`fsencode` is the reverse function.
@@ -3246,7 +3312,7 @@
Removes the extended filesystem attribute *attribute* from *path*.
*attribute* should be bytes or str (directly or indirectly through the
:class:`PathLike` interface). If it is a string, it is encoded
- with the filesystem encoding.
+ with the :term:`filesystem encoding and error handler`.
This function can support :ref:`specifying a file descriptor <path_fd>` and
:ref:`not following symlinks <follow_symlinks>`.
@@ -3262,7 +3328,7 @@
Set the extended filesystem attribute *attribute* on *path* to *value*.
*attribute* must be a bytes or str with no embedded NULs (directly or
indirectly through the :class:`PathLike` interface). If it is a str,
- it is encoded with the filesystem encoding. *flags* may be
+ it is encoded with the :term:`filesystem encoding and error handler`. *flags* may be
:data:`XATTR_REPLACE` or :data:`XATTR_CREATE`. If :data:`XATTR_REPLACE` is
given and the attribute does not exist, ``EEXISTS`` will be raised.
If :data:`XATTR_CREATE` is given and the attribute already exists, the
diff --git a/Doc/library/sys.rst b/Doc/library/sys.rst
index f0acfcf..0f13adc 100644
--- a/Doc/library/sys.rst
+++ b/Doc/library/sys.rst
@@ -627,21 +627,24 @@
.. function:: getfilesystemencoding()
- Return the name of the encoding used to convert between Unicode
- filenames and bytes filenames.
+ Get the :term:`filesystem encoding <filesystem encoding and error handler>`:
+ the encoding used with the :term:`filesystem error handler <filesystem
+ encoding and error handler>` to convert between Unicode filenames and bytes
+ filenames. The filesystem error handler is returned from
+ :func:`getfilesystemencodeerrors`.
For best compatibility, str should be used for filenames in all cases,
although representing filenames as bytes is also supported. Functions
accepting or returning filenames should support either str or bytes and
internally convert to the system's preferred representation.
- This encoding is always ASCII-compatible.
-
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
the correct encoding and errors mode are used.
- The filesystem encoding is initialized from
- :c:member:`PyConfig.filesystem_encoding`.
+ The :term:`filesystem encoding and error handler` are configured at Python
+ startup by the :c:func:`PyConfig_Read` function: see
+ :c:member:`~PyConfig.filesystem_encoding` and
+ :c:member:`~PyConfig.filesystem_errors` members of :c:type:`PyConfig`.
.. versionchanged:: 3.2
:func:`getfilesystemencoding` result cannot be ``None`` anymore.
@@ -651,20 +654,25 @@
and :func:`_enablelegacywindowsfsencoding` for more information.
.. versionchanged:: 3.7
- Return 'utf-8' in the UTF-8 mode.
+ Return ``'utf-8'`` if the :ref:`Python UTF-8 Mode <utf8-mode>` is
+ enabled.
.. function:: getfilesystemencodeerrors()
- Return the name of the error mode used to convert between Unicode filenames
- and bytes filenames. The encoding name is returned from
+ Get the :term:`filesystem error handler <filesystem encoding and error
+ handler>`: the error handler used with the :term:`filesystem encoding
+ <filesystem encoding and error handler>` to convert between Unicode
+ filenames and bytes filenames. The filesystem encoding is returned from
:func:`getfilesystemencoding`.
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
the correct encoding and errors mode are used.
- The filesystem error handler is initialized from
- :c:member:`PyConfig.filesystem_errors`.
+ The :term:`filesystem encoding and error handler` are configured at Python
+ startup by the :c:func:`PyConfig_Read` function: see
+ :c:member:`~PyConfig.filesystem_encoding` and
+ :c:member:`~PyConfig.filesystem_errors` members of :c:type:`PyConfig`.
.. versionadded:: 3.6
@@ -1457,8 +1465,9 @@
.. function:: _enablelegacywindowsfsencoding()
- Changes the default filesystem encoding and errors mode to 'mbcs' and
- 'replace' respectively, for consistency with versions of Python prior to 3.6.
+ Changes the :term:`filesystem encoding and error handler` to 'mbcs' and
+ 'replace' respectively, for consistency with versions of Python prior to
+ 3.6.
This is equivalent to defining the :envvar:`PYTHONLEGACYWINDOWSFSENCODING`
environment variable before launching Python.
@@ -1488,9 +1497,8 @@
returned by the :func:`open` function. Their parameters are chosen as
follows:
- * The character encoding is platform-dependent. Non-Windows
- platforms use the locale encoding (see
- :meth:`locale.getpreferredencoding()`).
+ * The encoding and error handler are initialized from
+ :c:member:`PyConfig.stdio_encoding` and :c:member:`PyConfig.stdio_errors`.
On Windows, UTF-8 is used for the console device. Non-character
devices such as disk files and pipes use the system locale
@@ -1498,7 +1506,7 @@
devices such as NUL (i.e. where ``isatty()`` returns ``True``) use the
value of the console input and output codepages at startup,
respectively for stdin and stdout/stderr. This defaults to the
- system locale encoding if the process is not initially attached
+ system :term:`locale encoding` if the process is not initially attached
to a console.
The special behaviour of the console can be overridden