bpo-42236: Use UTF-8 encoding if nl_langinfo(CODESET) fails (GH-23086)
If the nl_langinfo(CODESET) function returns an empty string, Python
now uses UTF-8 as the filesystem encoding.
In May 2010 (commit b744ba1d14c5487576c95d0311e357b707600b47), I
modified Python to log a warning and use UTF-8 as the filesystem
encoding (instead of None) if nl_langinfo(CODESET) returns an empty
string.
In August 2020 (commit 94908bbc1503df830d1d615e7b57744ae1b41079), I
modified Python startup to fail with a fatal error and a specific
error message if nl_langinfo(CODESET) returns an empty string. The
intent was to prevent guessing the encoding and also to investigate
user configurations where this case happens.
In 10 years (2010 to 2020), I saw zero user reports about the error
message related to nl_langinfo(CODESET) returning an empty string.
Today, UTF-8 has become the de facto standard, and it is safe to
assume that the user expects UTF-8. For example,
nl_langinfo(CODESET) can return an empty string on macOS if the
LC_CTYPE locale is not supported, and UTF-8 is the default encoding
on macOS.
While this change is unlikely to affect anyone in practice, it
should make UTF-8 lovers happy ;-)
Also rewrite the documentation explaining how Python selects the
filesystem encoding and error handler.
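As a rough illustration of the new behavior (not the actual C
implementation), the fallback can be sketched in Python. The helper name
pick_locale_encoding() is hypothetical, and the codeset string is passed
in as an argument instead of being read from nl_langinfo(CODESET), so the
sketch behaves the same on every platform. Normalizing the name with
codecs.lookup() is an assumption that stands in for the codec name
normalization described in the initconfig.h comment.

```python
import codecs


def pick_locale_encoding(codeset: str) -> str:
    """Hypothetical helper: map an nl_langinfo(CODESET) result to a codec name."""
    if not codeset:
        # nl_langinfo(CODESET) returned an empty string: the locale gave
        # no usable answer, so assume UTF-8 instead of failing at startup.
        return "utf-8"
    # Replace the locale's name with the Python codec name,
    # for example "ANSI_X3.4-1968" becomes "ascii".
    return codecs.lookup(codeset).name


if __name__ == "__main__":
    for value in ("", "ANSI_X3.4-1968", "UTF-8"):
        print(repr(value), "->", pick_locale_encoding(value))
```

On a real system, the value returned by sys.getfilesystemencoding()
shows which encoding Python actually selected.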
diff --git a/Include/cpython/initconfig.h b/Include/cpython/initconfig.h
index bbe8387..dd5ca61 100644
--- a/Include/cpython/initconfig.h
+++ b/Include/cpython/initconfig.h
@@ -156,36 +156,13 @@ typedef struct {
/* Python filesystem encoding and error handler:
sys.getfilesystemencoding() and sys.getfilesystemencodeerrors().
- Default encoding and error handler:
+ The Doc/c-api/init_config.rst documentation explains how Python selects
+ the filesystem encoding and error handler.
- * if Py_SetStandardStreamEncoding() has been called: they have the
- highest priority;
- * PYTHONIOENCODING environment variable;
- * The UTF-8 Mode uses UTF-8/surrogateescape;
- * If Python forces the usage of the ASCII encoding (ex: C locale
- or POSIX locale on FreeBSD or HP-UX), use ASCII/surrogateescape;
- * locale encoding: ANSI code page on Windows, UTF-8 on Android and
- VxWorks, LC_CTYPE locale encoding on other platforms;
- * On Windows, "surrogateescape" error handler;
- * "surrogateescape" error handler if the LC_CTYPE locale is "C" or "POSIX";
- * "surrogateescape" error handler if the LC_CTYPE locale has been coerced
- (PEP 538);
- * "strict" error handler.
-
- Supported error handlers: "strict", "surrogateescape" and
- "surrogatepass". The surrogatepass error handler is only supported
- if Py_DecodeLocale() and Py_EncodeLocale() use directly the UTF-8 codec;
- it's only used on Windows.
-
- initfsencoding() updates the encoding to the Python codec name.
- For example, "ANSI_X3.4-1968" is replaced with "ascii".
-
- On Windows, sys._enablelegacywindowsfsencoding() sets the
- encoding/errors to mbcs/replace at runtime.
-
-
- See Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors.
- */
+ _PyUnicode_InitEncodings() updates the encoding name to the Python codec
+ name. For example, "ANSI_X3.4-1968" is replaced with "ascii". It also
+ sets Py_FileSystemDefaultEncoding to filesystem_encoding and
+ sets Py_FileSystemDefaultEncodeErrors to filesystem_errors. */
wchar_t *filesystem_encoding;
wchar_t *filesystem_errors;
diff --git a/Include/internal/pycore_fileutils.h b/Include/internal/pycore_fileutils.h
index 1ab554f..9281f4e 100644
--- a/Include/internal/pycore_fileutils.h
+++ b/Include/internal/pycore_fileutils.h
@@ -50,7 +50,7 @@ PyAPI_FUNC(int) _Py_GetLocaleconvNumeric(
PyAPI_FUNC(void) _Py_closerange(int first, int last);
-PyAPI_FUNC(wchar_t*) _Py_GetLocaleEncoding(const char **errmsg);
+PyAPI_FUNC(wchar_t*) _Py_GetLocaleEncoding(void);
PyAPI_FUNC(PyObject*) _Py_GetLocaleEncodingObject(void);
#ifdef __cplusplus
diff --git a/Include/pyport.h b/Include/pyport.h
index 7137006..79fc7c4 100644
--- a/Include/pyport.h
+++ b/Include/pyport.h
@@ -841,12 +841,16 @@ extern _invalid_parameter_handler _Py_silent_invalid_parameter_handler;
#endif
#if defined(__ANDROID__) || defined(__VXWORKS__)
- /* Ignore the locale encoding: force UTF-8 */
+ // Use UTF-8 as the locale encoding, ignore the LC_CTYPE locale.
+ // See _Py_GetLocaleEncoding(), PyUnicode_DecodeLocale()
+ // and PyUnicode_EncodeLocale().
# define _Py_FORCE_UTF8_LOCALE
#endif
#if defined(_Py_FORCE_UTF8_LOCALE) || defined(__APPLE__)
- /* Use UTF-8 as filesystem encoding */
+ // Use UTF-8 as the filesystem encoding.
+ // See PyUnicode_DecodeFSDefaultAndSize(), PyUnicode_EncodeFSDefault(),
+ // Py_DecodeLocale() and Py_EncodeLocale().
# define _Py_FORCE_UTF8_FS_ENCODING
#endif