:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

--------------

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

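As a minimal sketch of :func:`encode` and :func:`decode` (the sample strings
are purely illustrative), the following round-trips text through a text
encoding and then uses a bytes-to-bytes and a text-to-text codec through the
same two functions:

```python
import codecs

# Text encoding: str -> bytes and back again.
data = codecs.encode("café", "utf-8")
text = codecs.decode(data, "utf-8")

# Bytes-to-bytes and text-to-text codecs go through the same functions.
hexed = codecs.encode(b"abc", "hex")
rotated = codecs.encode("Hello", "rot_13")

# With the default 'strict' handler an unencodable character raises
# UnicodeEncodeError, which is a subclass of ValueError.
try:
    codecs.encode("café", "ascii")
except ValueError as exc:
    error_type = type(exc).__name__
else:
    error_type = None

print(data, text, hexed, rotated, error_type)
```
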
The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.

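The lookup described above can be sketched as follows. The alias ``"UTF8"``
and the deliberately unknown name ``"no-such-codec-example"`` are only
illustrative inputs:

```python
import codecs

# Aliases and letter case are normalized by the registry.
info = codecs.lookup("UTF8")
canonical_name = info.name

# The stateless functions return a tuple (output, length consumed).
encoded, consumed = info.encode("héllo")
decoded, used = info.decode(encoded)

# Names that no search function recognizes raise LookupError.
try:
    codecs.lookup("no-such-codec-example")
except LookupError:
    found = False
else:
    found = True

print(canonical_name, encoded, consumed, decoded, used, found)
```

Note that the length consumed is counted in units of the input: characters
when encoding text, bytes when decoding.
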
To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

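A short sketch of these accessors, using in-memory :class:`io.BytesIO`
streams as stand-ins for real binary files (the sample text is illustrative):

```python
import codecs
import io

# Stateless encoder/decoder functions for one codec.
encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")
raw, consumed = encode("né")
text, used = decode(raw)

# getwriter()/getreader() return classes (or factory functions) that wrap
# a binary stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("né")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()

print(raw, consumed, text, used, buffer.getvalue(), round_trip)
```
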
Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object. In case a search function cannot find
   a given encoding, it should return ``None``.

   .. note::

      Search function registration is not currently reversible,
      which may cause problems in some cases, such as unit testing or
      module reloading.

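A minimal sketch of a search function follows. The function name
``search_example`` and the encoding name ``exampleutf8`` are hypothetical,
and the function simply reuses the existing UTF-8 codec rather than
implementing a new one:

```python
import codecs


def search_example(name):
    """Hypothetical search function: expose UTF-8 under an extra name."""
    # The registry passes the requested name in lower case.
    if name == "exampleutf8":
        return codecs.lookup("utf-8")
    # Returning None tells the registry to try the next search function.
    return None


codecs.register(search_example)

encoded = "héllo".encode("ExampleUTF8")
decoded = encoded.decode("exampleutf8")
print(encoded, decoded)
```

The name is kept free of hyphens, spaces and underscores here so that the
sketch does not depend on how the registry normalizes such characters.
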
While the builtin :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=-1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      Underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function.
   It defaults to -1 which means that the default buffer size will be used.


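The following sketch writes and reads a small file with :func:`codecs.open`.
The temporary directory and the file name ``example.txt`` are assumptions
made only for the example; for ordinary text files the built-in :func:`open`
remains the recommended tool:

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file name

    # The file is opened in binary mode underneath, so '\n' is written as-is.
    with codecs.open(path, "w", encoding="utf-16-le") as stream:
        stream.write("hi\n")

    with codecs.open(path, "r", encoding="utf-16-le") as stream:
        text = stream.read()

    # Inspect the raw bytes that ended up on disk.
    with open(path, "rb") as binary:
        raw = binary.read()

print(repr(text), raw)
```
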
.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.


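The transcoding performed by :func:`EncodedFile` can be sketched with an
in-memory stream. Here the caller works with Latin-1 bytes while the
underlying "file" (an :class:`io.BytesIO` object standing in for a real
file) stores UTF-8:

```python
import codecs
import io

backing = io.BytesIO()
wrapped = codecs.EncodedFile(backing, data_encoding="latin-1",
                             file_encoding="utf-8")

# Latin-1 bytes written by the caller are stored as UTF-8.
wrapped.write("é".encode("latin-1"))
stored = backing.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 for the caller.
source = codecs.EncodedFile(io.BytesIO(stored), data_encoding="latin-1",
                            file_encoding="utf-8")
recoded = source.read()

print(stored, recoded)
```
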
.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text :class:`str` objects
   to encode. Therefore it does not support bytes-to-bytes encoders such as
   ``base64_codec``.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept :class:`bytes` objects
   to decode. Therefore it does not support text-to-text encoders such as
   ``rot_13``, although ``rot_13`` may be used equivalently with
   :func:`iterencode`.


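A small sketch of both generators. The byte chunks passed to
:func:`iterdecode` are deliberately split in the middle of a multi-byte
UTF-8 sequence to show that the incremental decoder carries the partial
sequence over to the next chunk:

```python
import codecs

# Encode a sequence of text chunks lazily.
text_chunks = ["Gr", "üß", "e"]
encoded = b"".join(codecs.iterencode(text_chunks, "utf-8"))

# Decode byte chunks whose boundaries fall inside a multi-byte character.
byte_chunks = [b"Gr\xc3", b"\xbc\xc3\x9f", b"e"]
decoded = "".join(codecs.iterdecode(byte_chunks, "utf-8"))

print(encoded, decoded)
```
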
The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.


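One possible use of these constants is to guess an encoding from the first
bytes of some data. The helper ``sniff_bom`` below is hypothetical and not
part of the module; it is only a sketch of the idea:

```python
import codecs
import sys


def sniff_bom(data):
    """Hypothetical helper: return a codec name based on a leading BOM."""
    # UTF-32 is tested before UTF-16 because BOM_UTF32_LE begins with the
    # same two bytes as BOM_UTF16_LE.
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native = codecs.BOM_UTF16_LE
else:
    native = codecs.BOM_UTF16_BE

print(sniff_bom("hi".encode("utf-16")), codecs.BOM_UTF16 == native)
```
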
.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.


.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue        |
|                         | without further notice. Implemented in        |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:

.. index::
   single: ? (question mark); replacement character
   single: \ (backslash); escape sequence
   single: \x; escape sequence
   single: \u; escape sequence
   single: \U; escape sequence
   single: \N; escape sequence

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | marker; Python will use the official          |
|                         | ``U+FFFD`` REPLACEMENT CHARACTER for the      |
|                         | built-in codecs on decoding, and '?' on       |
|                         | encoding. Implemented in                      |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding). Implemented    |
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
|                         | (only for encoding). Implemented in           |
|                         | :func:`namereplace_errors`.                   |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.

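To make the tables above concrete, the following sketch applies several of
the handlers to the same illustrative input. The sample text ``"aü€"`` and
the malformed byte string are arbitrary choices for the example:

```python
import codecs

text = "aü€"

# Encoding to ASCII with each handler in turn.
encoded = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace",
                 "backslashreplace", "namereplace")
}

# 'replace' uses U+FFFD when decoding malformed input.
replaced = b"abc\xff".decode("utf-8", errors="replace")

# 'surrogateescape' carries undecodable bytes through str and back again.
escaped = b"abc\xff".decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

# 'surrogatepass' lets the UTF codecs process a lone surrogate.
lone = "\ud800".encode("utf-8", errors="surrogatepass")

for name, value in encoded.items():
    print(name, value)
print(ascii(replaced), ascii(escaped), restored, lone)
```
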
The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on the original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
   :exc:`UnicodeTranslateError` will be passed to the handler and that the
   replacement from the error handler will be put into the output directly.


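A minimal sketch of a custom handler. The handler name ``example.dash`` and
the function ``dash_errors`` are hypothetical; the handler replaces each
unencodable character with a hyphen and resumes right after the failing
span:

```python
import codecs


def dash_errors(exc):
    """Hypothetical handler: one '-' per unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        replacement = "-" * (exc.end - exc.start)
        # Return the replacement and the position at which to continue.
        return replacement, exc.end
    # Decoding and translating errors are not handled by this sketch.
    raise exc


codecs.register_error("example.dash", dash_errors)

encoded = "aü€b".encode("ascii", errors="example.dash")
print(encoded)
```

Because the replacement is a :class:`str`, the encoder encodes it itself;
returning :class:`bytes` instead would copy the bytes to the output
unchanged.
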
Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.

The following standard error handlers are also made available as module level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling: each encoding or
   decoding error raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling (for :term:`text encodings
   <text encoding>` only): substitutes ``'?'`` for encoding errors
   (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
   character) for decoding errors.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling (for
   :term:`text encodings <text encoding>` only): malformed data is
   replaced by a backslashed escape sequence.

.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by a ``\N{...}`` escape sequence.

   .. versionadded:: 3.5


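The handler functions can also be called directly, which is a convenient way
to see what each one returns. The exception below is constructed by hand
purely for illustration; in normal use the codec creates it:

```python
import codecs

# A hand-made error describing the unencodable 'ü' at index 1 of "aü".
exc = UnicodeEncodeError("ascii", "aü", 1, 2, "ordinal not in range(128)")

# Handlers return a tuple (replacement, position to continue at).
results = {
    "replace": codecs.lookup_error("replace")(exc),
    "ignore": codecs.ignore_errors(exc),
    "xmlcharrefreplace": codecs.xmlcharrefreplace_errors(exc),
    "backslashreplace": codecs.backslashreplace_errors(exc),
    "namereplace": codecs.namereplace_errors(exc),
}

# 'strict' simply re-raises the exception it is given.
try:
    codecs.strict_errors(exc)
except UnicodeEncodeError:
    strict_raised = True
else:
    strict_raised = False

# Unknown handler names raise LookupError.
try:
    codecs.lookup_error("no-such-handler-example")
except LookupError:
    unknown_found = False
else:
    unknown_found = True

for name, value in results.items():
    print(name, value)
```
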
.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface -- for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


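A sketch of the stateless interface. The class ``ExampleCodec`` is
hypothetical and merely delegates to the module's UTF-8 helper functions; a
real codec would implement its own conversion:

```python
import codecs


class ExampleCodec(codecs.Codec):
    """Hypothetical stateless codec that delegates to UTF-8."""

    def encode(self, input, errors="strict"):
        # Returns (bytes, number of characters consumed).
        return codecs.utf_8_encode(input, errors)

    def decode(self, input, errors="strict"):
        # final=True: treat the input as complete, so a truncated
        # sequence is reported as an error.
        return codecs.utf_8_decode(input, errors, True)


codec = ExampleCodec()
encoded, chars_consumed = codec.encode("né")
decoded, bytes_consumed = codec.decode(encoded)

# Zero-length input yields an empty object of the output type.
empty_bytes = codec.encode("")
empty_text = codec.decode(b"")

print(encoded, chars_consumed, decoded, bytes_consumed)
```
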
Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


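The equivalence described above can be checked with a short sketch that
feeds a UTF-8 byte string to an incremental decoder one byte at a time (the
sample text is illustrative):

```python
import codecs

data = "Grüße".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed single bytes; incomplete multi-byte sequences are buffered.
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
# Signal the end of the input so that leftover bytes would be reported.
pieces.append(decoder.decode(b"", final=True))

incremental = "".join(pieces)
stateless = codecs.decode(data, "utf-8")
print(incremental == stateless)
```
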
523.. _incremental-encoder-objects:
524
525IncrementalEncoder Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000526~~~~~~~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000527
Georg Brandl116aa622007-08-15 14:28:22 +0000528The :class:`IncrementalEncoder` class is used for encoding an input in multiple
529steps. It defines the following methods which every incremental encoder must
530define in order to be compatible with the Python codec registry.
531
532
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000533.. class:: IncrementalEncoder(errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000534
535 Constructor for an :class:`IncrementalEncoder` instance.
536
537 All incremental encoders must provide this constructor interface. They are free
538 to add additional keyword arguments, but only the ones defined here are used by
539 the Python codec registry.
540
541 The :class:`IncrementalEncoder` may implement different error handling schemes
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000542 by providing the *errors* keyword argument. See :ref:`error-handlers` for
543 possible values.
Serhiy Storchaka166ebc42014-11-25 13:57:17 +0200544
Georg Brandl116aa622007-08-15 14:28:22 +0000545 The *errors* argument will be assigned to an attribute of the same name.
546 Assigning to this attribute makes it possible to switch between different error
547 handling strategies during the lifetime of the :class:`IncrementalEncoder`
548 object.
549
Georg Brandl116aa622007-08-15 14:28:22 +0000550
Benjamin Petersone41251e2008-04-25 01:59:09 +0000551 .. method:: encode(object[, final])
Georg Brandl116aa622007-08-15 14:28:22 +0000552
Benjamin Petersone41251e2008-04-25 01:59:09 +0000553 Encodes *object* (taking the current state of the encoder into account)
554 and returns the resulting encoded object. If this is the last call to
555 :meth:`encode` *final* must be true (the default is false).
Georg Brandl116aa622007-08-15 14:28:22 +0000556
557
Benjamin Petersone41251e2008-04-25 01:59:09 +0000558 .. method:: reset()
Georg Brandl116aa622007-08-15 14:28:22 +0000559
Victor Stinnere15dce32011-05-30 22:56:00 +0200560 Reset the encoder to the initial state. The output is discarded: call
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000561 ``.encode(object, final=True)``, passing an empty byte or text string
562 if necessary, to reset the encoder and to get the output.
Georg Brandl116aa622007-08-15 14:28:22 +0000563
564
Zhiming Wang30644de2017-09-10 02:09:55 -0400565 .. method:: getstate()
Georg Brandl116aa622007-08-15 14:28:22 +0000566
      Return the current state of the encoder, which must be an integer. The
      implementation should make sure that ``0`` is the most common
      state. (States that are more complicated than integers can be converted
      into an integer by marshaling/pickling the state and encoding the bytes
      of the resulting string into an integer.)
Georg Brandl116aa622007-08-15 14:28:22 +0000572
Georg Brandl116aa622007-08-15 14:28:22 +0000573
Zhiming Wang30644de2017-09-10 02:09:55 -0400574 .. method:: setstate(state)
Georg Brandl116aa622007-08-15 14:28:22 +0000575
Zhiming Wang30644de2017-09-10 02:09:55 -0400576 Set the state of the encoder to *state*. *state* must be an encoder state
577 returned by :meth:`getstate`.
Georg Brandl116aa622007-08-15 14:28:22 +0000578
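As a minimal sketch of the interface described above, the following example
feeds a text in several steps to the incremental encoder that
:func:`getincrementalencoder` returns for UTF-16. The variable names
(``encoder``, ``chunks``, ``pieces``) are illustrative only. The point is that
the encoder keeps state between calls, so the byte order mark is emitted once,
and the final call with ``final=True`` flushes whatever is still pending.

```python
import codecs

# Look up the incremental encoder factory for UTF-16 and instantiate it.
encoder = codecs.getincrementalencoder("utf-16")(errors="strict")

chunks = ["Hello, ", "wörld", "!"]

# Encode piece by piece; the encoder remembers that the BOM was already written.
pieces = [encoder.encode(chunk) for chunk in chunks]

# Signal the end of the input so that any buffered output is flushed.
pieces.append(encoder.encode("", final=True))

encoded = b"".join(pieces)
one_shot = "Hello, wörld!".encode("utf-16")
print(encoded == one_shot)
```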
Georg Brandl116aa622007-08-15 14:28:22 +0000579
580.. _incremental-decoder-objects:
581
582IncrementalDecoder Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000583~~~~~~~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000584
585The :class:`IncrementalDecoder` class is used for decoding an input in multiple
586steps. It defines the following methods which every incremental decoder must
587define in order to be compatible with the Python codec registry.
588
589
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000590.. class:: IncrementalDecoder(errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000591
592 Constructor for an :class:`IncrementalDecoder` instance.
593
594 All incremental decoders must provide this constructor interface. They are free
595 to add additional keyword arguments, but only the ones defined here are used by
596 the Python codec registry.
597
598 The :class:`IncrementalDecoder` may implement different error handling schemes
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000599 by providing the *errors* keyword argument. See :ref:`error-handlers` for
600 possible values.
Georg Brandl116aa622007-08-15 14:28:22 +0000601
602 The *errors* argument will be assigned to an attribute of the same name.
603 Assigning to this attribute makes it possible to switch between different error
Benjamin Peterson3e4f0552008-09-02 00:31:15 +0000604 handling strategies during the lifetime of the :class:`IncrementalDecoder`
Georg Brandl116aa622007-08-15 14:28:22 +0000605 object.
606
Georg Brandl116aa622007-08-15 14:28:22 +0000607
Benjamin Petersone41251e2008-04-25 01:59:09 +0000608 .. method:: decode(object[, final])
Georg Brandl116aa622007-08-15 14:28:22 +0000609
      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true, the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input), it must initiate error handling just like in the
      stateless case (which might raise an exception).
Georg Brandl116aa622007-08-15 14:28:22 +0000617
618
Benjamin Petersone41251e2008-04-25 01:59:09 +0000619 .. method:: reset()
Georg Brandl116aa622007-08-15 14:28:22 +0000620
Benjamin Petersone41251e2008-04-25 01:59:09 +0000621 Reset the decoder to the initial state.
Georg Brandl116aa622007-08-15 14:28:22 +0000622
623
Benjamin Petersone41251e2008-04-25 01:59:09 +0000624 .. method:: getstate()
Georg Brandl116aa622007-08-15 14:28:22 +0000625
      Return the current state of the decoder. This must be a tuple with two
      items: the first must be the buffer containing the still undecoded
      input, and the second must be an integer that can carry additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0``, it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)
Georg Brandl116aa622007-08-15 14:28:22 +0000637
Georg Brandl116aa622007-08-15 14:28:22 +0000638
Benjamin Petersone41251e2008-04-25 01:59:09 +0000639 .. method:: setstate(state)
Georg Brandl116aa622007-08-15 14:28:22 +0000640
Christopher Thorneb5e29592019-04-11 07:09:29 +0100641 Set the state of the decoder to *state*. *state* must be a decoder state
Benjamin Petersone41251e2008-04-25 01:59:09 +0000642 returned by :meth:`getstate`.
643
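The following sketch illustrates the decoder interface with the standard UTF-8
incremental decoder. The input is deliberately split in the middle of a
multi-byte character, so the first call returns nothing and
:meth:`getstate` exposes the buffered bytes. The names ``first``, ``second``
and ``truncated_ok`` are made up for this example.

```python
import codecs

data = "€uro".encode("utf-8")        # the euro sign takes three bytes
first, second = data[:2], data[2:]    # split in the middle of that character

decoder = codecs.getincrementaldecoder("utf-8")()
out1 = decoder.decode(first)               # nothing can be decoded yet
state = decoder.getstate()                 # (buffered input, additional state)
out2 = decoder.decode(second, final=True)  # the rest completes the character

# With final=True an incomplete sequence triggers error handling.
strict = codecs.getincrementaldecoder("utf-8")()
try:
    strict.decode(first, final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(repr(out1), state, repr(out2), truncated_ok)
```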
Georg Brandl116aa622007-08-15 14:28:22 +0000644
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000645Stream Encoding and Decoding
646^^^^^^^^^^^^^^^^^^^^^^^^^^^^
647
648
Georg Brandl116aa622007-08-15 14:28:22 +0000649The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
650working interfaces which can be used to implement new encoding submodules very
651easily. See :mod:`encodings.utf_8` for an example of how this is done.
652
653
654.. _stream-writer-objects:
655
656StreamWriter Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000657~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000658
659The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
660following methods which every stream writer must define in order to be
661compatible with the Python codec registry.
662
663
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000664.. class:: StreamWriter(stream, errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000665
666 Constructor for a :class:`StreamWriter` instance.
667
668 All stream writers must provide this constructor interface. They are free to add
669 additional keyword arguments, but only the ones defined here are used by the
670 Python codec registry.
671
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000672 The *stream* argument must be a file-like object open for writing
673 text or binary data, as appropriate for the specific codec.
Georg Brandl116aa622007-08-15 14:28:22 +0000674
675 The :class:`StreamWriter` may implement different error handling schemes by
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000676 providing the *errors* keyword argument. See :ref:`error-handlers` for
677 the standard error handlers the underlying stream codec may support.
Serhiy Storchaka166ebc42014-11-25 13:57:17 +0200678
Georg Brandl116aa622007-08-15 14:28:22 +0000679 The *errors* argument will be assigned to an attribute of the same name.
680 Assigning to this attribute makes it possible to switch between different error
681 handling strategies during the lifetime of the :class:`StreamWriter` object.
682
Benjamin Petersone41251e2008-04-25 01:59:09 +0000683 .. method:: write(object)
Georg Brandl116aa622007-08-15 14:28:22 +0000684
Benjamin Petersone41251e2008-04-25 01:59:09 +0000685 Writes the object's contents encoded to the stream.
Georg Brandl116aa622007-08-15 14:28:22 +0000686
687
Benjamin Petersone41251e2008-04-25 01:59:09 +0000688 .. method:: writelines(list)
Georg Brandl116aa622007-08-15 14:28:22 +0000689
Benjamin Petersone41251e2008-04-25 01:59:09 +0000690 Writes the concatenated list of strings to the stream (possibly by reusing
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000691 the :meth:`write` method). The standard bytes-to-bytes codecs
692 do not support this method.
Georg Brandl116aa622007-08-15 14:28:22 +0000693
694
Benjamin Petersone41251e2008-04-25 01:59:09 +0000695 .. method:: reset()
Georg Brandl116aa622007-08-15 14:28:22 +0000696
Benjamin Petersone41251e2008-04-25 01:59:09 +0000697 Flushes and resets the codec buffers used for keeping state.
Georg Brandl116aa622007-08-15 14:28:22 +0000698
Benjamin Petersone41251e2008-04-25 01:59:09 +0000699 Calling this method should ensure that the data on the output is put into
700 a clean state that allows appending of new fresh data without having to
701 rescan the whole stream to recover state.
702
Georg Brandl116aa622007-08-15 14:28:22 +0000703
704In addition to the above methods, the :class:`StreamWriter` must also inherit
705all other methods and attributes from the underlying stream.
706
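A short sketch of these methods, assuming an in-memory :class:`io.BytesIO`
object as the underlying stream and the ASCII stream writer obtained through
:func:`getwriter`. It also shows that assigning to the ``errors`` attribute
changes the error handling for later writes, and that methods such as
``tell()`` are passed through to the wrapped stream.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, errors="strict")

writer.write("plain ")
try:
    writer.write("café")          # 'é' cannot be encoded as ASCII
except UnicodeEncodeError:
    writer.errors = "replace"     # switch strategy for subsequent writes
    writer.write("café")

writer.writelines([" au", " lait"])

result = raw.getvalue()
position = writer.tell()          # delegated to the underlying BytesIO
print(result, position)
```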
707
708.. _stream-reader-objects:
709
710StreamReader Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000711~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000712
713The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
714following methods which every stream reader must define in order to be
715compatible with the Python codec registry.
716
717
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000718.. class:: StreamReader(stream, errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000719
720 Constructor for a :class:`StreamReader` instance.
721
722 All stream readers must provide this constructor interface. They are free to add
723 additional keyword arguments, but only the ones defined here are used by the
724 Python codec registry.
725
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000726 The *stream* argument must be a file-like object open for reading
727 text or binary data, as appropriate for the specific codec.
Georg Brandl116aa622007-08-15 14:28:22 +0000728
729 The :class:`StreamReader` may implement different error handling schemes by
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000730 providing the *errors* keyword argument. See :ref:`error-handlers` for
731 the standard error handlers the underlying stream codec may support.
Georg Brandl116aa622007-08-15 14:28:22 +0000732
733 The *errors* argument will be assigned to an attribute of the same name.
734 Assigning to this attribute makes it possible to switch between different error
735 handling strategies during the lifetime of the :class:`StreamReader` object.
736
737 The set of allowed values for the *errors* argument can be extended with
738 :func:`register_error`.
739
740
Benjamin Petersone41251e2008-04-25 01:59:09 +0000741 .. method:: read([size[, chars, [firstline]]])
Georg Brandl116aa622007-08-15 14:28:22 +0000742
Benjamin Petersone41251e2008-04-25 01:59:09 +0000743 Decodes data from the stream and returns the resulting object.
Georg Brandl116aa622007-08-15 14:28:22 +0000744
      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less
      if not enough is available.
Georg Brandl116aa622007-08-15 14:28:22 +0000749
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000750 The *size* argument indicates the approximate maximum
751 number of encoded bytes or code points to read
752 for decoding. The decoder can modify this setting as
Benjamin Petersone41251e2008-04-25 01:59:09 +0000753 appropriate. The default value -1 indicates to read and decode as much as
Miss Islington (bot)c4976a62019-10-01 14:02:29 -0700754 possible. This parameter is intended to
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000755 prevent having to decode huge files in one step.
Georg Brandl116aa622007-08-15 14:28:22 +0000756
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000757 The *firstline* flag indicates that
758 it would be sufficient to only return the first
Benjamin Petersone41251e2008-04-25 01:59:09 +0000759 line, if there are decoding errors on later lines.
Georg Brandl116aa622007-08-15 14:28:22 +0000760
Benjamin Petersone41251e2008-04-25 01:59:09 +0000761 The method should use a greedy read strategy meaning that it should read
762 as much data as is allowed within the definition of the encoding and the
763 given size, e.g. if optional encoding endings or state markers are
764 available on the stream, these should be read too.
Georg Brandl116aa622007-08-15 14:28:22 +0000765
Georg Brandl116aa622007-08-15 14:28:22 +0000766
Benjamin Petersone41251e2008-04-25 01:59:09 +0000767 .. method:: readline([size[, keepends]])
Georg Brandl116aa622007-08-15 14:28:22 +0000768
Benjamin Petersone41251e2008-04-25 01:59:09 +0000769 Read one line from the input stream and return the decoded data.
Georg Brandl116aa622007-08-15 14:28:22 +0000770
Benjamin Petersone41251e2008-04-25 01:59:09 +0000771 *size*, if given, is passed as size argument to the stream's
Serhiy Storchakacca40ff2013-07-11 18:26:13 +0300772 :meth:`read` method.
Georg Brandl116aa622007-08-15 14:28:22 +0000773
      If *keepends* is false, line-endings will be stripped from the lines
      returned.
Georg Brandl116aa622007-08-15 14:28:22 +0000776
Georg Brandl116aa622007-08-15 14:28:22 +0000777
Benjamin Petersone41251e2008-04-25 01:59:09 +0000778 .. method:: readlines([sizehint[, keepends]])
Georg Brandl116aa622007-08-15 14:28:22 +0000779
Benjamin Petersone41251e2008-04-25 01:59:09 +0000780 Read all lines available on the input stream and return them as a list of
781 lines.
Georg Brandl116aa622007-08-15 14:28:22 +0000782
Miss Islington (bot)c4976a62019-10-01 14:02:29 -0700783 Line-endings are implemented using the codec's :meth:`decode` method and
784 are included in the list entries if *keepends* is true.
Georg Brandl116aa622007-08-15 14:28:22 +0000785
Benjamin Petersone41251e2008-04-25 01:59:09 +0000786 *sizehint*, if given, is passed as the *size* argument to the stream's
787 :meth:`read` method.
Georg Brandl116aa622007-08-15 14:28:22 +0000788
789
Benjamin Petersone41251e2008-04-25 01:59:09 +0000790 .. method:: reset()
Georg Brandl116aa622007-08-15 14:28:22 +0000791
Benjamin Petersone41251e2008-04-25 01:59:09 +0000792 Resets the codec buffers used for keeping state.
Georg Brandl116aa622007-08-15 14:28:22 +0000793
Miss Islington (bot)c4976a62019-10-01 14:02:29 -0700794 Note that no stream repositioning should take place. This method is
Benjamin Petersone41251e2008-04-25 01:59:09 +0000795 primarily intended to be able to recover from decoding errors.
796
Georg Brandl116aa622007-08-15 14:28:22 +0000797
798In addition to the above methods, the :class:`StreamReader` must also inherit
799all other methods and attributes from the underlying stream.
800
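The reading methods can be tried out with a sketch like the following, which
assumes a UTF-8 encoded :class:`io.BytesIO` object as the underlying stream
and obtains the reader class through :func:`getreader`. The sample text and
variable names are illustrative.

```python
import codecs
import io

payload = "first line\nsecond líne\n".encode("utf-8")

reader = codecs.getreader("utf-8")(io.BytesIO(payload))
line = reader.readline()                    # line ending is kept by default
stripped = reader.readline(keepends=False)  # line ending is removed

# A fresh reader returns all lines at once, without line endings here.
all_lines = codecs.getreader("utf-8")(io.BytesIO(payload)).readlines(keepends=False)

print(repr(line), repr(stripped), all_lines)
```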
Georg Brandl116aa622007-08-15 14:28:22 +0000801.. _stream-reader-writer:
802
803StreamReaderWriter Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000804~~~~~~~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000805
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000806The :class:`StreamReaderWriter` is a convenience class that allows wrapping
807streams which work in both read and write modes.
Georg Brandl116aa622007-08-15 14:28:22 +0000808
809The design is such that one can use the factory functions returned by the
810:func:`lookup` function to construct the instance.
811
812
Pablo Galindoe184cfd2017-11-10 23:05:12 +0000813.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000814
   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface, respectively. Error
   handling is done in the same way as defined for the stream readers and writers.
819
820:class:`StreamReaderWriter` instances define the combined interfaces of
821:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
822methods and attributes from the underlying stream.
823
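A minimal sketch of constructing such an object from the factories that
:func:`lookup` returns, assuming an :class:`io.BytesIO` object as the wrapped
stream. Text is written through the writer side, the stream is rewound, and
the same text is read back through the reader side.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, errors="strict"
)

stream.write("grüß dich\n")   # encoded to UTF-8 and written to raw
stream.seek(0)                # rewinds the stream and resets the codec state
text = stream.read()          # decoded from UTF-8 again

print(repr(text), raw.getvalue())
```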
824
825.. _stream-recoder-objects:
826
827StreamRecoder Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000828~~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000829
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000830The :class:`StreamRecoder` translates data from one encoding to another,
Georg Brandl116aa622007-08-15 14:28:22 +0000831which is sometimes useful when dealing with different encoding environments.
832
833The design is such that one can use the factory functions returned by the
834:func:`lookup` function to construct the instance.
835
836
Pablo Galindoe184cfd2017-11-10 23:05:12 +0000837.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000838
839 Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000840 *encode* and *decode* work on the frontend — the data visible to
841 code calling :meth:`read` and :meth:`write`, while *Reader* and *Writer*
842 work on the backend — the data in *stream*.
Georg Brandl116aa622007-08-15 14:28:22 +0000843
Miss Islington (bot)c4976a62019-10-01 14:02:29 -0700844 You can use these objects to do transparent transcodings, e.g., from Latin-1
Georg Brandl116aa622007-08-15 14:28:22 +0000845 to UTF-8 and back.
846
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000847 The *stream* argument must be a file-like object.
Georg Brandl116aa622007-08-15 14:28:22 +0000848
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000849 The *encode* and *decode* arguments must
850 adhere to the :class:`Codec` interface. *Reader* and
Georg Brandl116aa622007-08-15 14:28:22 +0000851 *Writer* must be factory functions or classes providing objects of the
852 :class:`StreamReader` and :class:`StreamWriter` interface respectively.
853
Georg Brandl116aa622007-08-15 14:28:22 +0000854 Error handling is done in the same way as defined for the stream readers and
855 writers.
856
Benjamin Petersone41251e2008-04-25 01:59:09 +0000857
Georg Brandl116aa622007-08-15 14:28:22 +0000858:class:`StreamRecoder` instances define the combined interfaces of
859:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
860methods and attributes from the underlying stream.
861
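The roles of the arguments can be illustrated with the following sketch, in
which the frontend speaks Latin-1 and the backend stream holds UTF-8. The
choice of encodings and the names ``backend`` and ``recoder`` are assumptions
made for the example: *encode* and *decode* come from the Latin-1 codec, while
*Reader* and *Writer* come from the UTF-8 codec.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend: what callers read and write
utf8 = codecs.lookup("utf-8")       # backend: what is stored in the stream

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend,
    latin1.encode, latin1.decode,
    utf8.streamreader, utf8.streamwriter,
)

# The caller writes Latin-1 bytes; the stream receives UTF-8 bytes.
recoder.write("café".encode("latin-1"))
stored = backend.getvalue()

# Reading goes the other way: UTF-8 in the stream, Latin-1 for the caller.
backend.seek(0)
returned = recoder.read()

print(stored, returned)
```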
862
863.. _encodings-overview:
864
865Encodings and Unicode
866---------------------
867
Strings are stored internally as sequences of code points in
range ``0x0``--``0x10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.
877
878The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200879the code points 0--255 to the bytes ``0x0``--``0xff``, which means that a string
Georg Brandl3be472b2015-01-14 08:26:30 +0100880object that contains code points above ``U+00FF`` can't be encoded with this
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000881codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
882like the following (although the details of the error message may differ):
883``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
884position 3: ordinal not in range(256)``.
Georg Brandl116aa622007-08-15 14:28:22 +0000885
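The failure described above can be reproduced with a short sketch. The sample
string is chosen so that the offending character sits at position 3. The
attributes ``start`` and ``end`` of the exception identify the span that could
not be encoded, and a different error handler avoids the exception.

```python
text = "abc\u1234"

try:
    text.encode("latin-1")
    failed_span = None
except UnicodeEncodeError as exc:
    # start/end delimit the part of the string that could not be encoded
    failed_span = (exc.start, exc.end)

# With the 'replace' handler the character is substituted instead.
fallback = text.encode("latin-1", errors="replace")

print(failed_span, fallback)
```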
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``--``0xff``. To see how this is done, simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
892
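As a small illustration, the same byte value decodes to different characters
depending on the charmap encoding. The sketch below also peeks at the
``decoding_table`` constant of the :mod:`encodings.cp1252` module. Treat that
attribute name as an implementation detail of the current CPython sources
rather than a documented interface.

```python
import encodings.cp1252

raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")    # EURO SIGN
as_latin1 = raw.decode("latin-1")   # a C1 control character
as_cp437 = raw.decode("cp437")      # LATIN CAPITAL LETTER C WITH CEDILLA

# The mapping table is a string indexed by byte value.
table = encodings.cp1252.decoding_table

print(repr(as_cp1252), repr(as_latin1), repr(as_cp437), len(table))
```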
All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way to store each Unicode
code point is to use four consecutive bytes per code point. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, the bytes have to be swapped, though.

To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte-swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE``, the bytes have to be swapped on decoding.

Unfortunately, the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0, using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless,
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
918
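The byte order variants and the two roles of ``U+FEFF`` can be observed with
a sketch like this one, which uses the ``BOM_UTF16_BE`` and ``BOM_UTF16_LE``
constants of this module. The variable names are illustrative.

```python
import codecs

# Four bytes per code point, in either byte order.
big = "A".encode("utf-32-be")
little = "A".encode("utf-32-le")

# The generic UTF-16 codec uses the BOM to pick the byte order and drops it.
from_be = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16")

# A codec with explicit byte order treats U+FEFF as an ordinary character.
kept = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16-be")

print(big, little, from_be, from_le, repr(kept))
```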
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):
926
+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
938
939The least significant bit of the Unicode character is the rightmost x bit.
940
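The bit layout in the table can be written out as a small function. This is
only a sketch for illustration, not how the built-in codec is implemented: the
helper ``utf8_encode_code_point`` is a made-up name, and unlike the real codec
it does not reject surrogate code points.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point following the marker/payload layout above."""
    if cp < 0x80:
        return bytes([cp])                          # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),            # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx
                      0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    raise ValueError("code point out of range")


samples = ["A", "é", "€", "\U0001F600"]   # one sample per row of the table
matches = [utf8_encode_code_point(ord(ch)) == ch.encode("utf-8") for ch in samples]
print(matches)
```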
As UTF-8 is an 8-bit encoding, no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.
Georg Brandl116aa622007-08-15 14:28:22 +0000944
Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However, that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.
Georg Brandl116aa622007-08-15 14:28:22 +0000969
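The difference between ``utf-8`` and ``utf-8-sig`` can be checked with a few
lines. The sketch below uses the ``BOM_UTF8`` constant of this module. The
sample text is arbitrary.

```python
import codecs

signed = "hi".encode("utf-8-sig")          # signature followed by the text

with_sig_codec = signed.decode("utf-8-sig")  # signature is skipped
with_plain_codec = signed.decode("utf-8")    # signature stays as U+FEFF

# Input without a signature is accepted by utf-8-sig as well.
unsigned = b"hi".decode("utf-8-sig")

print(signed, repr(with_sig_codec), repr(with_plain_codec), repr(unsigned))
```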
970
971.. _standard-encodings:
972
973Standard Encodings
974------------------
975
976Python comes with a number of codecs built-in, either implemented as C functions
977or with dictionaries as mapping tables. The following table lists the codecs by
978name, together with a few common aliases, and the languages for which the
979encoding is likely used. Neither the list of aliases nor the list of languages
980is meant to be exhaustive. Notice that spelling alternatives that only differ in
Georg Brandla6053b42009-09-01 08:11:14 +0000981case or use a hyphen instead of an underscore are also valid aliases; therefore,
982e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
Georg Brandl116aa622007-08-15 14:28:22 +0000983
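A quick way to check that several spellings refer to the same codec is to
compare the ``name`` attribute of the objects returned by :func:`lookup`. The
spellings below are examples only. An unknown name raises
:exc:`LookupError`.

```python
import codecs

utf8_names = {codecs.lookup(alias).name for alias in ("utf-8", "UTF_8", "Utf8")}
latin_names = {codecs.lookup(alias).name for alias in ("latin-1", "latin_1", "ISO-8859-1")}

try:
    codecs.lookup("no-such-codec")
    unknown_found = True
except LookupError:
    unknown_found = False

print(utf8_names, latin_names, unknown_found)
```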
Alexander Belopolsky1d521462011-02-25 19:19:57 +0000984.. impl-detail::
985
986 Some common encodings can bypass the codecs lookup machinery to
Miss Islington (bot)c4976a62019-10-01 14:02:29 -0700987 improve performance. These optimization opportunities are only
Ville Skyttä297fd872017-12-15 12:19:23 +0200988 recognized by CPython for a limited set of (case insensitive)
989 aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs
990 (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and
991 the same using underscores instead of dashes. Using alternative
992 aliases for these encodings may result in slower execution.
993
994 .. versionchanged:: 3.6
995 Optimization opportunity recognized for us-ascii.
Alexander Belopolsky1d521462011-02-25 19:19:57 +0000996
Georg Brandl116aa622007-08-15 14:28:22 +0000997Many of the character sets support the same languages. They vary in individual
998characters (e.g. whether the EURO SIGN is supported or not), and in the
999assignment of characters to code positions. For the European languages in
1000particular, the following variants typically exist:
1001
1002* an ISO 8859 codeset
1003
Martin Panter4c359642016-05-08 13:53:41 +00001004* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
Georg Brandl116aa622007-08-15 14:28:22 +00001005 but replaces control characters with additional graphic characters
1006
1007* an IBM EBCDIC code page
1008
1009* an IBM PC code page, which is ASCII compatible
1010
Georg Brandl44ea77b2013-03-28 13:28:44 +01001011.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
1012
Georg Brandl116aa622007-08-15 14:28:22 +00001013+-----------------+--------------------------------+--------------------------------+
1014| Codec | Aliases | Languages |
1015+=================+================================+================================+
1016| ascii | 646, us-ascii | English |
1017+-----------------+--------------------------------+--------------------------------+
1018| big5 | big5-tw, csbig5 | Traditional Chinese |
1019+-----------------+--------------------------------+--------------------------------+
1020| big5hkscs | big5-hkscs, hkscs | Traditional Chinese |
1021+-----------------+--------------------------------+--------------------------------+
1022| cp037 | IBM037, IBM039 | English |
1023+-----------------+--------------------------------+--------------------------------+
R David Murray47d083c2014-03-07 21:00:34 -05001024| cp273 | 273, IBM273, csIBM273 | German |
1025| | | |
1026| | | .. versionadded:: 3.4 |
1027+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001028| cp424 | EBCDIC-CP-HE, IBM424 | Hebrew |
1029+-----------------+--------------------------------+--------------------------------+
1030| cp437 | 437, IBM437 | English |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_t          |                                | Tajik                          |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8, cp65001         | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

.. versionchanged:: 3.8
   ``cp65001`` is now an alias to ``utf_8``.

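As a brief illustration of the table and the notes above, the following
sketch resolves a few aliases through :func:`codecs.lookup` and shows the
encoder rejecting a lone surrogate. The variable names are illustrative only,
and the example assumes Python 3.4 or later.

.. code-block:: python

   import codecs

   # Each alias resolves to the codec named in the first column of its row.
   # Lookup ignores case and treats hyphens and underscores alike.
   latin_aliases = ["latin1", "L1", "ISO-8859-1", "cp819"]
   resolved = {codecs.lookup(alias).name for alias in latin_aliases}
   print(resolved)

   windows_name = codecs.lookup("windows-1252").name
   print(windows_name)

   # Since 3.4 the UTF-16 and UTF-32 encoders refuse lone surrogates
   # under the default ``strict`` error handler.
   surrogate_error = None
   try:
       "\ud800".encode("utf-16")
   except UnicodeEncodeError as exc:
       surrogate_error = exc.reason
   print(surrogate_error)

Note that ``CodecInfo.name`` reports the codec's own canonical spelling, which
may differ in punctuation from the name shown in the table.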

Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated meaning describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| idna               |         | Implement :rfc:`3490`,    |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | ansi,   | Windows only: Encode the  |
|                    | dbcs    | operand according to the  |
|                    |         | ANSI codepage (CP_ACP).   |
+--------------------+---------+---------------------------+
| oem                |         | Windows only: Encode the  |
|                    |         | operand according to the  |
|                    |         | OEM codepage (CP_OEMCP).  |
|                    |         |                           |
|                    |         | .. versionadded:: 3.6     |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5.   |
+--------------------+---------+---------------------------+
| punycode           |         | Implement :rfc:`3492`.    |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decode       |
|                    |         | from Latin-1 source code. |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+

.. versionchanged:: 3.8
   The ``unicode_internal`` codec has been removed.

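The difference between the two escape codecs is easiest to see side by side.
The following is a minimal sketch; the sample string and variable names are
illustrative only.

.. code-block:: python

   text = 'caf\u00e9 \u20ac "q"\n'

   # unicode_escape: non-ASCII characters and the newline become escape
   # sequences, while the double quotes are left untouched.
   escaped = text.encode("unicode_escape")
   print(escaped)

   # raw_unicode_escape: code points in the Latin-1 range are written as
   # single bytes, anything else as \uXXXX; the newline is not escaped.
   raw = text.encode("raw_unicode_escape")
   print(raw)

   # punycode encodes a single label without the IDNA "xn--" prefix.
   label = "b\u00fccher".encode("punycode")
   print(label)

   # The undefined codec raises for every conversion.
   undefined_failed = False
   try:
       "x".encode("undefined")
   except UnicodeError:
       undefined_failed = True

Decoding ``escaped`` with ``unicode_escape`` returns the original string.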

.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).


.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
| Codec                | Aliases          | Meaning                      | Encoder / decoder            |
+======================+==================+==============================+==============================+
| base64_codec [#b64]_ | base64, base_64  | Convert the operand to       | :meth:`base64.encodebytes` / |
|                      |                  | multiline MIME base64 (the   | :meth:`base64.decodebytes`   |
|                      |                  | result always includes a     |                              |
|                      |                  | trailing ``'\n'``).          |                              |
|                      |                  |                              |                              |
|                      |                  | .. versionchanged:: 3.4      |                              |
|                      |                  |    accepts any               |                              |
|                      |                  |    :term:`bytes-like object` |                              |
|                      |                  |    as input for encoding and |                              |
|                      |                  |    decoding                  |                              |
+----------------------+------------------+------------------------------+------------------------------+
| bz2_codec            | bz2              | Compress the operand using   | :meth:`bz2.compress` /       |
|                      |                  | bz2.                         | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
| hex_codec            | hex              | Convert the operand to       | :meth:`binascii.b2a_hex` /   |
|                      |                  | hexadecimal                  | :meth:`binascii.a2b_hex`     |
|                      |                  | representation, with two     |                              |
|                      |                  | digits per byte.             |                              |
+----------------------+------------------+------------------------------+------------------------------+
| quopri_codec         | quopri,          | Convert the operand to MIME  | :meth:`quopri.encode` with   |
|                      | quotedprintable, | quoted printable.            | ``quotetabs=True`` /         |
|                      | quoted_printable |                              | :meth:`quopri.decode`        |
+----------------------+------------------+------------------------------+------------------------------+
| uu_codec             | uu               | Convert the operand using    | :meth:`uu.encode` /          |
|                      |                  | uuencode.                    | :meth:`uu.decode`            |
+----------------------+------------------+------------------------------+------------------------------+
| zlib_codec           | zip, zlib        | Compress the operand using   | :meth:`zlib.compress` /      |
|                      |                  | zlib.                        | :meth:`zlib.decompress`      |
+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.

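Because these codecs map bytes to bytes, they are reached through
:func:`codecs.encode` and :func:`codecs.decode` rather than the ``bytes``
methods. A minimal sketch, with illustrative variable names:

.. code-block:: python

   import codecs

   data = b"hello world"

   hexed = codecs.encode(data, "hex")      # two hex digits per input byte
   b64 = codecs.encode(data, "base64")     # ends with a newline
   packed = codecs.encode(data, "zlib")    # zlib-compressed bytes

   # Each transform is reversible with codecs.decode().
   restored = codecs.decode(packed, "zlib")

   # bytes.decode() only accepts text encodings, so it rejects "hex".
   rejected = False
   try:
       b"6869".decode("hex")
   except LookupError:
       rejected = True
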

.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| rot_13             | rot13   | Return the Caesar-cypher  |
|                    |         | encryption of the         |
|                    |         | operand.                  |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.

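As with the binary transforms, the text transform is only available through
the module-level functions. A short sketch, with illustrative names:

.. code-block:: python

   import codecs

   secret = codecs.encode("Hello, World!", "rot_13")
   print(secret)

   # Applying the transform twice restores the original text.
   plain = codecs.decode(secret, "rot13")

   # str.encode() must return bytes, so it refuses the str-to-str codec.
   rejected = False
   try:
       "Hello".encode("rot13")
   except LookupError:
       rejected = True
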

:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.

Python supports this conversion in several ways: the ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in :rfc:`section 3.1 of RFC 3490 <3490#section-3.1>`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.

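The codec round trip described above can be sketched as follows, using the
domain name from the text. The variable names are illustrative only.

.. code-block:: python

   name = "www.Alliancefran\u00e7aise.nu"

   # Encoding splits the name into labels and converts each non-ASCII
   # label to its ACE form; nameprep lower-cases that label on the way.
   ace = name.encode("idna")
   print(ace)

   # Decoding converts ACE labels back to Unicode. The result is the
   # normalized (lower-case) form, not the original spelling.
   decoded = ace.decode("idna")
   print(decoded)
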

.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.

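The three functions operate on a single label rather than a full domain name.
A small sketch, assuming the label ``Bücher`` as sample input:

.. code-block:: python

   from encodings.idna import nameprep, ToASCII, ToUnicode

   label = "B\u00fccher"

   prepped = nameprep(label)        # case-folded and normalized str
   ace_label = ToASCII(label)       # bytes carrying the "xn--" ACE prefix
   unicode_label = ToUnicode(ace_label)

   # Labels without the ACE prefix are returned unchanged as str.
   plain_label = ToUnicode("example")
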

:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

.. availability:: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data will be skipped.
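
The behaviour described above can be checked with a short sketch. The
variable names are illustrative only.

.. code-block:: python

   import codecs

   # Encoding prepends the UTF-8 encoded BOM.
   encoded = "abc".encode("utf-8-sig")
   has_bom = encoded.startswith(codecs.BOM_UTF8)

   # Decoding skips the BOM if present and accepts data without one.
   with_bom = encoded.decode("utf-8-sig")
   without_bom = b"abc".decode("utf-8-sig")

   # The plain utf-8 codec keeps the BOM as U+FEFF instead.
   kept = encoded.decode("utf-8")

   # The stateful encoder writes the BOM only on the first call.
   encoder = codecs.getincrementalencoder("utf-8-sig")()
   stream = encoder.encode("a") + encoder.encode("b", final=True)
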