:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*. The default
   encoding is ``utf-8``.

   *Errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*. The default
   encoding is ``utf-8``.

   *Errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

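As a brief illustration (a sketch only; the sample string and the choice of
the ``ascii`` and ``hex`` codecs are assumptions made for demonstration),
:func:`encode` and :func:`decode` can be used for a round trip and for
observing the ``'strict'`` failure mode:

```python
import codecs

text = "caf\u00e9"

# Text encodings turn str into bytes; utf-8 is used when no encoding is given.
data = codecs.encode(text)
round_tripped = codecs.decode(data, "utf-8")

# Bytes-to-bytes codecs are reachable through the same two functions.
hexed = codecs.encode(b"abc", "hex")

# Under the default 'strict' handler an unencodable character raises
# UnicodeEncodeError, which is a subclass of ValueError.
try:
    codecs.encode(text, "ascii")
except ValueError as exc:
    error_type = type(exc).__name__
```

Catching :exc:`ValueError` rather than the Unicode-specific subclass keeps the
handler valid for codecs that are not text encodings.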
The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.

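The following sketch shows what a lookup returns; the spelling ``"UTF8"`` and
the sample data are arbitrary choices for illustration:

```python
import codecs

# Case differences and common aliases are resolved during lookup.
info = codecs.lookup("UTF8")
canonical_name = info.name

# The stateless functions return (output, length consumed) tuples.
encoded, consumed = info.encode("hi")
decoded, used = info.decode(b"hi")

# Unknown names raise LookupError.
try:
    codecs.lookup("no-such-codec-for-demo")
except LookupError:
    unknown_raises = True
```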
To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

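A minimal sketch of the stream accessors, assuming an in-memory
:class:`io.BytesIO` object stands in for a real binary file:

```python
import codecs
import io

# getwriter() returns a class (or factory); call it with a binary stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("na\u00efve")
raw = buffer.getvalue()

# getreader() is the mirror image for decoding while reading.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
text = reader.read()
```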
Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object. In case a search function cannot find
   a given encoding, it should return ``None``.

   .. note::

      Search function registration is not currently reversible,
      which may cause problems in some cases, such as unit testing or
      module reloading.

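The sketch below registers a toy codec. The codec name ``demo_reversed`` and
the helper names are hypothetical, invented for this example; the codec simply
reverses the text before encoding it as UTF-8. Registration affects the whole
interpreter.

```python
import codecs


def _demo_encode(text, errors="strict"):
    # Stateless encoder: return (output, number of input items consumed).
    return text[::-1].encode("utf-8", errors), len(text)


def _demo_decode(data, errors="strict"):
    # Stateless decoder: accept any bytes-like object.
    raw = bytes(data)
    return raw.decode("utf-8", errors)[::-1], len(raw)


def demo_search(name):
    # Search functions receive the normalized, lower-case encoding name
    # and return None for names they do not recognize.
    if name == "demo_reversed":
        return codecs.CodecInfo(_demo_encode, _demo_decode, name="demo_reversed")
    return None


codecs.register(demo_search)

encoded = codecs.encode("abc", "demo_reversed")
decoded = codecs.decode(encoded, "demo_reversed")
```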
While the built-in :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      Underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.


.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.

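The transcoding direction of :func:`EncodedFile` is easy to get backwards, so
here is a small sketch. It assumes in-memory :class:`io.BytesIO` objects in
place of real files, with UTF-8 as the data encoding and Latin-1 as the file
encoding:

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes land in the underlying file.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, data_encoding="utf-8", file_encoding="latin-1")
wrapped.write("caf\u00e9".encode("utf-8"))
stored = raw.getvalue()

# Reading: Latin-1 bytes in the file come back out as UTF-8 bytes.
source = io.BytesIO(b"caf\xe9")
recoder = codecs.EncodedFile(source, data_encoding="utf-8", file_encoding="latin-1")
read_back = recoder.read()
```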

.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

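For example (a sketch with made-up input chunks), :func:`iterdecode` copes with
a multi-byte sequence that is split across two chunks, because the incremental
decoder buffers the incomplete part:

```python
import codecs

# The two bytes of U+00E9 arrive in separate chunks.
chunks = [b"caf", b"\xc3", b"\xa9"]
decoded_pieces = list(codecs.iterdecode(chunks, "utf-8"))

encoded_pieces = list(codecs.iterencode(["caf", "\u00e9"], "utf-8"))
```

Chunks that produce no output yet (here the lone ``b"\xc3"``) yield nothing, so
the number of output pieces may be smaller than the number of input chunks.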

The module also provides the following constants which are useful for reading
and writing to platform-dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.

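One possible use of these constants is stripping a UTF-8 signature from raw
bytes. The helper name ``strip_utf8_bom`` below is hypothetical and not part of
the module:

```python
import codecs
import sys


def strip_utf8_bom(data):
    """Return *data* without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    expected_native = codecs.BOM_UTF16_LE
else:
    expected_native = codecs.BOM_UTF16_BE

stripped = strip_utf8_bom(b"\xef\xbb\xbfhello")
```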

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.


.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue        |
|                         | without further notice. Implemented in        |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | marker; Python will use the official          |
|                         | ``U+FFFD`` REPLACEMENT CHARACTER for the      |
|                         | built-in codecs on decoding, and '?' on       |
|                         | encoding. Implemented in                      |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding). Implemented    |
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding). Implemented in           |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

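The effect of each handler listed above can be seen on a small sample. This is
an illustrative sketch; the sample string and bytes are arbitrary:

```python
text = "h\u00e9llo"          # contains one character outside ASCII

replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# On decoding, 'replace' substitutes U+FFFD for undecodable bytes.
decoded_replaced = b"\xffabc".decode("utf-8", "replace")

# 'surrogateescape' smuggles undecodable bytes through as lone surrogates
# and restores them when the same handler is used for encoding.
smuggled = b"\xffabc".decode("utf-8", "surrogateescape")
restored = smuggled.encode("utf-8", "surrogateescape")

# 'surrogatepass' lets the UTF codecs encode a lone surrogate.
lone_surrogate = "\ud800".encode("utf-8", "surrogatepass")
```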
The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
   :exc:`UnicodeTranslateError` will be passed to the handler and that the
   replacement from the error handler will be put into the output directly.

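A minimal sketch of a custom handler follows. The handler name
``demo_underscore`` and the function name are hypothetical, chosen only for
this example; the handler substitutes one underscore for the offending span and
resumes after it:

```python
import codecs


def underscore_errors(exc):
    # Return (replacement, position at which to resume) for encode and
    # decode errors; re-raise anything else unchanged.
    if isinstance(exc, (UnicodeEncodeError, UnicodeDecodeError)):
        return ("_", exc.end)
    raise exc


codecs.register_error("demo_underscore", underscore_errors)

encoded = "h\u00e9llo".encode("ascii", "demo_underscore")
decoded = b"a\xffb".decode("ascii", "demo_underscore")
```

When encoding, the string replacement is itself encoded by the codec, so it
must be representable in the target encoding.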

Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.

The following standard error handlers are also made available as module level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling: each encoding or
   decoding error raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling (for :term:`text encodings
   <text encoding>` only): substitutes ``'?'`` for encoding errors
   (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
   character) for decoding errors.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by a backslashed escape sequence.

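The standard handlers can also be called directly with an exception instance,
which makes their return values visible. In this sketch the
:exc:`UnicodeEncodeError` is constructed by hand purely for illustration:

```python
import codecs

# Describes the single unencodable character at index 1 of the input.
exc = UnicodeEncodeError("ascii", "h\u00e9llo", 1, 2, "ordinal not in range(128)")

replace_result = codecs.replace_errors(exc)
ignore_result = codecs.ignore_errors(exc)
xml_result = codecs.xmlcharrefreplace_errors(exc)
backslash_result = codecs.backslashreplace_errors(exc)

# lookup_error() gives access to the same behaviour by name.
looked_up_result = codecs.lookup_error("replace")(exc)

# The 'strict' handler raises the exception it is given.
try:
    codecs.strict_errors(exc)
except UnicodeEncodeError:
    strict_raised = True
```

Each handler returns a ``(replacement, resume position)`` tuple, the same shape
expected from handlers passed to :func:`register_error`.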

.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface -- for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

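The tuple-returning interface can be observed through the functions returned by
:func:`getencoder` and :func:`getdecoder`. This is a sketch using the built-in
UTF-8 codec with arbitrary sample data:

```python
import codecs

encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# "Length consumed" counts input items: characters for encode, bytes for decode.
encoded, chars_consumed = encode("caf\u00e9")
decoded, bytes_consumed = decode(encoded)

# Zero-length input yields an empty object of the output type.
empty_encoded = encode("")
empty_decoded = decode(b"")

# Objects exposing the buffer interface are accepted when decoding.
from_view = decode(memoryview(b"abc"))
```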

Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.

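The equivalence described above can be checked with a short sketch, here using
the UTF-8 codec and an input deliberately split in the middle of a two-byte
sequence:

```python
import codecs

parts = [b"caf\xc3", b"\xa9"]

decoder = codecs.getincrementaldecoder("utf-8")()
outputs = [decoder.decode(part) for part in parts]
# Signal the end of input so the decoder flushes (or reports) leftovers.
outputs.append(decoder.decode(b"", final=True))
joined = "".join(outputs)

# The stateless decoder applied to the joined input gives the same text.
stateless, _ = codecs.getdecoder("utf-8")(b"".join(parts))
```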

.. _incremental-encoder-objects:

IncrementalEncoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder(errors='strict')

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode` *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode(object, final=True)``, passing an empty byte or text string
      if necessary, to reset the encoder and to get the output.


.. method:: IncrementalEncoder.getstate()

   Return the current state of the encoder which must be an integer. The
   implementation should make sure that ``0`` is the most common state. (States
   that are more complicated than integers can be converted into an integer by
   marshaling/pickling the state and encoding the bytes of the resulting string
   into an integer).


.. method:: IncrementalEncoder.setstate(state)

   Set the state of the encoder to *state*. *state* must be an encoder state
   returned by :meth:`getstate`.

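Encoder state is easiest to see with a codec that writes a byte order mark. The
sketch below assumes the built-in ``utf-16`` codec, whose incremental encoder
emits the BOM only on its first call:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("a")                 # BOM plus one code unit
second = encoder.encode("b", final=True)    # no BOM: the encoder remembers

# After reset() the encoder is back in its initial state and writes a BOM again.
encoder.reset()
after_reset = encoder.encode("c", final=True)
```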
Georg Brandl116aa622007-08-15 14:28:22 +0000541
542.. _incremental-decoder-objects:
543
544IncrementalDecoder Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000545~~~~~~~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000546
547The :class:`IncrementalDecoder` class is used for decoding an input in multiple
548steps. It defines the following methods which every incremental decoder must
549define in order to be compatible with the Python codec registry.
550
551
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000552.. class:: IncrementalDecoder(errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000553
554 Constructor for an :class:`IncrementalDecoder` instance.
555
556 All incremental decoders must provide this constructor interface. They are free
557 to add additional keyword arguments, but only the ones defined here are used by
558 the Python codec registry.
559
560 The :class:`IncrementalDecoder` may implement different error handling schemes
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000561 by providing the *errors* keyword argument. See :ref:`error-handlers` for
562 possible values.
Georg Brandl116aa622007-08-15 14:28:22 +0000563
564 The *errors* argument will be assigned to an attribute of the same name.
565 Assigning to this attribute makes it possible to switch between different error
Benjamin Peterson3e4f0552008-09-02 00:31:15 +0000566 handling strategies during the lifetime of the :class:`IncrementalDecoder`
Georg Brandl116aa622007-08-15 14:28:22 +0000567 object.
568
Georg Brandl116aa622007-08-15 14:28:22 +0000569
Benjamin Petersone41251e2008-04-25 01:59:09 +0000570 .. method:: decode(object[, final])
Georg Brandl116aa622007-08-15 14:28:22 +0000571
Benjamin Petersone41251e2008-04-25 01:59:09 +0000572 Decodes *object* (taking the current state of the decoder into account)
573 and returns the resulting decoded object. If this is the last call to
574 :meth:`decode` *final* must be true (the default is false). If *final* is
575 true the decoder must decode the input completely and must flush all
576 buffers. If this isn't possible (e.g. because of incomplete byte sequences
577 at the end of the input) it must initiate error handling just like in the
578 stateless case (which might raise an exception).
Georg Brandl116aa622007-08-15 14:28:22 +0000579
580
Benjamin Petersone41251e2008-04-25 01:59:09 +0000581 .. method:: reset()
Georg Brandl116aa622007-08-15 14:28:22 +0000582
Benjamin Petersone41251e2008-04-25 01:59:09 +0000583 Reset the decoder to the initial state.
Georg Brandl116aa622007-08-15 14:28:22 +0000584
585
   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items: the first must be the buffer containing the still undecoded
      input, and the second must be an integer that can carry additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0``, it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.

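The decoder methods described above can be sketched as follows. This is only
an illustration, assuming the built-in ``utf-8`` codec; the variable names are
made up for the example. The input is split in the middle of a multi-byte
sequence so that the decoder has to buffer part of it.

```python
import codecs

# "a" followed by U+20AC (EURO SIGN), which takes three bytes in UTF-8.
data = "a\u20ac".encode("utf-8")              # b'a\xe2\x82\xac'
decoder = codecs.getincrementaldecoder("utf-8")()

# The first chunk ends inside the three-byte sequence.
first = decoder.decode(data[:2])              # the lone lead byte stays buffered
saved_state = decoder.getstate()              # (buffered input, additional state info)
second = decoder.decode(data[2:], final=True)

# Restoring the saved state allows the same tail to be decoded again.
decoder.setstate(saved_state)
replayed = decoder.decode(data[2:], final=True)
```

Passing ``final=True`` on the last call is what makes a truncated sequence an
error instead of leaving it silently buffered.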

Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream, errors='strict')

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for writing
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method). The standard bytes-to-bytes codecs
      do not support this method.


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.

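As a minimal sketch of the writer interface described above, the following
example obtains stream writer classes through :func:`getwriter` and wraps
in-memory binary streams. The choice of the ``utf-8`` and ``ascii`` codecs and
all variable names are assumptions made for the illustration.

```python
import codecs
import io

# Wrap a binary stream so that text written to it is encoded as UTF-8.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, errors="strict")
writer.write("caf\u00e9\n")
writer.writelines(["one\n", "two\n"])     # the strings are joined, then encoded
written = raw.getvalue()

# The errors attribute can be reassigned during the writer's lifetime.
ascii_raw = io.BytesIO()
ascii_writer = codecs.getwriter("ascii")(ascii_raw)
ascii_writer.errors = "replace"
ascii_writer.write("caf\u00e9")           # the unencodable character is replaced
replaced = ascii_raw.getvalue()
```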

.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream, errors='strict')

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for reading
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less,
      if there is not enough available.

      The *size* argument indicates the approximate maximum
      number of encoded bytes or code points to read
      for decoding. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. This parameter is intended to
      prevent having to decode huge files in one step.

      The *firstline* flag indicates that
      it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy, meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

      If *keepends* is false, line-endings will be stripped from the lines
      returned.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

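The reader interface described above can be tried out along the following
lines. This is a small sketch that assumes the ``utf-8`` codec and an
in-memory stream; the variable names are illustrative only.

```python
import codecs
import io

# A binary stream holding three UTF-8 encoded lines.
raw = io.BytesIO("first\nsecond\nthird\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

line = reader.readline()                     # line ending is kept by default
stripped = reader.readline(keepends=False)   # line ending is removed
rest = reader.read()                         # everything that is left
```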
.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

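The construction described above, using the factories returned by
:func:`lookup`, might look like the following sketch. The ``utf-8`` codec, the
in-memory stream and the variable names are assumptions made for the example.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# Combine the codec's stream reader and stream writer around one stream.
stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, "strict")

stream.write("na\u00efve\n")    # encoded on the way in
stream.seek(0)                  # rewind the underlying stream
text = stream.read()            # decoded on the way out
```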

.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data visible to
   code calling :meth:`read` and :meth:`write`), while *Reader* and *Writer*
   work on the backend (the data in *stream*).

   You can use these objects to do transparent transcodings from e.g. Latin-1
   to UTF-8 and back.

   The *stream* argument must be a file-like object.

   The *encode* and *decode* arguments must
   adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

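The Latin-1/UTF-8 transcoding mentioned above can be sketched as follows. In
this example the frontend is assumed to be Latin-1 and the backend UTF-8; the
variable names are illustrative and the stream is an in-memory buffer.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend: what the calling code sees
utf8 = codecs.lookup("utf-8")       # backend: what is stored in the stream

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend, latin1.encode, latin1.decode,
    utf8.streamreader, utf8.streamwriter, "strict")

recoder.write(b"caf\xe9")           # Latin-1 bytes go in ...
stored = backend.getvalue()         # ... and are stored as UTF-8

backend.seek(0)
roundtrip = recoder.read()          # reading yields Latin-1 bytes again
```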

.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in
range ``0x0``-``0x10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.

The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.

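The failure described above can be reproduced with a short sketch. The sample
string is chosen so that the offending character sits at position 3; the
variable names are only illustrative.

```python
text = "abc\u1234"

try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # The exception records which slice of the input could not be encoded.
    failed_start, failed_end = exc.start, exc.end

# With a different error handler the same call succeeds.
replaced = text.encode("latin-1", errors="replace")
```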
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

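The string constant mentioned above can also be inspected from Python. The
sketch below assumes the constant is exposed as ``decoding_table`` in
:mod:`encodings.cp1252`, which is how CPython's generated charmap modules name
it; the other names are illustrative.

```python
from encodings import cp1252

table = cp1252.decoding_table   # one character per byte value

# Byte 0x80 is where cp1252 places the EURO SIGN.
euro = table[0x80]
same_via_codec = b"\x80".decode("cp1252")
```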
All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way to store each Unicode
code point is to use four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.

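The byte orders and the BOM described above can be observed directly. The
following is a small sketch using the BOM constants defined in this module;
the variable names are made up for the example.

```python
import codecs

text = "A"

big = text.encode("utf-32-be")       # most significant byte first, no BOM
little = text.encode("utf-32-le")    # least significant byte first, no BOM

# Plain "utf-32" writes a BOM followed by the data in native byte order.
native = text.encode("utf-32")
starts_with_bom = native.startswith(codecs.BOM_UTF32)

# On decoding, the BOM selects the byte order and is then dropped.
decoded = (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).decode("utf-16")
```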
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.

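The bit patterns in the table above can be checked with a few lines of Python.
The helper ``utf8_bits`` is a hypothetical name introduced only for this
sketch; it is not part of the :mod:`codecs` module.

```python
def utf8_bits(char):
    """Return the UTF-8 encoding of *char* as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))

one_byte = utf8_bits("A")              # U+0041, first row of the table
two_bytes = utf8_bits("\u00e9")        # U+00E9, second row
three_bytes = utf8_bits("\u20ac")      # U+20AC, third row
four_bytes = utf8_bits("\U0001f600")   # U+1F600, fourth row
```

Reading only the payload bits of ``three_bytes`` (``0010 000010 101100``)
gives back ``0x20AC``.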
As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.

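The behaviour of the ``utf-8-sig`` codec described above can be illustrated
with a short sketch; the variable names are only for the example.

```python
import codecs

encoded = "x".encode("utf-8-sig")            # signature followed by the data
has_signature = encoded.startswith(codecs.BOM_UTF8)

skipped = encoded.decode("utf-8-sig")        # the signature is dropped
kept = encoded.decode("utf-8")               # the signature decodes to U+FEFF

# Input without a signature is decoded like plain UTF-8.
plain = b"x".decode("utf-8-sig")
```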

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.

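The alias rule described above can be checked with :func:`lookup`. This is a
small sketch; the particular spellings are just examples taken from the table
in this section.

```python
import codecs

# Spellings that differ only in case, or in hyphen versus underscore,
# resolve to the same codec.
utf8_names = {codecs.lookup(name).name for name in ("utf-8", "UTF_8", "utf_8")}

# Listed aliases resolve to the same codec as well.
latin1_names = {codecs.lookup(name).name
                for name in ("latin_1", "iso-8859-1", "latin1", "L1")}
```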
.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of aliases: utf-8, utf8,
   latin-1, latin1, iso-8859-1, mbcs (Windows only), ascii, utf-16,
   and utf-32. Using alternative spellings for these encodings may
   result in slower execution.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp273           | 273, IBM273, csIBM273          | German                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| cp65001         |                                | Windows only: Windows UTF-8    |
|                 |                                | (``CP_UTF8``)                  |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.3          |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
1124+-----------------+--------------------------------+--------------------------------+
1125| iso8859_7 | iso-8859-7, greek, greek8 | Greek |
1126+-----------------+--------------------------------+--------------------------------+
1127| iso8859_8 | iso-8859-8, hebrew | Hebrew |
1128+-----------------+--------------------------------+--------------------------------+
1129| iso8859_9 | iso-8859-9, latin5, L5 | Turkish |
1130+-----------------+--------------------------------+--------------------------------+
1131| iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages |
1132+-----------------+--------------------------------+--------------------------------+
Victor Stinnerbfd97672015-09-24 09:04:05 +02001133| iso8859_11 | iso-8859-11, thai | Thai languages |
1134+-----------------+--------------------------------+--------------------------------+
Georg Brandl93dc9eb2010-03-14 10:56:14 +00001135| iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001136+-----------------+--------------------------------+--------------------------------+
1137| iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages |
1138+-----------------+--------------------------------+--------------------------------+
Georg Brandl93dc9eb2010-03-14 10:56:14 +00001139| iso8859_15 | iso-8859-15, latin9, L9 | Western Europe |
1140+-----------------+--------------------------------+--------------------------------+
1141| iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe |
Georg Brandl116aa622007-08-15 14:28:22 +00001142+-----------------+--------------------------------+--------------------------------+
1143| johab | cp1361, ms1361 | Korean |
1144+-----------------+--------------------------------+--------------------------------+
1145| koi8_r | | Russian |
1146+-----------------+--------------------------------+--------------------------------+
1147| koi8_u | | Ukrainian |
1148+-----------------+--------------------------------+--------------------------------+
1149| mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, |
1150| | | Macedonian, Russian, Serbian |
1151+-----------------+--------------------------------+--------------------------------+
1152| mac_greek | macgreek | Greek |
1153+-----------------+--------------------------------+--------------------------------+
1154| mac_iceland | maciceland | Icelandic |
1155+-----------------+--------------------------------+--------------------------------+
1156| mac_latin2 | maclatin2, maccentraleurope | Central and Eastern Europe |
1157+-----------------+--------------------------------+--------------------------------+
Benjamin Peterson23110e72010-08-21 02:54:44 +00001158| mac_roman | macroman, macintosh | Western Europe |
Georg Brandl116aa622007-08-15 14:28:22 +00001159+-----------------+--------------------------------+--------------------------------+
1160| mac_turkish | macturkish | Turkish |
1161+-----------------+--------------------------------+--------------------------------+
1162| ptcp154 | csptcp154, pt154, cp154, | Kazakh |
1163| | cyrillic-asian | |
1164+-----------------+--------------------------------+--------------------------------+
1165| shift_jis | csshiftjis, shiftjis, sjis, | Japanese |
1166| | s_jis | |
1167+-----------------+--------------------------------+--------------------------------+
1168| shift_jis_2004 | shiftjis2004, sjis_2004, | Japanese |
1169| | sjis2004 | |
1170+-----------------+--------------------------------+--------------------------------+
1171| shift_jisx0213 | shiftjisx0213, sjisx0213, | Japanese |
1172| | s_jisx0213 | |
1173+-----------------+--------------------------------+--------------------------------+
Walter Dörwald41980ca2007-08-16 21:55:45 +00001174| utf_32 | U32, utf32 | all languages |
1175+-----------------+--------------------------------+--------------------------------+
1176| utf_32_be | UTF-32BE | all languages |
1177+-----------------+--------------------------------+--------------------------------+
1178| utf_32_le | UTF-32LE | all languages |
1179+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001180| utf_16 | U16, utf16 | all languages |
1181+-----------------+--------------------------------+--------------------------------+
Victor Stinner53a9dd72010-12-08 22:25:45 +00001182| utf_16_be | UTF-16BE | all languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001183+-----------------+--------------------------------+--------------------------------+
Victor Stinner53a9dd72010-12-08 22:25:45 +00001184| utf_16_le | UTF-16LE | all languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001185+-----------------+--------------------------------+--------------------------------+
1186| utf_7 | U7, unicode-1-1-utf-7 | all languages |
1187+-----------------+--------------------------------+--------------------------------+
1188| utf_8 | U8, UTF, utf8 | all languages |
1189+-----------------+--------------------------------+--------------------------------+
1190| utf_8_sig | | all languages |
1191+-----------------+--------------------------------+--------------------------------+
1192
.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

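The surrogate restriction can be observed directly. The following is a
minimal sketch, assuming Python 3.4 or later; the helper ``try_encode`` and
the variable names are hypothetical and exist only for this illustration. It
also assumes the ``'surrogatepass'`` error handler is available for the
UTF-16 codec.

.. code-block:: python

   lone_surrogate = "\ud800"


   def try_encode(text, encoding, errors="strict"):
       """Return the encoded bytes, or None if the codec rejects *text*."""
       try:
           return text.encode(encoding, errors)
       except UnicodeEncodeError:
           return None


   # The strict encoder refuses a lone surrogate code point.
   strict_result = try_encode(lone_surrogate, "utf-16-le")

   # The 'surrogatepass' error handler lets the code point through.
   passed_result = try_encode(lone_surrogate, "utf-16-le", "surrogatepass")

   # The UTF-32 decoder rejects bytes that would decode to U+D800.
   try:
       b"\x00\xd8\x00\x00".decode("utf-32-le")
       utf32_decoded = True
   except UnicodeDecodeError:
       utf32_decoded = False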

Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated purpose describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| idna               |         | Implements :rfc:`3490`,   |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | dbcs    | Windows only: Encode      |
|                    |         | operand according to the  |
|                    |         | ANSI codepage (CP_ACP)    |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5    |
+--------------------+---------+---------------------------+
| punycode           |         | Implements :rfc:`3492`.   |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decodes from |
|                    |         | Latin-1 source code.      |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand. Stateful codecs  |
|                    |         | are not supported.        |
|                    |         |                           |
|                    |         | .. deprecated:: 3.3       |
|                    |         |    This representation is |
|                    |         |    obsoleted by           |
|                    |         |    :pep:`393`.            |
+--------------------+---------+---------------------------+

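A few of the portable codecs in the table above can be exercised with
:meth:`str.encode` and :meth:`bytes.decode`. The following is a small sketch
rather than an exhaustive demonstration; the sample strings and variable
names are chosen only for illustration.

.. code-block:: python

   # unicode_escape: non-ASCII and control characters become escape sequences.
   escaped = "é€\n".encode("unicode_escape")

   # raw_unicode_escape: code points below 256 are written as Latin-1 bytes,
   # everything else as \uXXXX or \UXXXXXXXX.
   raw = "é€".encode("raw_unicode_escape")

   # punycode: ASCII characters first, then an encoded tail for the rest.
   puny = "bücher".encode("punycode")

   # undefined: every conversion fails, even for the empty string.
   try:
       "".encode("undefined")
       undefined_raised = False
   except UnicodeError:
       undefined_raised = True

   # The escape codecs and punycode round-trip through bytes.decode().
   round_trips = (
       escaped.decode("unicode_escape"),
       raw.decode("raw_unicode_escape"),
       puny.decode("punycode"),
   )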

.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).


.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
| Codec                | Aliases          | Purpose                      | Encoder / decoder            |
+======================+==================+==============================+==============================+
| base64_codec [#b64]_ | base64, base_64  | Convert operand to multiline | :meth:`base64.encodebytes` / |
|                      |                  | MIME base64 (the result      | :meth:`base64.decodebytes`   |
|                      |                  | always includes a trailing   |                              |
|                      |                  | ``'\n'``)                    |                              |
|                      |                  |                              |                              |
|                      |                  | .. versionchanged:: 3.4      |                              |
|                      |                  |    accepts any               |                              |
|                      |                  |    :term:`bytes-like object` |                              |
|                      |                  |    as input for encoding and |                              |
|                      |                  |    decoding                  |                              |
+----------------------+------------------+------------------------------+------------------------------+
| bz2_codec            | bz2              | Compress the operand         | :meth:`bz2.compress` /       |
|                      |                  | using bz2                    | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
| hex_codec            | hex              | Convert operand to           | :meth:`binascii.b2a_hex` /   |
|                      |                  | hexadecimal                  | :meth:`binascii.a2b_hex`     |
|                      |                  | representation, with two     |                              |
|                      |                  | digits per byte              |                              |
+----------------------+------------------+------------------------------+------------------------------+
| quopri_codec         | quopri,          | Convert operand to MIME      | :meth:`quopri.encode` with   |
|                      | quotedprintable, | quoted printable             | ``quotetabs=True`` /         |
|                      | quoted_printable |                              | :meth:`quopri.decode`        |
+----------------------+------------------+------------------------------+------------------------------+
| uu_codec             | uu               | Convert the operand using    | :meth:`uu.encode` /          |
|                      |                  | uuencode                     | :meth:`uu.decode`            |
+----------------------+------------------+------------------------------+------------------------------+
| zlib_codec           | zip, zlib        | Compress the operand         | :meth:`zlib.compress` /      |
|                      |                  | using zlib                   | :meth:`zlib.decompress`      |
+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.

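Because these codecs map bytes to bytes, they are reached through
:func:`codecs.encode` and :func:`codecs.decode` rather than through
:meth:`bytes.decode`. The following is a minimal sketch, assuming Python 3.4
or later so that the short aliases are available; the sample data and
variable names are illustrative only.

.. code-block:: python

   import codecs

   data = b"hello world"

   # Two hexadecimal digits per input byte.
   hex_form = codecs.encode(data, "hex")

   # MIME base64; the result ends with a newline.
   b64_form = codecs.encode(data, "base64")

   # Compression round trip through the zlib codec.
   packed = codecs.encode(data * 100, "zlib")
   restored = codecs.decode(packed, "zlib")

   # bytes.decode() only accepts text encodings, so a binary transform
   # is rejected with LookupError.
   try:
       hex_form.decode("hex")
       rejected = False
   except LookupError:
       rejected = True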

.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| rot_13             | rot13   | Returns the Caesar-cypher |
|                    |         | encryption of the operand |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.

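As with the binary transforms, the text transform is used through
:func:`codecs.encode` and :func:`codecs.decode`. A short sketch, assuming
Python 3.4 or later; the variable names are illustrative only:

.. code-block:: python

   import codecs

   scrambled = codecs.encode("Hello, World!", "rot_13")
   restored = codecs.decode(scrambled, "rot_13")

   # str.encode() must return bytes, so it refuses a str-to-str codec.
   try:
       "Hello".encode("rot13")
       rejected = False
   except LookupError:
       rejected = True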

:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.

Python supports this conversion in several ways: the ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.

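The conversion performed by the ``idna`` codec can be tried on the host name
used in the text above. This is a minimal sketch; the variable names are
illustrative, and it relies on nameprep lower-casing the label during
encoding, so the decoded name comes back in lower case.

.. code-block:: python

   hostname = "www.Alliancefrançaise.nu"

   # Unicode to ACE: only the label containing non-ASCII characters changes.
   ace = hostname.encode("idna")

   # ACE to Unicode, for example before showing the name to a user.
   displayed = ace.decode("idna")

   # Split into labels to see which ones received the ACE prefix.
   labels = ace.split(b".")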

.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.

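The three functions operate on a single label rather than on a whole host
name. A brief sketch of calling them directly, assuming they are imported from
:mod:`encodings.idna`; the sample label and variable names are illustrative
only:

.. code-block:: python

   from encodings import idna

   label = "Alliancefrançaise"

   # nameprep normalizes the label, which includes case folding.
   prepared = idna.nameprep(label)

   # ToASCII returns the ACE form of the label as bytes.
   ace_label = idna.ToASCII(label)

   # ToUnicode reverses the conversion and returns a str.
   unicode_label = idna.ToUnicode(ace_label)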

:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

Encode operand according to the ANSI codepage (CP_ACP).

Availability: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8
encoded BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder
this is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data is skipped.

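The behaviour described above can be checked with a few calls. This is a
minimal sketch; the variable names are illustrative only.

.. code-block:: python

   import codecs

   # One-shot encoding prepends the BOM.
   with_bom = "hi".encode("utf-8-sig")

   # Decoding skips the BOM if present and accepts data without one.
   from_bom = with_bom.decode("utf-8-sig")
   from_plain = b"hi".decode("utf-8-sig")

   # The plain utf-8 codec keeps the BOM as U+FEFF instead of skipping it.
   kept = with_bom.decode("utf-8")

   # The stateful (incremental) encoder writes the BOM only once.
   encoder = codecs.getincrementalencoder("utf-8-sig")()
   first_chunk = encoder.encode("a")
   second_chunk = encoder.encode("b")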