:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

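
As a brief illustration of :func:`encode` and :func:`decode`, the following
sketch round-trips a string through a text encoding, uses the bytes-to-bytes
``hex`` codec, and shows that a ``'strict'`` failure is a :exc:`ValueError`
subclass. The sample strings and variable names are arbitrary choices made for
this example.

.. code-block:: python

   import codecs

   # Text encoding: str -> bytes and back again.
   encoded = codecs.encode("café", "utf-8")
   decoded = codecs.decode(encoded, "utf-8")

   # Bytes-to-bytes codec: "hex" is not a text encoding, so it is used
   # through codecs.encode() rather than str.encode().
   hex_bytes = codecs.encode(b"\x00\xff", "hex")

   # With the default 'strict' handler an unencodable character raises
   # UnicodeEncodeError, which is a subclass of ValueError.
   try:
       codecs.encode("café", "ascii")
   except ValueError as exc:
       error_type = type(exc).__name__
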
The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details returned when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.

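
The following sketch shows one way :func:`lookup` and the resulting
:class:`CodecInfo` attributes might be used. The alias ``"UTF8"`` and the
deliberately unknown codec name are example inputs only.

.. code-block:: python

   import codecs

   # The lookup accepts aliases in any case; info.name is the canonical name.
   info = codecs.lookup("UTF8")

   # The stateless functions return a tuple (output, length consumed).
   data, consumed = info.encode("héllo")
   text, used = info.decode(data)

   # An unknown encoding name raises LookupError.
   try:
       codecs.lookup("no_such_codec_example")
   except LookupError:
       missing = True
   else:
       missing = False
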
To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object. In case a search function cannot find
   a given encoding, it should return ``None``.

   .. note::

      Search function registration is not currently reversible,
      which may cause problems in some cases, such as unit testing or
      module reloading.

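
A minimal sketch of a search function is shown below. The function name
``example_search`` and the codec name ``example_rot`` are hypothetical; the
sketch simply maps that made-up name onto the existing ``rot13`` codec rather
than implementing a new one. Registration affects the whole process.

.. code-block:: python

   import codecs

   def example_search(name):
       # Hypothetical search function: it recognizes a single made-up
       # name and reuses the CodecInfo of the standard "rot13" codec.
       if name == "example_rot":
           return codecs.lookup("rot13")
       return None   # unknown name: let other search functions try

   codecs.register(example_search)
   scrambled = codecs.encode("Hello", "example_rot")
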
While the built-in :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      Underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.


.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.


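
The transcoding performed by :func:`EncodedFile` can be sketched with an
in-memory file. The use of :class:`io.BytesIO` and the choice of Latin-1 and
UTF-8 are assumptions made for this example only.

.. code-block:: python

   import codecs
   import io

   raw = io.BytesIO()

   # Callers exchange Latin-1 bytes with the wrapper, while the underlying
   # file holds UTF-8.
   wrapped = codecs.EncodedFile(raw, data_encoding="latin-1",
                                file_encoding="utf-8")
   wrapped.write(b"caf\xe9")      # "café" encoded as Latin-1
   stored = raw.getvalue()        # what actually reached the file

   raw.seek(0)
   round_trip = wrapped.read()    # transcoded back to Latin-1
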
.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.


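
Because these generators use incremental codecs, a multi-byte sequence may be
split across input chunks. The chunk boundaries in this sketch are chosen
arbitrarily to cut the three-byte UTF-8 euro sign apart.

.. code-block:: python

   import codecs

   # The euro sign is three bytes in UTF-8; here it spans three chunks.
   chunks = [b"price: \xe2", b"\x82", b"\xac5"]
   text = "".join(codecs.iterdecode(chunks, "utf-8"))

   # iterencode() works the same way in the other direction.
   encoded = b"".join(codecs.iterencode(["price: ", "€5"], "utf-8"))
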
The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.


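
One possible use of these constants is detecting and removing a UTF-8
signature from raw bytes. The helper ``strip_utf8_bom`` is a hypothetical
name introduced for this sketch and is not part of the module.

.. code-block:: python

   import codecs
   import sys

   def strip_utf8_bom(data):
       """Return *data* without a leading UTF-8 signature, if present."""
       if data.startswith(codecs.BOM_UTF8):
           return data[len(codecs.BOM_UTF8):]
       return data

   # BOM_UTF16 follows the platform's native byte order.
   if sys.byteorder == "little":
       native = codecs.BOM_UTF16_LE
   else:
       native = codecs.BOM_UTF16_BE
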
.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writers typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.


.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue        |
|                         | without further notice. Implemented in        |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | marker; Python will use the official          |
|                         | ``U+FFFD`` REPLACEMENT CHARACTER for the      |
|                         | built-in codecs on decoding, and ``'?'`` on   |
|                         | encoding. Implemented in                      |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding). Implemented    |
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
|                         | (only for encoding). Implemented in           |
|                         | :func:`namereplace_errors`.                   |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.

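
The effect of the standard handlers can be compared side by side. The sample
text and the target encoding (``ascii``) in this sketch are arbitrary; the
``'namereplace'`` handler assumes Python 3.5 or later, as noted above.

.. code-block:: python

   text = "naïve €"

   # Encode the same text to ASCII under each handler.
   results = {
       handler: text.encode("ascii", errors=handler)
       for handler in ("ignore", "replace", "xmlcharrefreplace",
                       "backslashreplace", "namereplace")
   }

   # 'surrogateescape' lets undecodable bytes survive a decode/encode
   # round trip unchanged.
   raw = b"abc\xff"
   escaped = raw.decode("utf-8", errors="surrogateescape")
   restored = escaped.encode("utf-8", errors="surrogateescape")
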
The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on the original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.


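
A minimal sketch of a custom handler follows. The handler name
``example_underscore`` and the function ``underscore_errors`` are hypothetical;
the handler only deals with encoding errors and re-raises anything else.

.. code-block:: python

   import codecs

   def underscore_errors(exc):
       # Hypothetical handler: substitute "_" for each unencodable character.
       if isinstance(exc, UnicodeEncodeError):
           replacement = "_" * (exc.end - exc.start)
           return replacement, exc.end   # resume right after the bad span
       raise exc                         # decoding/translating not handled

   codecs.register_error("example_underscore", underscore_errors)
   encoded = "naïve €".encode("ascii", errors="example_underscore")

Returning ``exc.end`` as the position makes encoding continue immediately
after the characters that could not be encoded.
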
Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.

The following standard error handlers are also made available as module level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling: each encoding or
   decoding error raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling (for :term:`text encodings
   <text encoding>` only): substitutes ``'?'`` for encoding errors
   (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
   character) for decoding errors.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling (for
   :term:`text encodings <text encoding>` only): malformed data is
   replaced by a backslashed escape sequence.

.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by a ``\N{...}`` escape sequence.

   .. versionadded:: 3.5


.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface -- for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


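
The stateless interface can be exercised through :func:`getencoder` and
:func:`getdecoder`, which return functions with the same signature as the
methods above. The sketch below uses UTF-8 as an arbitrary example codec.

.. code-block:: python

   import codecs

   encode = codecs.getencoder("utf-8")
   decode = codecs.getdecoder("utf-8")

   # Each call returns (output object, length consumed). For encoding the
   # length counts input characters; for decoding it counts input bytes.
   data, chars_consumed = encode("é")
   text, bytes_consumed = decode(data)

   # Zero-length input yields an empty object of the output type.
   empty = encode("")
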
Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


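
The equivalence between the joined incremental output and the stateless
result can be checked with a short sketch. Feeding the decoder one byte at a
time is an arbitrary choice that forces multi-byte sequences to be split.

.. code-block:: python

   import codecs

   data = "héllo €".encode("utf-8")
   decoder = codecs.getincrementaldecoder("utf-8")()

   pieces = []
   for i in range(len(data)):
       chunk = data[i:i + 1]                 # one byte at a time
       is_last = (i == len(data) - 1)
       pieces.append(decoder.decode(chunk, final=is_last))

   incremental = "".join(pieces)
   stateless = codecs.decode(data, "utf-8")
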
.. _incremental-encoder-objects:

IncrementalEncoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder(errors='strict')

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode` *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode(object, final=True)``, passing an empty byte or text string
      if necessary, to reset the encoder and to get the output.


.. method:: IncrementalEncoder.getstate()

   Return the current state of the encoder which must be an integer. The
   implementation should make sure that ``0`` is the most common state. (States
   that are more complicated than integers can be converted into an integer by
   marshaling/pickling the state and encoding the bytes of the resulting string
   into an integer).


.. method:: IncrementalEncoder.setstate(state)

   Set the state of the encoder to *state*. *state* must be an encoder state
   returned by :meth:`getstate`.


560.. _incremental-decoder-objects:
561
562IncrementalDecoder Objects
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000563~~~~~~~~~~~~~~~~~~~~~~~~~~
Georg Brandl116aa622007-08-15 14:28:22 +0000564
565The :class:`IncrementalDecoder` class is used for decoding an input in multiple
566steps. It defines the following methods which every incremental decoder must
567define in order to be compatible with the Python codec registry.
568
569
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000570.. class:: IncrementalDecoder(errors='strict')
Georg Brandl116aa622007-08-15 14:28:22 +0000571
572 Constructor for an :class:`IncrementalDecoder` instance.
573
574 All incremental decoders must provide this constructor interface. They are free
575 to add additional keyword arguments, but only the ones defined here are used by
576 the Python codec registry.
577
578 The :class:`IncrementalDecoder` may implement different error handling schemes
Nick Coghlanb9fdb7a2015-01-07 00:22:00 +1000579 by providing the *errors* keyword argument. See :ref:`error-handlers` for
580 possible values.
Georg Brandl116aa622007-08-15 14:28:22 +0000581
582 The *errors* argument will be assigned to an attribute of the same name.
583 Assigning to this attribute makes it possible to switch between different error
Benjamin Peterson3e4f0552008-09-02 00:31:15 +0000584 handling strategies during the lifetime of the :class:`IncrementalDecoder`
Georg Brandl116aa622007-08-15 14:28:22 +0000585 object.
586
Georg Brandl116aa622007-08-15 14:28:22 +0000587
Benjamin Petersone41251e2008-04-25 01:59:09 +0000588 .. method:: decode(object[, final])
Georg Brandl116aa622007-08-15 14:28:22 +0000589
Benjamin Petersone41251e2008-04-25 01:59:09 +0000590 Decodes *object* (taking the current state of the decoder into account)
591 and returns the resulting decoded object. If this is the last call to
592 :meth:`decode` *final* must be true (the default is false). If *final* is
593 true the decoder must decode the input completely and must flush all
594 buffers. If this isn't possible (e.g. because of incomplete byte sequences
595 at the end of the input) it must initiate error handling just like in the
596 stateless case (which might raise an exception).
Georg Brandl116aa622007-08-15 14:28:22 +0000597
598
Benjamin Petersone41251e2008-04-25 01:59:09 +0000599 .. method:: reset()
Georg Brandl116aa622007-08-15 14:28:22 +0000600
Benjamin Petersone41251e2008-04-25 01:59:09 +0000601 Reset the decoder to the initial state.
Georg Brandl116aa622007-08-15 14:28:22 +0000602
603
Benjamin Petersone41251e2008-04-25 01:59:09 +0000604 .. method:: getstate()
Georg Brandl116aa622007-08-15 14:28:22 +0000605
Benjamin Petersone41251e2008-04-25 01:59:09 +0000606 Return the current state of the decoder. This must be a tuple with two
607 items, the first must be the buffer containing the still undecoded
608 input. The second must be an integer and can be additional state
609 info. (The implementation should make sure that ``0`` is the most common
610 additional state info.) If this additional state info is ``0`` it must be
611 possible to set the decoder to the state which has no input buffered and
612 ``0`` as the additional state info, so that feeding the previously
613 buffered input to the decoder returns it to the previous state without
614 producing any output. (Additional state info that is more complicated than
615 integers can be converted into an integer by marshaling/pickling the info
616 and encoding the bytes of the resulting string into an integer.)
Georg Brandl116aa622007-08-15 14:28:22 +0000617
Georg Brandl116aa622007-08-15 14:28:22 +0000618
Benjamin Petersone41251e2008-04-25 01:59:09 +0000619 .. method:: setstate(state)
Georg Brandl116aa622007-08-15 14:28:22 +0000620
Benjamin Petersone41251e2008-04-25 01:59:09 +0000621 Set the state of the encoder to *state*. *state* must be a decoder state
622 returned by :meth:`getstate`.
623
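As an illustration only, the following sketch feeds a UTF-8 byte string to an
incremental decoder in three chunks that deliberately cut two multi-byte
characters in half. The sample text and the chunk boundaries are arbitrary
choices made for this example; the decoder is obtained with
:func:`getincrementaldecoder`.

.. code-block:: python

   import codecs

   # Look up the incremental decoder class for UTF-8 and instantiate it.
   decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

   data = "héllo €".encode("utf-8")
   # Split the bytes so that both "é" and "€" straddle a chunk boundary.
   chunks = [data[:2], data[2:8], data[8:]]

   pieces = [decoder.decode(chunks[0])]
   # The first byte of "é" stays buffered until its continuation byte arrives.
   saved_state = decoder.getstate()

   # reset() discards the buffered byte; setstate() restores it again.
   decoder.reset()
   decoder.setstate(saved_state)

   pieces.append(decoder.decode(chunks[1]))
   # final=True signals that no further input follows, so an incomplete
   # sequence left over at this point would trigger error handling.
   pieces.append(decoder.decode(chunks[2], final=True))

   text = "".join(pieces)

With the UTF-8 decoder shipped with CPython, ``saved_state`` holds the single
pending byte together with ``0`` as the additional state info.
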

Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream, errors='strict')

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for writing
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method). The standard bytes-to-bytes codecs
      do not support this method.


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.

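A short sketch of how a stream writer might be used, assuming an in-memory
:class:`io.BytesIO` object as the underlying stream. The variable names and
sample strings are illustrative; the writer classes come from
:func:`getwriter`.

.. code-block:: python

   import codecs
   import io

   # Wrap a binary stream so that text written to it is encoded as UTF-8.
   raw = io.BytesIO()
   writer = codecs.getwriter("utf-8")(raw, errors="strict")
   writer.write("naïve\n")
   writer.writelines(["café\n", "done\n"])
   utf8_bytes = raw.getvalue()

   # The error handling scheme can be changed after construction by
   # assigning to the ``errors`` attribute.
   ascii_raw = io.BytesIO()
   ascii_writer = codecs.getwriter("ascii")(ascii_raw)
   ascii_writer.errors = "replace"
   ascii_writer.write("café")

   # Attributes that the writer does not define itself, such as
   # ``getvalue``, are looked up on the underlying stream.
   ascii_bytes = ascii_writer.getvalue()
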

.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream, errors='strict')

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for reading
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less,
      if there is not enough available.

      The *size* argument indicates the approximate maximum
      number of encoded bytes or code points to read
      for decoding. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. This parameter is intended to
      prevent having to decode huge files in one step.

      The *firstline* flag indicates that
      it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

      If *keepends* is false, line-endings will be stripped from the lines
      returned.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

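The reading methods can be sketched as follows, again assuming
:class:`io.BytesIO` objects as underlying streams and using
:func:`getreader` to obtain the reader class. The sample data is made up for
this example.

.. code-block:: python

   import codecs
   import io

   raw = io.BytesIO("één\ntwee\ndrie\n".encode("utf-8"))
   reader = codecs.getreader("utf-8")(raw)

   first = reader.readline()                  # line ending is kept
   second = reader.readline(keepends=False)   # line ending is stripped
   remaining = reader.readlines()             # everything that is left

   # *chars* limits the number of decoded code points that are returned;
   # the rest stays buffered for the next call.
   limited = codecs.getreader("utf-8")(io.BytesIO("één twee".encode("utf-8")))
   prefix = limited.read(chars=3)
   rest = limited.read()
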
.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

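A minimal sketch of constructing such an object from the factories returned by
:func:`lookup`. The in-memory stream and the sample text are assumptions made
for the example.

.. code-block:: python

   import codecs
   import io

   info = codecs.lookup("utf-8")
   raw = io.BytesIO()

   # The reader and writer factories come from the same CodecInfo object.
   stream = codecs.StreamReaderWriter(
       raw, info.streamreader, info.streamwriter, "strict"
   )

   stream.write("grüße\n")   # encoded by the writer, stored as bytes
   stream.seek(0)            # rewind before reading the data back
   round_tripped = stream.read()
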

.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data visible to
   code calling :meth:`read` and :meth:`write`), while *Reader* and *Writer*
   work on the backend (the data in *stream*).

   You can use these objects to do transparent transcodings from e.g. Latin-1
   to UTF-8 and back.

   The *stream* argument must be a file-like object.

   The *encode* and *decode* arguments must
   adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

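The frontend/backend split can be illustrated with a sketch that exposes
Latin-1 data stored in a stream as UTF-8 bytes, and converts UTF-8 bytes to
Latin-1 on the way back. The streams and sample words are assumptions made
for this example.

.. code-block:: python

   import codecs
   import io

   utf8 = codecs.lookup("utf-8")       # frontend: what the caller sees
   latin1 = codecs.lookup("latin-1")   # backend: what the stream holds

   # Reading: Latin-1 bytes in the stream come out as UTF-8 bytes.
   source = io.BytesIO("café".encode("latin-1"))
   recoder = codecs.StreamRecoder(
       source, utf8.encode, utf8.decode,
       latin1.streamreader, latin1.streamwriter, "strict",
   )
   frontend_bytes = recoder.read()

   # Writing: UTF-8 bytes handed to write() are stored as Latin-1.
   sink = io.BytesIO()
   recoder = codecs.StreamRecoder(
       sink, utf8.encode, utf8.decode,
       latin1.streamreader, latin1.streamwriter, "strict",
   )
   recoder.write("naïve".encode("utf-8"))
   backend_bytes = sink.getvalue()
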

.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in
range ``0x0``-``0x10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.

The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.

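The failure described above can be reproduced with a few lines. The sample
string is arbitrary; the attributes inspected are the standard ones of
:exc:`UnicodeEncodeError`.

.. code-block:: python

   # Code points up to U+00FF map directly to single bytes.
   encoded = "é".encode("latin-1")

   # U+1234 is outside the range Latin-1 can represent.
   try:
       "abc\u1234".encode("latin-1")
   except UnicodeEncodeError as exc:
       error = exc
   else:
       error = None

   # ``start`` and ``end`` delimit the offending part of the input.
   offending = error.object[error.start:error.end]
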
There's another group of encodings (the so called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

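As a small illustration, the mapping can also be inspected from Python. The
name ``decoding_table`` is the string constant used by the CPython
implementation of this codec module and is shown here as an implementation
detail, not as a documented interface.

.. code-block:: python

   import encodings.cp1252

   # One character per byte value, indexed by the byte.
   table = encodings.cp1252.decoding_table

   # Byte 0x80 is the EURO SIGN in cp1252 ...
   euro_byte = "€".encode("cp1252")
   # ... but a C1 control character in Latin-1.
   latin1_char = b"\x80".decode("latin-1")
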
All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way that can store each Unicode
code point is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though.

To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.

Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.

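The byte order variants and the two roles of ``U+FEFF`` can be sketched with
the BOM constants defined in this module. The single-letter sample text is an
arbitrary choice.

.. code-block:: python

   import codecs

   # Four bytes per code point, in the two possible byte orders.
   big_endian = "A".encode("utf-32-be")
   little_endian = "A".encode("utf-32-le")

   # The generic codec writes a BOM in the machine's native byte order.
   with_bom = "A".encode("utf-16")

   # On decoding, the generic codec uses the BOM to pick the byte order
   # and removes it from the result.
   from_be = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16")
   from_le = (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16")

   # With an explicit byte order the same bytes are treated as an
   # ordinary U+FEFF character and kept in the decoded string.
   kept = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16-be")
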
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.

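The table can be turned into code almost line by line. The helper below is a
hypothetical function written for this illustration, not part of the module;
it ignores details such as the rejection of surrogate code points, which the
real codec handles.

.. code-block:: python

   def utf8_encode_code_point(cp):
       """Encode one code point following the bit layout in the table."""
       if cp < 0x80:
           # 0xxxxxxx
           return bytes([cp])
       if cp < 0x800:
           # 110xxxxx 10xxxxxx
           return bytes([0b11000000 | (cp >> 6),
                         0b10000000 | (cp & 0x3F)])
       if cp < 0x10000:
           # 1110xxxx 10xxxxxx 10xxxxxx
           return bytes([0b11100000 | (cp >> 12),
                         0b10000000 | ((cp >> 6) & 0x3F),
                         0b10000000 | (cp & 0x3F)])
       if cp <= 0x10FFFF:
           # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
           return bytes([0b11110000 | (cp >> 18),
                         0b10000000 | ((cp >> 12) & 0x3F),
                         0b10000000 | ((cp >> 6) & 0x3F),
                         0b10000000 | (cp & 0x3F)])
       raise ValueError("code point out of range")

   # One sample code point from each row of the table.
   samples = [0x41, 0xE9, 0x20AC, 0x1F600]
   matches = [utf8_encode_code_point(cp) == chr(cp).encode("utf-8")
              for cp in samples]
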
As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.

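The difference between the two codecs can be sketched as follows, using the
``BOM_UTF8`` constant from this module. The sample text is arbitrary.

.. code-block:: python

   import codecs

   # Encoding with utf-8-sig prepends the three signature bytes.
   signed = "spam".encode("utf-8-sig")

   # Decoding with utf-8-sig skips the signature if it is present ...
   stripped = signed.decode("utf-8-sig")
   # ... and accepts input that does not start with it.
   unsigned = b"spam".decode("utf-8-sig")

   # The plain utf-8 codec keeps the signature as a U+FEFF character.
   kept = signed.decode("utf-8")
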

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.

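As a quick check of the spelling rule, the following sketch looks up a few
spellings with :func:`lookup` and compares the names of the codecs that are
found. The selection of spellings is an arbitrary sample.

.. code-block:: python

   import codecs

   # Spellings that differ only in case, or in hyphen versus underscore,
   # resolve to the same codec.
   utf8_names = {codecs.lookup(spelling).name
                 for spelling in ("utf-8", "UTF-8", "utf_8", "utf8")}

   # Aliases listed in the table resolve to the codec in the first column.
   cp1252_name = codecs.lookup("windows-1252").name
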
.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of aliases: utf-8, utf8,
   latin-1, latin1, iso-8859-1, mbcs (Windows only), ascii, utf-16,
   and utf-32. Using alternative spellings for these encodings may
   result in slower execution.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp273           | 273, IBM273, csIBM273          | German                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| cp65001         |                                | Windows only: Windows UTF-8    |
|                 |                                | (``CP_UTF8``)                  |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.3          |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
1126| iso2022_kr | csiso2022kr, iso2022kr, | Korean |
1127| | iso-2022-kr | |
1128+-----------------+--------------------------------+--------------------------------+
1129| latin_1 | iso-8859-1, iso8859-1, 8859, | West Europe |
1130| | cp819, latin, latin1, L1 | |
1131+-----------------+--------------------------------+--------------------------------+
1132| iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe |
1133+-----------------+--------------------------------+--------------------------------+
1134| iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese |
1135+-----------------+--------------------------------+--------------------------------+
Christian Heimesc3f30c42008-02-22 16:37:40 +00001136| iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001137+-----------------+--------------------------------+--------------------------------+
1138| iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, |
1139| | | Macedonian, Russian, Serbian |
1140+-----------------+--------------------------------+--------------------------------+
1141| iso8859_6 | iso-8859-6, arabic | Arabic |
1142+-----------------+--------------------------------+--------------------------------+
1143| iso8859_7 | iso-8859-7, greek, greek8 | Greek |
1144+-----------------+--------------------------------+--------------------------------+
1145| iso8859_8 | iso-8859-8, hebrew | Hebrew |
1146+-----------------+--------------------------------+--------------------------------+
1147| iso8859_9 | iso-8859-9, latin5, L5 | Turkish |
1148+-----------------+--------------------------------+--------------------------------+
1149| iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages |
1150+-----------------+--------------------------------+--------------------------------+
Victor Stinnerbfd97672015-09-24 09:04:05 +02001151| iso8859_11 | iso-8859-11, thai | Thai languages |
1152+-----------------+--------------------------------+--------------------------------+
Georg Brandl93dc9eb2010-03-14 10:56:14 +00001153| iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001154+-----------------+--------------------------------+--------------------------------+
1155| iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages |
1156+-----------------+--------------------------------+--------------------------------+
Georg Brandl93dc9eb2010-03-14 10:56:14 +00001157| iso8859_15 | iso-8859-15, latin9, L9 | Western Europe |
1158+-----------------+--------------------------------+--------------------------------+
1159| iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe |
Georg Brandl116aa622007-08-15 14:28:22 +00001160+-----------------+--------------------------------+--------------------------------+
1161| johab | cp1361, ms1361 | Korean |
1162+-----------------+--------------------------------+--------------------------------+
1163| koi8_r | | Russian |
1164+-----------------+--------------------------------+--------------------------------+
Serhiy Storchakaf0eeedf2015-05-12 23:24:19 +03001165| koi8_t | | Tajik |
1166| | | |
1167| | | .. versionadded:: 3.5 |
1168+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001169| koi8_u | | Ukrainian |
1170+-----------------+--------------------------------+--------------------------------+
Serhiy Storchakaad8a1c32015-05-12 23:16:55 +03001171| kz1048 | kz_1048, strk1048_2002, rk1048 | Kazakh |
1172| | | |
1173| | | .. versionadded:: 3.5 |
1174+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001175| mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, |
1176| | | Macedonian, Russian, Serbian |
1177+-----------------+--------------------------------+--------------------------------+
1178| mac_greek | macgreek | Greek |
1179+-----------------+--------------------------------+--------------------------------+
1180| mac_iceland | maciceland | Icelandic |
1181+-----------------+--------------------------------+--------------------------------+
1182| mac_latin2 | maclatin2, maccentraleurope | Central and Eastern Europe |
1183+-----------------+--------------------------------+--------------------------------+
Benjamin Peterson23110e72010-08-21 02:54:44 +00001184| mac_roman | macroman, macintosh | Western Europe |
Georg Brandl116aa622007-08-15 14:28:22 +00001185+-----------------+--------------------------------+--------------------------------+
1186| mac_turkish | macturkish | Turkish |
1187+-----------------+--------------------------------+--------------------------------+
1188| ptcp154 | csptcp154, pt154, cp154, | Kazakh |
1189| | cyrillic-asian | |
1190+-----------------+--------------------------------+--------------------------------+
1191| shift_jis | csshiftjis, shiftjis, sjis, | Japanese |
1192| | s_jis | |
1193+-----------------+--------------------------------+--------------------------------+
1194| shift_jis_2004 | shiftjis2004, sjis_2004, | Japanese |
1195| | sjis2004 | |
1196+-----------------+--------------------------------+--------------------------------+
1197| shift_jisx0213 | shiftjisx0213, sjisx0213, | Japanese |
1198| | s_jisx0213 | |
1199+-----------------+--------------------------------+--------------------------------+
Walter Dörwald41980ca2007-08-16 21:55:45 +00001200| utf_32 | U32, utf32 | all languages |
1201+-----------------+--------------------------------+--------------------------------+
1202| utf_32_be | UTF-32BE | all languages |
1203+-----------------+--------------------------------+--------------------------------+
1204| utf_32_le | UTF-32LE | all languages |
1205+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001206| utf_16 | U16, utf16 | all languages |
1207+-----------------+--------------------------------+--------------------------------+
Victor Stinner53a9dd72010-12-08 22:25:45 +00001208| utf_16_be | UTF-16BE | all languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001209+-----------------+--------------------------------+--------------------------------+
Victor Stinner53a9dd72010-12-08 22:25:45 +00001210| utf_16_le | UTF-16LE | all languages |
Georg Brandl116aa622007-08-15 14:28:22 +00001211+-----------------+--------------------------------+--------------------------------+
1212| utf_7 | U7, unicode-1-1-utf-7 | all languages |
1213+-----------------+--------------------------------+--------------------------------+
1214| utf_8 | U8, UTF, utf8 | all languages |
1215+-----------------+--------------------------------+--------------------------------+
1216| utf_8_sig | | all languages |
1217+-----------------+--------------------------------+--------------------------------+
1218
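As a small illustration of how the aliases in the table relate to the codec names, the sketch below looks up a handful of aliases with :func:`codecs.lookup` and prints the name each one resolves to. The choice of aliases is arbitrary, and the resolved names are whatever the running interpreter reports.

```python
import codecs

# Look up a few aliases from the table; each resolves to a CodecInfo object
# whose ``name`` attribute identifies the underlying codec.
resolved = {}
for alias in ("windows-1252", "cp936", "latin1", "sjis", "U8"):
    resolved[alias] = codecs.lookup(alias).name
    print(f"{alias!r:>16} -> {resolved[alias]}")

# An alias and the codec name it maps to produce identical output.
same_bytes = "\u00e9".encode("latin1") == "\u00e9".encode("latin_1")
print(same_bytes)
```

Lookups are case-insensitive, which is why ``U8`` and ``u8`` refer to the same codec.
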
.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.


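The surrogate restriction can be observed directly. The following is a minimal sketch, assuming Python 3.4 or later; the variable names are illustrative only. It also shows the ``'surrogatepass'`` error handler, which is one way to opt back in to encoding lone surrogates.

```python
lone_surrogate = "\ud800"  # an unpaired high surrogate

# Encoding a lone surrogate with the UTF-16/UTF-32 codecs is rejected.
rejected = []
for encoding in ("utf-16", "utf-16-le", "utf-32", "utf-32-be"):
    try:
        lone_surrogate.encode(encoding)
    except UnicodeEncodeError:
        rejected.append(encoding)
print(rejected)

# The 'surrogatepass' error handler lets the code point through unchanged.
raw = lone_surrogate.encode("utf-16-le", errors="surrogatepass")
print(raw)

# The UTF-32 decoders likewise refuse bytes that spell out a surrogate.
try:
    b"\x00\xd8\x00\x00".decode("utf-32-le")
    utf32_decoded = True
except UnicodeDecodeError:
    utf32_decoded = False
print(utf32_decoded)
```
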
Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated purpose describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| idna               |         | Implements :rfc:`3490`,   |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | dbcs    | Windows only: Encode      |
|                    |         | operand according to the  |
|                    |         | ANSI codepage (CP_ACP)    |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5    |
+--------------------+---------+---------------------------+
| punycode           |         | Implements :rfc:`3492`.   |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decodes from |
|                    |         | Latin-1 source code.      |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand. Stateful codecs  |
|                    |         | are not supported.        |
|                    |         |                           |
|                    |         | .. deprecated:: 3.3       |
|                    |         |    This representation is |
|                    |         |    obsoleted by           |
|                    |         |    :pep:`393`.            |
+--------------------+---------+---------------------------+

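A few of the codecs in the table above can be exercised as follows. This is only a sketch: the sample string is arbitrary, and ``unicode_internal`` and ``mbcs`` are left out because the former is deprecated and the latter is Windows-only.

```python
text = "caf\u00e9 \u20ac\n"  # 'café €' followed by a newline

# unicode_escape yields pure ASCII: non-ASCII characters and the newline
# are all written as backslash escapes.
escaped = text.encode("unicode_escape")
print(escaped)

# raw_unicode_escape keeps Latin-1 characters (and the newline) as single
# bytes and only escapes code points above U+00FF.
raw_escaped = text.encode("raw_unicode_escape")
print(raw_escaped)

# Both codecs round-trip through their decoders.
round_trips = (escaped.decode("unicode_escape") == text
               and raw_escaped.decode("raw_unicode_escape") == text)

# punycode produces the ASCII form that IDNA prefixes with 'xn--'.
puny = "alliancefran\u00e7aise".encode("punycode")
print(puny)

# The 'undefined' codec raises for every conversion, even an empty string.
try:
    "".encode("undefined")
    undefined_raised = False
except UnicodeError:
    undefined_raised = True
print(undefined_raised)
```
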
.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).


.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
| Codec                | Aliases          | Purpose                      | Encoder / decoder            |
+======================+==================+==============================+==============================+
| base64_codec [#b64]_ | base64, base_64  | Convert operand to multiline | :meth:`base64.encodebytes` / |
|                      |                  | MIME base64 (the result      | :meth:`base64.decodebytes`   |
|                      |                  | always includes a trailing   |                              |
|                      |                  | ``'\n'``)                    |                              |
|                      |                  |                              |                              |
|                      |                  | .. versionchanged:: 3.4      |                              |
|                      |                  |    accepts any               |                              |
|                      |                  |    :term:`bytes-like object` |                              |
|                      |                  |    as input for encoding and |                              |
|                      |                  |    decoding                  |                              |
+----------------------+------------------+------------------------------+------------------------------+
| bz2_codec            | bz2              | Compress the operand         | :meth:`bz2.compress` /       |
|                      |                  | using bz2                    | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
| hex_codec            | hex              | Convert operand to           | :meth:`binascii.b2a_hex` /   |
|                      |                  | hexadecimal                  | :meth:`binascii.a2b_hex`     |
|                      |                  | representation, with two     |                              |
|                      |                  | digits per byte              |                              |
+----------------------+------------------+------------------------------+------------------------------+
| quopri_codec         | quopri,          | Convert operand to MIME      | :meth:`quopri.encode` with   |
|                      | quotedprintable, | quoted printable             | ``quotetabs=True`` /         |
|                      | quoted_printable |                              | :meth:`quopri.decode`        |
+----------------------+------------------+------------------------------+------------------------------+
| uu_codec             | uu               | Convert the operand using    | :meth:`uu.encode` /          |
|                      |                  | uuencode                     | :meth:`uu.decode`            |
+----------------------+------------------+------------------------------+------------------------------+
| zlib_codec           | zip, zlib        | Compress the operand         | :meth:`zlib.compress` /      |
|                      |                  | using gzip                   | :meth:`zlib.decompress`      |
+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.


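Because :meth:`bytes.decode` rejects these codecs, the binary transforms are reached through :func:`codecs.encode` and :func:`codecs.decode`. A minimal sketch, with an arbitrary payload and assuming Python 3.4 or later for the short aliases:

```python
import codecs

payload = b"hello world"

# bytes -> bytes transforms go through the module-level functions.
hexed = codecs.encode(payload, "hex")
b64 = codecs.encode(payload, "base64")        # ends with a trailing newline
packed = codecs.encode(payload * 100, "zlib")
print(hexed, b64, len(packed))

# Each transform is reversed with codecs.decode().
unhexed = codecs.decode(hexed, "hex")
unpacked = codecs.decode(packed, "zlib")

# bytes.decode() only accepts text encodings, so this raises LookupError.
try:
    hexed.decode("hex")
    method_rejected = False
except LookupError as exc:
    method_rejected = True
    print(exc)
```
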
.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| rot_13             | rot13   | Returns the Caesar-cypher |
|                    |         | encryption of the operand |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.


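As with the binary transforms, ``rot_13`` is used through :func:`codecs.encode` and :func:`codecs.decode` rather than :meth:`str.encode`. A short sketch with an arbitrary message:

```python
import codecs

message = "Hello, World!"

# str -> str transform; applying it twice restores the original text.
scrambled = codecs.encode(message, "rot_13")
restored = codecs.decode(scrambled, "rot13")   # alias, Python 3.4+
print(scrambled, restored)

# str.encode() must return bytes, so it refuses the text transform.
try:
    message.encode("rot_13")
    method_rejected = False
except LookupError as exc:
    method_rejected = True
    print(exc)
```
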
:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.

Python supports this conversion in several ways: the ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


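The codec-level conversion described in this section can be sketched with the example domain from the text. The variable names are illustrative; note that the decoded name comes back case-folded because nameprep is applied during encoding.

```python
name = "www.Alliancefran\u00e7aise.nu"

# Unicode -> ACE: each label is nameprepped and, if it contains non-ASCII
# characters, converted to an 'xn--' label.
ace = name.encode("idna")
print(ace)

# ACE -> Unicode, for example for a host name received from the wire
# that is about to be shown to the user.
decoded = ace.decode("idna")
print(decoded)
```
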
.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.


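The three functions above operate on a single label rather than a whole domain name. A minimal sketch, reusing the label from the earlier example and assuming :func:`ToASCII` returns :class:`bytes` as it does in current Python 3 releases:

```python
from encodings import idna

label = "Alliancefran\u00e7aise"

# nameprep() normalizes the label; here that amounts to case folding.
prepped = idna.nameprep(label)
print(prepped)

# ToASCII() applies nameprep and punycode and adds the ACE prefix.
ace_label = idna.ToASCII(label)
print(ace_label)

# ToUnicode() reverses the conversion for a single ACE label.
back = idna.ToUnicode(ace_label)
print(back)
```
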
:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

Encode operand according to the ANSI codepage (CP_ACP).

Availability: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data is skipped.
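The behaviour described above can be checked with a short sketch. The sample text is arbitrary, and the incremental encoder stands in for the stateful case.

```python
import codecs

# Encoding prepends the UTF-8 encoded BOM.
data = "abc".encode("utf-8-sig")
print(data)

# Decoding skips the BOM if present and tolerates its absence.
with_bom = data.decode("utf-8-sig")
without_bom = b"abc".decode("utf-8-sig")

# The plain UTF-8 codec keeps the BOM as a U+FEFF character instead.
plain = data.decode("utf-8")

# A stateful encoder writes the BOM only on the first call.
encoder = codecs.getincrementalencoder("utf-8-sig")()
chunks = [encoder.encode("a"), encoder.encode("b")]
print(chunks)
```
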