:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry which
manages the codec and error handling lookup process.

It defines the following functions:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*. The default
   encoding is ``utf-8``.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``strict``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*. The default
   encoding is ``utf-8``.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``strict``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.
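
As a minimal sketch of the two functions above (the sample string is only an
illustration), the following round-trips a string through UTF-8 and shows how
the *errors* argument changes what happens to an unencodable character:

```python
import codecs

# Encode a string to bytes and decode it back; both default to UTF-8.
data = codecs.encode("caf\u00e9")
text = codecs.decode(data)

# With the default 'strict' handler an unencodable character raises
# UnicodeEncodeError, which is a subclass of ValueError.
try:
    codecs.encode("caf\u00e9", "ascii")
except ValueError as exc:
    strict_error = type(exc).__name__

# A different handler substitutes a replacement instead of raising.
replaced = codecs.encode("caf\u00e9", "ascii", "replace")
```

Catching :exc:`ValueError` here is deliberate: it covers the codec-specific
subclasses without naming them.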

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
   instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
   are expected to work in a stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

      ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

      ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.

   Possible values for errors are

   * ``'strict'``: raise an exception in case of an encoding error
   * ``'replace'``: replace malformed data with a suitable replacement marker,
     such as ``'?'`` or ``'\ufffd'``
   * ``'ignore'``: ignore malformed data and continue without further notice
   * ``'xmlcharrefreplace'``: replace with the appropriate XML character
     reference (for encoding only)
   * ``'backslashreplace'``: replace with backslashed escape sequences (for
     encoding only)
   * ``'surrogateescape'``: on decoding, replace each undecodable byte with a
     lone surrogate code point in the range U+DC80 to U+DCFF. These code
     points will then be turned back into the same bytes when the
     ``surrogateescape`` error handler is used when encoding the data.
     (See :pep:`383` for more.)

   as well as any other error handling name defined via :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.

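A search function along the lines described above might look like the
following sketch. The encoding name ``x_demo_utf8`` and the function
``demo_search`` are hypothetical; the codec simply reuses the pieces of the
built-in UTF-8 codec rather than implementing anything new:

```python
import codecs

def demo_search(name):
    """Hypothetical search function: maps 'x_demo_utf8' onto UTF-8."""
    if name != "x_demo_utf8":
        return None  # unknown encoding: let other search functions try
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        name="x_demo_utf8",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )

codecs.register(demo_search)

# The new name is now usable wherever an encoding name is accepted.
encoded = "h\u00e9".encode("x_demo_utf8")
registered_name = codecs.lookup("x_demo_utf8").name
```

Returning ``None`` for every other name matters: it is what lets the registry
fall through to the remaining search functions.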

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.
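
A short sketch of a successful and a failed lookup; the second encoding name
is made up for the purpose of triggering the error:

```python
import codecs

info = codecs.lookup("UTF-8")            # the name is normalized before lookup
canonical_name = info.name
encoded, consumed = info.encode("abc")   # the stateless encoder returns a tuple

try:
    codecs.lookup("no-such-encoding-xyz")  # hypothetical, unregistered name
except LookupError:
    missing = True
else:
    missing = False
```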

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

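The accessor functions return callables rather than ready-made objects. The
sketch below, which uses in-memory :class:`io.BytesIO` streams purely for
illustration, shows a stream writer and reader being instantiated from the
returned classes and a stateless encoder being called directly:

```python
import codecs
import io

# getwriter() returns a class; instantiate it with a binary stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write("na\u00efve\n")

# getreader() works the same way for decoding.
reader = codecs.getreader("utf-8")(io.BytesIO(raw.getvalue()))
line = reader.readline()

# getencoder() returns the stateless encoder function itself.
encoder = codecs.getencoder("latin-1")
latin_bytes, length = encoder("na\u00efve")
```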

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on the original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.

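The handler contract described above can be sketched as follows. The handler
name ``x-underscore`` and the function ``underscore_errors`` are hypothetical,
and this version only deals with encoding errors:

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement and the position at which to resume.
        return ("_" * (exc.end - exc.start), exc.end)
    raise exc  # this sketch does not handle decoding or translating

codecs.register_error("x-underscore", underscore_errors)

result = "a\u20acb".encode("ascii", errors="x-underscore")
handler = codecs.lookup_error("x-underscore")
```

Because the replacement is a string, the encoder encodes it with the target
codec, so it must itself be encodable.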

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling: each encoding or decoding error
   raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling: malformed data is replaced with a
   suitable replacement character such as ``'?'`` in bytestrings and
   ``'\ufffd'`` in Unicode strings.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling (for encoding only): the
   unencodable character is replaced by a backslashed escape sequence.
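
To compare the predefined handlers side by side, the sketch below encodes one
sample string to ASCII under each of them, and also calls
:func:`replace_errors` directly with a hand-built exception (the reason string
``"demo"`` is arbitrary):

```python
import codecs

text = "\u00fcber \u20ac5"

# Same input, same codec, four different error handlers.
by_handler = {
    name: text.encode("ascii", errors=name)
    for name in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace")
}

# The handler functions can also be called directly with an exception
# instance; they return (replacement, position to resume at).
exc = UnicodeEncodeError("ascii", "a\u20ac", 1, 2, "demo")
direct = codecs.replace_errors(exc)
```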

To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=1)

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding. The default file mode is ``'r'``,
   meaning to open the file in read mode.

   .. note::

      The wrapped version's methods will accept and return strings only. Bytes
      arguments will be rejected.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using 8-bit
      values. This means that no automatic conversion of ``b'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.

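A minimal sketch of writing and reading back an encoded file, using a
temporary directory so nothing is left behind. The file name is arbitrary, and
newer Python releases steer users toward the built-in :func:`open` and may
emit a deprecation warning for this function:

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # arbitrary file name

    # Strings go in; the wrapper encodes them on the way to disk.
    with codecs.open(path, "w", encoding="utf-16-le") as f:
        f.write("hi\n")

    # Strings come out; the wrapper decodes them on the way back.
    with codecs.open(path, "r", encoding="utf-16-le") as f:
        text = f.read()

    # The file on disk holds the raw encoded bytes; the newline was not
    # translated because the underlying file is opened in binary mode.
    with open(path, "rb") as f:
        raw = f.read()
```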

.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a wrapped version of *file* which provides transparent encoding
   translation.

   Bytes written to the wrapped file are interpreted according to the given
   *data_encoding* and then written to the original file as bytes using the
   *file_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.

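The translation can be sketched with an in-memory file (an assumption made for
brevity; any binary file object should do). Latin-1 bytes are written to the
wrapper and arrive in the underlying file as UTF-8; reading through a wrapper
performs the reverse conversion:

```python
import codecs
import io

backing = io.BytesIO()

# Callers supply Latin-1 bytes; the underlying file stores UTF-8.
wrapped = codecs.EncodedFile(backing, data_encoding="latin-1",
                             file_encoding="utf-8")
wrapped.write(b"caf\xe9")
stored = backing.getvalue()

# Reading through a fresh wrapper converts UTF-8 back to Latin-1 bytes.
read_back = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8").read()
```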

.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.

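Because these generators use incremental codecs, a multi-byte sequence may be
split across input chunks. A small sketch, with chunk boundaries chosen to cut
a euro sign in two:

```python
import codecs

# The three-byte UTF-8 form of U+20AC is split across two chunks.
chunks = [b"\xe2\x82", b"\xac 5"]
decoded = list(codecs.iterdecode(chunks, "utf-8"))

# Encoding works the same way in the other direction.
encoded = b"".join(codecs.iterencode(["\u20ac", " 5"], "utf-8"))
```

Note that the decoder yields nothing for the first chunk, since it does not
yet contain a complete character.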

The module also provides the following constants which are useful for reading
and writing to platform-dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.

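The relationships between the constants can be checked directly. The helper
``strip_utf8_bom`` below is a hypothetical illustration of one common use,
removing a UTF-8 signature from the start of some bytes:

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE

def strip_utf8_bom(data):
    """Hypothetical helper: drop a leading UTF-8 signature, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

stripped = strip_utf8_bom(codecs.BOM_UTF8 + b"hi")
```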

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in
Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`~Codec.encode` and
:meth:`~Codec.decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | Replace byte with surrogate U+DCxx, as defined|
|                         | in :pep:`383`.                                |
+-------------------------+-----------------------------------------------+

In addition, the following error handlers are specific to Unicode encoding
schemes:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codec                  | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes in all the Unicode encoding schemes.|
|                   | utf-32-be, utf-32-le   |                                           |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

The set of allowed values can be extended via :func:`register_error`.

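The two surrogate-related handlers are easiest to understand from a round
trip. In this sketch the input bytes are deliberately invalid UTF-8:

```python
# surrogateescape: an undecodable byte becomes a lone surrogate U+DCxx on
# decoding, and is turned back into the original byte on encoding.
raw = b"abc\xff"
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

# surrogatepass: lets a lone surrogate through the UTF-8 codec, which
# would otherwise reject it.
lone = "\ud800"
passed = lone.encode("utf-8", errors="surrogatepass")
```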

.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   Encoding converts a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). Decoding converts a bytes object encoded using a particular
   character set encoding to a string object.

   *input* must be a bytes object or one which provides the read-only character
   buffer interface -- for example, buffer objects and memory mapped files.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.

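The equivalence between incremental and stateless decoding can be sketched by
feeding a UTF-8 byte string to an incremental decoder one byte at a time and
comparing the joined result with a single stateless call:

```python
import codecs

data = "\u00e9t\u00e9".encode("utf-8")   # five bytes, two of them lead bytes

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed the input one byte at a time; incomplete sequences are buffered.
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
pieces.append(decoder.decode(b"", final=True))
joined = "".join(pieces)

# The stateless decoder handles the whole input in one call.
stateless, consumed = codecs.getdecoder("utf-8")(data)
```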

.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode` *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode('', final=True)`` to reset the encoder and to get the output.


   .. method:: getstate()

      Return the current state of the encoder which must be an integer. The
      implementation should make sure that ``0`` is the most common state. (States
      that are more complicated than integers can be converted into an integer by
      marshaling/pickling the state and encoding the bytes of the resulting string
      into an integer).


   .. method:: setstate(state)

      Set the state of the encoder to *state*. *state* must be an encoder state
      returned by :meth:`getstate`.

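An encoder that keeps state is easiest to observe with UTF-16, where (in
CPython's built-in codec, which this sketch assumes) the byte order mark is
written once at the start and not repeated on later calls:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("a")               # BOM followed by the character
state = encoder.getstate()                # state after the BOM was written
second = encoder.encode("b", final=True)  # no BOM this time

# reset() returns to the initial state, in which a BOM would be written
# again; restoring the saved state suppresses it.
encoder.reset()
encoder.setstate(state)
third = encoder.encode("c")
```

Joining ``first`` and ``second`` gives the same bytes as a single stateless
``"ab".encode("utf-16")`` call.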

.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode` *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items, the first must be the buffer containing the still undecoded
      input. The second must be an integer and can be additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0`` it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.

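The state tuple can be inspected and transplanted. This sketch, which assumes
the built-in UTF-8 decoder, stops in the middle of a three-byte sequence, reads
the buffered bytes from :meth:`getstate`, and resumes in a second decoder:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Two of the three bytes of U+20AC: nothing can be produced yet.
out = decoder.decode(b"\xe2\x82")
buffered, flag = decoder.getstate()

# A second decoder picks up exactly where the first one stopped.
other = codecs.getincrementaldecoder("utf-8")()
other.setstate((buffered, flag))
completed = other.decode(b"\xac", final=True)
```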
Georg Brandl116aa622007-08-15 14:28:22 +0000589
Georg Brandl116aa622007-08-15 14:28:22 +0000590The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
591working interfaces which can be used to implement new encoding submodules very
592easily. See :mod:`encodings.utf_8` for an example of how this is done.
593
594
.. _stream-writer-objects:

StreamWriter Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. The following values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method).


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.
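As a minimal sketch of the interface described above, the following wraps an
in-memory binary stream with the ASCII stream writer and switches the error
handler halfway through. The variable names and the choice of the ``ascii``
codec are illustrative assumptions only.

```python
import codecs
import io

# An in-memory binary stream stands in for a file opened in "wb" mode.
raw = io.BytesIO()

# codecs.getwriter() returns the StreamWriter class registered for the codec.
writer = codecs.getwriter("ascii")(raw, errors="strict")
writer.write("abc")

# With 'strict', a character outside ASCII raises UnicodeEncodeError
# (a ValueError subclass) and nothing is written for that call.
try:
    writer.write("\u20ac")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Assigning to the errors attribute changes the strategy for later writes.
writer.errors = "xmlcharrefreplace"
writer.write("\u20ac")

# writelines() writes the concatenation of the given strings.
writer.writelines(["x", "y"])

result = raw.getvalue()
```

After this runs, ``result`` holds the ASCII bytes with the euro sign replaced
by its XML character reference.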
658
659
.. _stream-reader-objects:

StreamReader Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream[, errors])

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for reading (binary) data.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. The following values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      *chars* indicates the number of characters to read from the
      stream. :meth:`read` will never return more than *chars* characters, but
      it might return fewer, if there are not enough characters available.

      *size* indicates the approximate maximum number of bytes to read from the
      stream for decoding purposes. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. *size* is intended to prevent having to decode huge files in
      one step.

      *firstline* indicates that it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

      If *keepends* is false, line endings will be stripped from the lines
      returned.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.
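The reading methods can be tried out on an in-memory stream. This is only a
sketch: the sample text and variable names are made up for illustration, and
the UTF-8 stream reader stands in for any registered codec.

```python
import codecs
import io

data = "first line\nsecond line\nthird: \u00e9\n".encode("utf-8")

# codecs.getreader() returns the StreamReader class registered for the codec.
reader = codecs.getreader("utf-8")(io.BytesIO(data))

first = reader.readline()                 # line ending is kept by default
second = reader.readline(keepends=False)  # line ending is stripped
rest = reader.readlines()                 # all remaining lines as a list

# read(chars=N) returns at most N decoded characters, even when some of
# them occupy more than one byte in the encoded stream.
reader2 = codecs.getreader("utf-8")(io.BytesIO("h\u00e9llo w\u00f6rld".encode("utf-8")))
head = reader2.read(chars=5)
tail = reader2.read()
```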
753
The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.
756
757
.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
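A short sketch of building such an object from the factories returned by
:func:`lookup`. The in-memory stream and the variable names are assumptions
made for the example.

```python
import codecs
import io

# lookup() returns a CodecInfo whose streamreader and streamwriter
# attributes are the factory classes for the codec.
info = codecs.lookup("utf-8")

raw = io.BytesIO()
stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, "strict"
)

stream.write("caf\u00e9\n")   # encoded to UTF-8 on the way in
stream.seek(0)                # rewind the underlying stream before reading
text = stream.read()          # decoded from UTF-8 on the way out
encoded = raw.getvalue()
```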
780
781
.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamRecoder` provides a frontend-backend view of encoding data
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data visible to code calling
   :meth:`read` and :meth:`write`) while *Reader* and *Writer* work on the backend
   (reading and writing to the stream).

   You can use these objects to do transparent direct recodings from e.g. Latin-1
   to UTF-8 and back.

   *stream* must be a file-like object.

   *encode* and *decode* must adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   *encode* and *decode* are needed for the frontend translation, *Reader* and
   *Writer* for the backend translation.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
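The Latin-1 to UTF-8 recoding mentioned above might look like the sketch
below. It assumes a Latin-1 encoded backend stream and a caller that wants to
see UTF-8 bytes; the variable names are illustrative. The helper
:func:`EncodedFile` builds the same kind of object from two encoding names.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")        # frontend: what the caller sees
latin1 = codecs.lookup("latin-1")    # backend: how the stream is stored

# Reading: Latin-1 bytes in the stream come out as UTF-8 bytes.
backend_in = io.BytesIO("caf\u00e9".encode("latin-1"))
recoder = codecs.StreamRecoder(
    backend_in, utf8.encode, utf8.decode,
    latin1.streamreader, latin1.streamwriter, "strict",
)
front = recoder.read()

# Writing: UTF-8 bytes handed to write() are stored as Latin-1 bytes.
backend_out = io.BytesIO()
recoder_out = codecs.StreamRecoder(
    backend_out, utf8.encode, utf8.decode,
    latin1.streamreader, latin1.streamwriter, "strict",
)
recoder_out.write("na\u00efve".encode("utf-8"))
stored = backend_out.getvalue()
```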
820
821
.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of codepoints in the range
``0x0``-``0x10FFFF`` (see :pep:`393` for more details about the implementation).
Once a string object is used outside of CPU and memory, endianness
and how these codepoints are stored as bytes become an issue. Transforming a
string object into a sequence of bytes is called encoding and recreating the
string object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the codepoints 0-255 to
the bytes ``0x0``-``0xff``. This means that a string object that contains
codepoints above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). :meth:`str.encode` will then raise a
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
codec can't encode character '\u1234' in position 3: ordinal not in
range(256)``.
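The Latin-1 behaviour described above can be checked directly. The sample
string is an arbitrary choice for illustration.

```python
# Every byte value maps to the codepoint with the same number, and back.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")

# A codepoint above U+00FF cannot be encoded.
try:
    "abc\u1234".encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and where in the string.
failed_codec = error.encoding
failed_position = error.start
```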
841
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these codepoints are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
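As a quick illustration, the mapping table of the ``cp1252`` codec can be
inspected from Python. The byte ``0x80`` is used here only as an example of a
position where ``cp1252`` and ``latin-1`` differ.

```python
import encodings.cp1252

# The 256-character string constant: index = byte value, item = character.
table = encodings.cp1252.decoding_table
table_size = len(table)

# Byte 0x80 is the EURO SIGN in cp1252 but a control character in latin-1.
as_cp1252 = b"\x80".decode("cp1252")
as_latin1 = b"\x80".decode("latin-1")
euro_bytes = "\u20ac".encode("cp1252")
```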
848
All of these encodings can only encode 256 of the 1114112 codepoints
defined in Unicode. A simple and straightforward way to store each Unicode
code point is to use four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though.

To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte-swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE``, the bytes have to be swapped on decoding.

Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
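The following sketch shows the byte order variants and the two roles of
``U+FEFF`` using the BOM constants from this module. The single letter ``a``
is an arbitrary sample.

```python
import codecs

# Explicit byte order: four bytes per codepoint, no BOM is written.
big = "a".encode("utf-32-be")
little = "a".encode("utf-32-le")

# Plain UTF-32 uses the machine's native order and prepends a BOM.
native = "a".encode("utf-32")
has_bom = native[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# The generic codec uses the BOM to pick the byte order, then drops it.
from_be = (codecs.BOM_UTF16_BE + "a".encode("utf-16-be")).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + "a".encode("utf-16-le")).decode("utf-16")

# With an explicit byte order the same bytes decode to a normal U+FEFF
# character (ZERO WIDTH NO-BREAK SPACE) that stays in the string.
kept = (codecs.BOM_UTF16_BE + "a".encode("utf-16-be")).decode("utf-16-be")
```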
874
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
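The bit layout in the table can be spelled out by hand. The function below is
a hypothetical helper written for illustration, not part of this module, and
it does not reject surrogate codepoints the way the real ``utf-8`` codec does.

```python
def utf8_encode_code_point(cp):
    """Encode one codepoint following the marker/payload layout above."""
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("codepoint out of Unicode range")


# One sample codepoint from each row of the table.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
encoded = [utf8_encode_code_point(cp) for cp in samples]
```

For non-surrogate codepoints the result should agree with
``chr(cp).encode('utf-8')``.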
896
As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.
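The difference between ``utf-8`` and ``utf-8-sig`` can be seen on a short
sample. The string ``abc`` is arbitrary.

```python
import codecs

# Encoding with utf-8-sig prepends the three signature bytes.
with_sig = "abc".encode("utf-8-sig")
signature = codecs.BOM_UTF8

# Decoding with utf-8-sig skips the signature if present and tolerates
# its absence.
skipped = with_sig.decode("utf-8-sig")
no_sig = b"abc".decode("utf-8-sig")

# Plain utf-8 keeps the signature as a leading U+FEFF character.
kept = with_sig.decode("utf-8")

# Read as iso-8859-1, the signature is three unrelated characters.
as_latin1 = signature.decode("iso-8859-1")
```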
Georg Brandl116aa622007-08-15 14:28:22 +0000925
926
.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
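The alias rule can be checked with :func:`lookup`. The spellings below are a
small illustrative selection, not a complete list.

```python
import codecs

# Spellings that differ only in case, hyphen or underscore resolve to the
# same codec, as do the registered aliases.
utf8_names = {codecs.lookup(s).name for s in ("utf_8", "utf-8", "UTF-8", "utf8")}
latin1_names = {codecs.lookup(s).name for s in ("latin_1", "latin-1", "iso-8859-1", "L1")}

# An unknown name raises LookupError.
try:
    codecs.lookup("no-such-codec")
    unknown_found = True
except LookupError:
    unknown_found = False
```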
Georg Brandl116aa622007-08-15 14:28:22 +0000939
.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of aliases: utf-8, utf8,
   latin-1, latin1, iso-8859-1, mbcs (Windows only), ascii, utf-16,
   and utf-32. Using alternative spellings for these encodings may
   result in slower execution.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible
Georg Brandl44ea77b2013-03-28 13:28:44 +0100963.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
964
Georg Brandl116aa622007-08-15 14:28:22 +0000965+-----------------+--------------------------------+--------------------------------+
966| Codec | Aliases | Languages |
967+=================+================================+================================+
968| ascii | 646, us-ascii | English |
969+-----------------+--------------------------------+--------------------------------+
970| big5 | big5-tw, csbig5 | Traditional Chinese |
971+-----------------+--------------------------------+--------------------------------+
972| big5hkscs | big5-hkscs, hkscs | Traditional Chinese |
973+-----------------+--------------------------------+--------------------------------+
974| cp037 | IBM037, IBM039 | English |
975+-----------------+--------------------------------+--------------------------------+
R David Murrayc4c7b1c2014-03-07 21:00:34 -0500976| cp273 | 273, IBM273, csIBM273 | German |
977| | | |
978| | | .. versionadded:: 3.4 |
979+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +0000980| cp424 | EBCDIC-CP-HE, IBM424 | Hebrew |
981+-----------------+--------------------------------+--------------------------------+
982| cp437 | 437, IBM437 | English |
983+-----------------+--------------------------------+--------------------------------+
984| cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, | Western Europe |
985| | IBM500 | |
986+-----------------+--------------------------------+--------------------------------+
Amaury Forgeot d'Arcae6388d2009-07-15 19:21:18 +0000987| cp720 | | Arabic |
988+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +0000989| cp737 | | Greek |
990+-----------------+--------------------------------+--------------------------------+
991| cp775 | IBM775 | Baltic languages |
992+-----------------+--------------------------------+--------------------------------+
993| cp850 | 850, IBM850 | Western Europe |
994+-----------------+--------------------------------+--------------------------------+
995| cp852 | 852, IBM852 | Central and Eastern Europe |
996+-----------------+--------------------------------+--------------------------------+
997| cp855 | 855, IBM855 | Bulgarian, Byelorussian, |
998| | | Macedonian, Russian, Serbian |
999+-----------------+--------------------------------+--------------------------------+
1000| cp856 | | Hebrew |
1001+-----------------+--------------------------------+--------------------------------+
1002| cp857 | 857, IBM857 | Turkish |
1003+-----------------+--------------------------------+--------------------------------+
Benjamin Peterson5a6214a2010-06-27 22:41:29 +00001004| cp858 | 858, IBM858 | Western Europe |
1005+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001006| cp860 | 860, IBM860 | Portuguese |
1007+-----------------+--------------------------------+--------------------------------+
1008| cp861 | 861, CP-IS, IBM861 | Icelandic |
1009+-----------------+--------------------------------+--------------------------------+
1010| cp862 | 862, IBM862 | Hebrew |
1011+-----------------+--------------------------------+--------------------------------+
1012| cp863 | 863, IBM863 | Canadian |
1013+-----------------+--------------------------------+--------------------------------+
1014| cp864 | IBM864 | Arabic |
1015+-----------------+--------------------------------+--------------------------------+
1016| cp865 | 865, IBM865 | Danish, Norwegian |
1017+-----------------+--------------------------------+--------------------------------+
1018| cp866 | 866, IBM866 | Russian |
1019+-----------------+--------------------------------+--------------------------------+
1020| cp869 | 869, CP-GR, IBM869 | Greek |
1021+-----------------+--------------------------------+--------------------------------+
1022| cp874 | | Thai |
1023+-----------------+--------------------------------+--------------------------------+
1024| cp875 | | Greek |
1025+-----------------+--------------------------------+--------------------------------+
1026| cp932 | 932, ms932, mskanji, ms-kanji | Japanese |
1027+-----------------+--------------------------------+--------------------------------+
1028| cp949 | 949, ms949, uhc | Korean |
1029+-----------------+--------------------------------+--------------------------------+
1030| cp950 | 950, ms950 | Traditional Chinese |
1031+-----------------+--------------------------------+--------------------------------+
1032| cp1006 | | Urdu |
1033+-----------------+--------------------------------+--------------------------------+
1034| cp1026 | ibm1026 | Turkish |
1035+-----------------+--------------------------------+--------------------------------+
Serhiy Storchakabe0c3252013-11-23 18:52:23 +02001036| cp1125 | 1125, ibm1125, cp866u, ruscii | Ukrainian |
1037| | | |
1038| | | .. versionadded:: 3.4 |
1039+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001040| cp1140 | ibm1140 | Western Europe |
1041+-----------------+--------------------------------+--------------------------------+
1042| cp1250 | windows-1250 | Central and Eastern Europe |
1043+-----------------+--------------------------------+--------------------------------+
1044| cp1251 | windows-1251 | Bulgarian, Byelorussian, |
1045| | | Macedonian, Russian, Serbian |
1046+-----------------+--------------------------------+--------------------------------+
1047| cp1252 | windows-1252 | Western Europe |
1048+-----------------+--------------------------------+--------------------------------+
1049| cp1253 | windows-1253 | Greek |
1050+-----------------+--------------------------------+--------------------------------+
1051| cp1254 | windows-1254 | Turkish |
1052+-----------------+--------------------------------+--------------------------------+
1053| cp1255 | windows-1255 | Hebrew |
1054+-----------------+--------------------------------+--------------------------------+
Benjamin Peterson4ac9ce42009-10-04 14:49:41 +00001055| cp1256 | windows-1256 | Arabic |
Georg Brandl116aa622007-08-15 14:28:22 +00001056+-----------------+--------------------------------+--------------------------------+
1057| cp1257 | windows-1257 | Baltic languages |
1058+-----------------+--------------------------------+--------------------------------+
1059| cp1258 | windows-1258 | Vietnamese |
1060+-----------------+--------------------------------+--------------------------------+
Victor Stinner2f3ca9f2011-10-27 01:38:56 +02001061| cp65001 | | Windows only: Windows UTF-8 |
1062| | | (``CP_UTF8``) |
1063| | | |
1064| | | .. versionadded:: 3.3 |
1065+-----------------+--------------------------------+--------------------------------+
Georg Brandl116aa622007-08-15 14:28:22 +00001066| euc_jp | eucjp, ujis, u-jis | Japanese |
1067+-----------------+--------------------------------+--------------------------------+
1068| euc_jis_2004 | jisx0213, eucjis2004 | Japanese |
1069+-----------------+--------------------------------+--------------------------------+
1070| euc_jisx0213 | eucjisx0213 | Japanese |
1071+-----------------+--------------------------------+--------------------------------+
1072| euc_kr | euckr, korean, ksc5601, | Korean |
1073| | ks_c-5601, ks_c-5601-1987, | |
1074| | ksx1001, ks_x-1001 | |
1075+-----------------+--------------------------------+--------------------------------+
1076| gb2312 | chinese, csiso58gb231280, euc- | Simplified Chinese |
1077| | cn, euccn, eucgb2312-cn, | |
1078| | gb2312-1980, gb2312-80, iso- | |
1079| | ir-58 | |
1080+-----------------+--------------------------------+--------------------------------+
1081| gbk | 936, cp936, ms936 | Unified Chinese |
1082+-----------------+--------------------------------+--------------------------------+
1083| gb18030 | gb18030-2000 | Unified Chinese |
1084+-----------------+--------------------------------+--------------------------------+
1085| hz | hzgb, hz-gb, hz-gb-2312 | Simplified Chinese |
1086+-----------------+--------------------------------+--------------------------------+
1087| iso2022_jp | csiso2022jp, iso2022jp, | Japanese |
1088| | iso-2022-jp | |
1089+-----------------+--------------------------------+--------------------------------+
1090| iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 | Japanese |
1091+-----------------+--------------------------------+--------------------------------+
1092| iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 | Japanese, Korean, Simplified |
1093| | | Chinese, Western Europe, Greek |
1094+-----------------+--------------------------------+--------------------------------+
1095| iso2022_jp_2004 | iso2022jp-2004, | Japanese |
1096| | iso-2022-jp-2004 | |
1097+-----------------+--------------------------------+--------------------------------+
1098| iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 | Japanese |
1099+-----------------+--------------------------------+--------------------------------+
1100| iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext | Japanese |
1101+-----------------+--------------------------------+--------------------------------+
1102| iso2022_kr | csiso2022kr, iso2022kr, | Korean |
1103| | iso-2022-kr | |
1104+-----------------+--------------------------------+--------------------------------+
1105| latin_1 | iso-8859-1, iso8859-1, 8859, | West Europe |
1106| | cp819, latin, latin1, L1 | |
1107+-----------------+--------------------------------+--------------------------------+
1108| iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe |
1109+-----------------+--------------------------------+--------------------------------+
1110| iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese |
1111+-----------------+--------------------------------+--------------------------------+
Christian Heimesc3f30c42008-02-22 16:37:40 +00001112| iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (U+D800--U+DFFF) to be encoded. The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

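
The 3.4 change to surrogate handling can be observed directly. The following
is a minimal sketch, assuming Python 3.4 or later; the sample code point and
the variable names are chosen arbitrarily for illustration:

.. code-block:: python

   lone_surrogate = "\ud800"  # arbitrary lone surrogate code point

   # With the default 'strict' handler the utf-16 encoder refuses it.
   try:
       lone_surrogate.encode("utf-16-le")
       rejected = False
   except UnicodeEncodeError:
       rejected = True
   print(rejected)  # True

   # The standard 'surrogatepass' error handler opts back in explicitly.
   raw = lone_surrogate.encode("utf-16-le", "surrogatepass")
   print(raw)  # b'\x00\xd8'

Passing ``surrogatepass`` on both the encoding and the decoding side is one way
to round-trip such data when it cannot be avoided.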

Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated purpose describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| idna               |         | Implements :rfc:`3490`,   |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`     |
+--------------------+---------+---------------------------+
| mbcs               | dbcs    | Windows only: Encode      |
|                    |         | operand according to the  |
|                    |         | ANSI codepage (CP_ACP)    |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5    |
+--------------------+---------+---------------------------+
| punycode           |         | Implements :rfc:`3492`    |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Produce a string that is  |
|                    |         | suitable as raw Unicode   |
|                    |         | literal in Python source  |
|                    |         | code                      |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions. Can be   |
|                    |         | used as the system        |
|                    |         | encoding if no automatic  |
|                    |         | coercion between byte and |
|                    |         | Unicode strings is        |
|                    |         | desired.                  |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Produce a string that is  |
|                    |         | suitable as Unicode       |
|                    |         | literal in Python source  |
|                    |         | code                      |
+--------------------+---------+---------------------------+
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand                   |
|                    |         |                           |
|                    |         | .. deprecated:: 3.3       |
+--------------------+---------+---------------------------+

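
As a rough illustration of a few of the codecs in the table above, the sketch
below encodes short sample strings. The sample text and variable names are
arbitrary choices for this example, not part of the codec API:

.. code-block:: python

   text = "caf\u00e9 \u20ac"  # "café €", an arbitrary sample string

   # unicode_escape: every non-ASCII character becomes an escape sequence.
   escaped = text.encode("unicode_escape")
   print(escaped)  # b'caf\\xe9 \\u20ac'

   # raw_unicode_escape: Latin-1 characters stay single bytes, others use \uXXXX.
   raw_escaped = text.encode("raw_unicode_escape")
   print(raw_escaped)  # b'caf\xe9 \\u20ac'

   # punycode: the ASCII-compatible form that the idna codec builds upon.
   puny = "b\u00fccher".encode("punycode")
   print(puny)  # b'bcher-kva'

Each of these can be reversed with :meth:`bytes.decode` and the same codec
name.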

.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings.


.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
| Codec                | Aliases          | Purpose                      | Encoder / decoder            |
+======================+==================+==============================+==============================+
| base64_codec [#b64]_ | base64, base_64  | Convert operand to MIME      | :meth:`base64.encodebytes` / |
|                      |                  | base64 (the result always    | :meth:`base64.decodebytes`   |
|                      |                  | includes a trailing          |                              |
|                      |                  | ``'\n'``)                    |                              |
|                      |                  |                              |                              |
|                      |                  | .. versionchanged:: 3.4      |                              |
|                      |                  |    accepts any               |                              |
|                      |                  |    :term:`bytes-like object` |                              |
|                      |                  |    as input for encoding and |                              |
|                      |                  |    decoding                  |                              |
+----------------------+------------------+------------------------------+------------------------------+
| bz2_codec            | bz2              | Compress the operand         | :meth:`bz2.compress` /       |
|                      |                  | using bz2                    | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
| hex_codec            | hex              | Convert operand to           | :meth:`binascii.b2a_hex` /   |
|                      |                  | hexadecimal                  | :meth:`binascii.a2b_hex`     |
|                      |                  | representation, with two     |                              |
|                      |                  | digits per byte              |                              |
+----------------------+------------------+------------------------------+------------------------------+
| quopri_codec         | quopri,          | Convert operand to MIME      | :meth:`quopri.encodestring` /|
|                      | quotedprintable, | quoted printable             | :meth:`quopri.decodestring`  |
|                      | quoted_printable |                              |                              |
+----------------------+------------------+------------------------------+------------------------------+
| uu_codec             | uu               | Convert the operand using    | :meth:`uu.encode` /          |
|                      |                  | uuencode                     | :meth:`uu.decode`            |
+----------------------+------------------+------------------------------+------------------------------+
| zlib_codec           | zip, zlib        | Compress the operand         | :meth:`zlib.compress` /      |
|                      |                  | using zlib                   | :meth:`zlib.decompress`      |
+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.

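
Because these codecs map bytes to bytes, they are reached through
:func:`codecs.encode` and :func:`codecs.decode` rather than through
:meth:`str.encode`. A minimal sketch, using the canonical codec names from the
table and an arbitrary sample payload:

.. code-block:: python

   import codecs

   data = b"hello"  # arbitrary sample payload

   hex_form = codecs.encode(data, "hex_codec")
   print(hex_form)  # b'68656c6c6f'

   b64_form = codecs.encode(data, "base64_codec")
   print(b64_form)  # b'aGVsbG8=\n' (note the trailing newline)

   # Compression only pays off on larger, repetitive input.
   packed = codecs.encode(data * 100, "zlib_codec")
   print(len(packed) < len(data) * 100)  # True

   # Every transform is reversed with codecs.decode() and the same name.
   restored = codecs.decode(packed, "zlib_codec")

The shorter aliases such as ``hex`` or ``base64`` should behave the same way on
versions where they are available.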

.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping.

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| rot_13             | rot13   | Returns the Caesar-cypher |
|                    |         | encryption of the operand |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.

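
As with the binary transforms, a text transform is applied through
:func:`codecs.encode`, since it does not produce :class:`bytes`. A small
sketch with an arbitrary sample string:

.. code-block:: python

   import codecs

   secret = codecs.encode("Hello, World!", "rot_13")
   print(secret)  # Uryyb, Jbeyq!

   # rot_13 is its own inverse: applying it twice restores the input.
   plain = codecs.encode(secret, "rot_13")
   print(plain)  # Hello, World!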

:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode
before presenting them to the user.

Python supports this conversion in several ways. The ``idna`` codec performs
conversion between Unicode and ACE: it separates an input string into labels
based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
and converts each label to ACE as required; conversely, it separates an input
byte string into labels based on the ``.`` separator and converts any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1


When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.

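
A short sketch of calling these three functions directly. They operate on a
single label rather than on a full dotted host name; the sample label and the
variable names are arbitrary:

.. code-block:: python

   from encodings import idna

   # nameprep case-folds and normalizes a single label.
   prepped = idna.nameprep("B\u00dcCHER")
   print(prepped == "b\u00fccher")  # True

   # ToASCII returns the ACE form of the label as bytes.
   ace_label = idna.ToASCII("b\u00fccher")
   print(ace_label)  # b'xn--bcher-kva'

   # ToUnicode reverses the conversion.
   unicode_label = idna.ToUnicode(ace_label)
   print(unicode_label == "b\u00fccher")  # True

For whole host names, the ``idna`` codec is usually more convenient, since it
takes care of splitting the name into labels.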

:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

Encodes the operand according to the ANSI codepage (CP_ACP).

Availability: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8
encoded BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder
this is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data is skipped.

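
The behaviour described above can be illustrated with a brief sketch. The
sample text and variable names are arbitrary:

.. code-block:: python

   import codecs

   # One-shot encoding prepends the UTF-8 encoded BOM.
   encoded = "abc".encode("utf-8-sig")
   print(encoded)  # b'\xef\xbb\xbfabc'

   # Decoding skips the BOM if present and accepts data without one.
   with_bom = encoded.decode("utf-8-sig")
   without_bom = b"abc".decode("utf-8-sig")
   print(with_bom == without_bom == "abc")  # True

   # The stateful (incremental) encoder emits the BOM on the first call only.
   encoder = codecs.getincrementalencoder("utf-8-sig")()
   first = encoder.encode("ab")
   second = encoder.encode("c")
   print(first, second)  # b'\xef\xbb\xbfab' b'c'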