:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry which
manages the codec and error handling lookup process.

It defines the following functions:

.. function:: encode(obj, encoding='ascii', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``strict``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

   .. versionadded:: 2.4

.. function:: decode(obj, encoding='ascii', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``strict``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

   .. versionadded:: 2.4
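As a minimal sketch of the two functions above, the following round-trips a
short text through UTF-8 and shows the ``strict`` default rejecting input that
the target codec cannot represent. The sample text and variable names are
illustrative only; the ``u''`` and ``b''`` literals are used so that the snippet
should behave the same on Python 2.7 and on Python 3.

```python
import codecs

text = u"caf\u00e9"

# Encode text to UTF-8 bytes, then decode the bytes back to text.
encoded = codecs.encode(text, "utf-8")
decoded = codecs.decode(encoded, "utf-8")

# With the default 'strict' handler, input the codec cannot represent
# raises UnicodeEncodeError, which is a subclass of ValueError.
try:
    codecs.encode(text, "ascii")
    strict_failed = False
except ValueError:
    strict_failed = True
```

Catching :exc:`ValueError` here is deliberate: it covers the codec-specific
subclasses mentioned above without naming a particular one.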

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
   instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
   are expected to work in a stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

      ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

      ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.

   Possible values for errors are

   * ``'strict'``: raise an exception in case of an encoding error
   * ``'replace'``: replace malformed data with a suitable replacement marker,
     such as ``'?'`` or ``'\ufffd'``
   * ``'ignore'``: ignore malformed data and continue without further notice
   * ``'xmlcharrefreplace'``: replace with the appropriate XML character
     reference (for encoding only)
   * ``'backslashreplace'``: replace with backslashed escape sequences (for
     encoding only)

   as well as any other error handling name defined via :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.

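The search function contract described above can be sketched as follows. The
function name ``hypothetical_search`` and the encoding name
``hypothetical_utf8`` are made up for illustration; the sketch simply reuses
the :class:`CodecInfo` object of the existing UTF-8 codec rather than building
a new codec.

```python
import codecs

def hypothetical_search(name):
    """Map the made-up name 'hypothetical_utf8' onto the UTF-8 codec."""
    # The registry passes the encoding name in lower case.
    if name == "hypothetical_utf8":
        return codecs.lookup("utf-8")
    # Not an encoding this function knows about: return None so that
    # other registered search functions get a chance.
    return None

codecs.register(hypothetical_search)

encoded = codecs.encode(u"\u20ac", "hypothetical_utf8")
```

Returning ``None`` for every unknown name matters because the registry scans
all registered search functions in turn.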

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.
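A short sketch of :func:`lookup` in use: it fetches the UTF-8 codec, calls the
stateless ``encode`` and ``decode`` attributes of the returned
:class:`CodecInfo` object, and shows the :exc:`LookupError` raised for an
unknown name. The name ``no-such-codec-hypothetical`` is invented and assumed
not to be registered.

```python
import codecs

info = codecs.lookup("utf-8")

# The stateless functions return (output object, length consumed).
data, chars_consumed = info.encode(u"\u20ac")
text, bytes_consumed = info.decode(data)

try:
    codecs.lookup("no-such-codec-hypothetical")
    found = True
except LookupError:
    found = False
```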

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.

   .. versionadded:: 2.5


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.

   .. versionadded:: 2.5


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception or return a tuple with a
   replacement for the unencodable part of the input and a position where encoding
   should continue. The encoder will encode the replacement and continue encoding
   the original input at the specified position. Negative position values will be
   treated as being relative to the end of the input string. If the resulting
   position is out of bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.

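The handler protocol described for :func:`register_error` can be sketched as
below. The handler name ``hypothetical_codepoint`` and its ``[U+XXXX]`` output
format are inventions for this example; the sketch only handles encoding
errors and re-raises anything else.

```python
import codecs

def hypothetical_codepoint_handler(exc):
    """Replace each unencodable character with a '[U+XXXX]' marker."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only supports encoding
    # exc.object is the whole input; start/end delimit the bad run.
    bad = exc.object[exc.start:exc.end]
    replacement = u"".join(u"[U+%04X]" % ord(ch) for ch in bad)
    # Return the replacement and the position where encoding resumes.
    return replacement, exc.end

codecs.register_error("hypothetical_codepoint", hypothetical_codepoint_handler)

result = u"a\u20acb".encode("ascii", "hypothetical_codepoint")
```

Because the replacement is itself encoded by the codec, it has to consist of
characters the target encoding can represent, which is why the marker is plain
ASCII.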

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling: each encoding or decoding error
   raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling: malformed data is replaced with a
   suitable replacement character such as ``'?'`` in bytestrings and
   ``'\ufffd'`` in Unicode strings.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling (for encoding only): the
   unencodable character is replaced by a backslashed escape sequence.
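To compare the predefined handlers side by side, the following sketch encodes
the same text to ASCII under each of them and decodes one invalid byte with
``replace``. The sample text is arbitrary.

```python
text = u"caf\u00e9"

# The same unencodable character under each predefined handler.
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
backslashed = text.encode("ascii", "backslashreplace")

# On decoding, 'replace' substitutes U+FFFD for the malformed byte.
decoded = b"caf\xe9".decode("ascii", "replace")
```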

To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename[, mode[, encoding[, errors[, buffering]]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding. The default file mode is ``'r'``,
   meaning to open the file in read mode.

   .. note::

      The wrapped version will only accept the object format defined by the codecs,
      i.e. Unicode objects for most built-in codecs. Output is also codec-dependent
      and will usually be Unicode as well.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using 8-bit
      values. This means that no automatic conversion of ``'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.

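A minimal sketch of :func:`codecs.open`, assuming a throwaway temporary file:
text written through the wrapper is stored as UTF-8 bytes and comes back as
Unicode when read through the wrapper again. The file suffix and variable
names are arbitrary.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write Unicode text; the wrapper encodes it to UTF-8 on the way out.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(u"gr\u00fc\u00df\n")
    # Read it back through the wrapper, which decodes transparently.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Inspect the raw bytes that actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)
```

Reading the file again in plain binary mode makes the difference between the
decoded text and the stored bytes visible.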

.. function:: EncodedFile(file, input[, output[, errors]])

   Return a wrapped version of file which provides transparent encoding
   translation.

   Strings written to the wrapped file are interpreted according to the given
   *input* encoding and then written to the original file as strings using the
   *output* encoding. The intermediate encoding will usually be Unicode but depends
   on the specified codecs.

   If *output* is not given, it defaults to *input*.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes :exc:`ValueError` to be raised in case an encoding error occurs.

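The translation performed by :func:`EncodedFile` can be sketched with an
in-memory stream. This example assumes Latin-1 as the *input* encoding and
UTF-8 as the *output* encoding, passed positionally; ``io.BytesIO`` stands in
for a real file.

```python
import codecs
import io

raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")

# Latin-1 bytes go in; the underlying stream receives UTF-8 bytes.
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading reverses the translation: UTF-8 on the stream, Latin-1 out.
raw.seek(0)
read_back = wrapped.read()
```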

.. function:: iterencode(iterable, encoding[, errors])

   Uses an incremental encoder to iteratively encode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.

   .. versionadded:: 2.5


.. function:: iterdecode(iterable, encoding[, errors])

   Uses an incremental decoder to iteratively decode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.

   .. versionadded:: 2.5
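As an illustration of the two generators above, the sketch below decodes UTF-8
data that arrives in chunks split in the middle of a multi-byte character, and
encodes a list of text pieces. The chunk boundaries are chosen arbitrarily.

```python
import codecs

# The three-byte euro sign is split across two chunks.
chunks = [b"\xe2\x82", b"\xac10"]
decoded = u"".join(codecs.iterdecode(chunks, "utf-8"))

pieces = [u"\u20ac", u"10"]
encoded = b"".join(codecs.iterencode(pieces, "utf-8"))
```

Decoding each chunk separately with a stateless decoder would fail on the
first chunk; the incremental decoder used here buffers the incomplete sequence
instead.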

The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.

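One common use of these constants is detecting and removing a UTF-8 signature
from raw data. The helper name ``strip_utf8_bom`` below is hypothetical and not
part of the module.

```python
import codecs

def strip_utf8_bom(data):
    """Return *data* without a leading UTF-8 signature, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

payload = strip_utf8_bom(codecs.BOM_UTF8 + b"hello")
untouched = strip_utf8_bom(b"hello")

# BOM_UTF16 is whichever of the two variants matches the native byte order.
native_is_known = codecs.BOM_UTF16 in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```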

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in
Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`~Codec.encode` and
:meth:`~Codec.decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+

The set of allowed values can be extended via :func:`register_error`.


.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   While codecs are not restricted to use with Unicode, in a Unicode context,
   encoding converts a Unicode object to a plain string using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length consumed).
   In a Unicode context, decoding converts a plain string encoded using a
   particular character set encoding to a Unicode object.

   *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
   Python strings, buffer objects and memory mapped files are examples of objects
   providing this slot.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.

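The equivalence between incremental and stateless processing can be checked
with a small sketch. It splits UTF-8 data inside a multi-byte character, feeds
the pieces to an incremental decoder, and compares the joined result with a
single stateless call. The sample data and split point are arbitrary.

```python
import codecs

data = u"\u20ac uro".encode("utf-8")
pieces = [data[:2], data[2:]]  # split inside the three-byte euro sign

decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(piece) for piece in pieces]
# Signal the end of input so that any buffered bytes are flushed.
parts.append(decoder.decode(b"", final=True))
incremental = u"".join(parts)

# The stateless decoder sees the whole input in one call.
stateless, consumed = codecs.getdecoder("utf-8")(data)
```

The first piece produces an empty string because the decoder holds the
incomplete byte sequence back until the remaining byte arrives.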

.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: 2.5

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode` *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state.

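One way to observe the state an incremental encoder keeps is the UTF-16 codec,
which emits a byte order mark only at the start of the output. The sketch
below relies on that behaviour of the standard ``utf-16`` codec; the input
strings are arbitrary.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode(u"a")                # starts with the BOM
second = encoder.encode(u"b", final=True)   # no BOM: the encoder remembers

# reset() returns the encoder to its initial state, so the next
# call emits the BOM again.
encoder.reset()
after_reset = encoder.encode(u"a", final=True)
```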

.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode` *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. These values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method).


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.

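A stream writer is usually obtained through :func:`getwriter` rather than
instantiated directly. The sketch below wraps an in-memory binary stream
(``io.BytesIO``, standing in for a real file) and writes Unicode text that
ends up as UTF-8 bytes on the stream.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)

# write() encodes a single object; writelines() takes a list of strings.
writer.write(u"na\u00efve\n")
writer.writelines([u"one\n", u"two\n"])

written = raw.getvalue()
```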
609
610.. _stream-reader-objects:
611
612StreamReader Objects
613^^^^^^^^^^^^^^^^^^^^
614
615The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
616following methods which every stream reader must define in order to be
617compatible with the Python codec registry.
618
619
620.. class:: StreamReader(stream[, errors])
621
622 Constructor for a :class:`StreamReader` instance.
623
624 All stream readers must provide this constructor interface. They are free to add
625 additional keyword arguments, but only the ones defined here are used by the
626 Python codec registry.
627
628 *stream* must be a file-like object open for reading (binary) data.
629
630 The :class:`StreamReader` may implement different error handling schemes by
631 providing the *errors* keyword argument. These parameters are defined:
632
633 * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
634
635 * ``'ignore'`` Ignore the character and continue with the next.
636
637 * ``'replace'`` Replace with a suitable replacement character.
638
639 The *errors* argument will be assigned to an attribute of the same name.
640 Assigning to this attribute makes it possible to switch between different error
641 handling strategies during the lifetime of the :class:`StreamReader` object.
642
643 The set of allowed values for the *errors* argument can be extended with
644 :func:`register_error`.
645
646
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000647 .. method:: read([size[, chars, [firstline]]])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000648
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000649 Decodes data from the stream and returns the resulting object.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000650
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000651 *chars* indicates the number of characters to read from the
652 stream. :func:`read` will never return more than *chars* characters, but
653 it might return less, if there are not enough characters available.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000654
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000655 *size* indicates the approximate maximum number of bytes to read from the
656 stream for decoding purposes. The decoder can modify this setting as
657 appropriate. The default value -1 indicates to read and decode as much as
658 possible. *size* is intended to prevent having to decode huge files in
659 one step.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000660
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000661 *firstline* indicates that it would be sufficient to only return the first
662 line, if there are decoding errors on later lines.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000663
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000664 The method should use a greedy read strategy meaning that it should read
665 as much data as is allowed within the definition of the encoding and the
666 given size, e.g. if optional encoding endings or state markers are
667 available on the stream, these should be read too.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000668
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000669 .. versionchanged:: 2.4
670 *chars* argument added.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000671
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000672 .. versionchanged:: 2.4.2
673 *firstline* argument added.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000674
675
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000676 .. method:: readline([size[, keepends]])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000677
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000678 Read one line from the input stream and return the decoded data.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000679
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000680 *size*, if given, is passed as size argument to the stream's
Serhiy Storchaka6cda0ad2013-07-11 18:25:19 +0300681 :meth:`read` method.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000682
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000683 If *keepends* is false line-endings will be stripped from the lines
684 returned.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000685
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000686 .. versionchanged:: 2.4
687 *keepends* argument added.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000688
689
   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000700
701
   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.
708
Georg Brandl8ec7f652007-08-15 14:28:01 +0000709
In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.
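
As a rough illustration of the reader methods described above, the following
sketch wraps an in-memory byte stream with the UTF-8 :class:`StreamReader`
factory obtained from :func:`getreader`. The sample text and the variable names
are invented for the example.

```python
import codecs
import io

# Hypothetical sample data: two lines of UTF-8 encoded text.
raw = io.BytesIO(u"gr\xfc\xdf\nsecond line\n".encode("utf-8"))

# codecs.getreader() returns the StreamReader factory for the encoding.
reader = codecs.getreader("utf-8")(raw)

first = reader.readline()                # line ending is kept by default
rest = reader.readlines(keepends=False)  # remaining lines, endings stripped

print(repr(first))
print(rest)
```

Attributes that the reader does not define itself, such as ``tell``, are looked
up on the wrapped stream.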
712
The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.
715
716
.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^
721
The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.
727
728
.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.
735
:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
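
A minimal sketch of how such an instance might be built from the factories
returned by :func:`lookup`. An in-memory ``io.BytesIO`` object stands in for a
real file, and the names ``raw`` and ``stream`` are chosen only for this
example.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# Reader and Writer are the factory classes taken from the codec registry.
stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, "strict")

stream.write(u"caf\xe9\n")   # encoded to UTF-8 before reaching the stream
stored = raw.getvalue()

stream.seek(0)               # rewind the stream and reset the codec state
text = stream.read()         # decoded from UTF-8 on the way back

print(repr(stored))
print(repr(text))
```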
739
740
.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^
745
The :class:`StreamRecoder` provides a frontend-backend view of encoded data,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.
751
752
.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data returned by :meth:`read`
   and the data passed to :meth:`write`) while *Reader* and *Writer* work on the
   backend (reading from and writing to the stream).

   You can use these objects to do transparent direct recodings from e.g. Latin-1
   to UTF-8 and back.

   *stream* must be a file-like object.

   *encode* and *decode* must adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   *encode* and *decode* are needed for the frontend translation, *Reader* and
   *Writer* for the backend translation. The intermediate format used is
   determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
   as the intermediate encoding.

   Error handling is done in the same way as defined for the stream readers and
   writers.
776
Benjamin Petersonc7b05922008-04-25 01:29:10 +0000777
:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
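
The frontend and backend roles can be sketched as follows, assuming a caller
that works with Latin-1 bytes while the underlying stream stores UTF-8. The
``io.BytesIO`` backend and all variable names are illustrative only.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")
latin1 = codecs.lookup("latin-1")
backend = io.BytesIO()

# Frontend: Latin-1 (encode/decode). Backend: UTF-8 (Reader/Writer).
recoder = codecs.StreamRecoder(
    backend, latin1.encode, latin1.decode,
    utf8.streamreader, utf8.streamwriter, "strict")

recoder.write(b"caf\xe9")      # the caller hands over Latin-1 bytes
stored = backend.getvalue()    # the stream holds the UTF-8 form

backend.seek(0)
frontend_data = recoder.read() # the caller gets Latin-1 bytes back

print(repr(stored))
print(repr(frontend_data))
```

Here Unicode is the intermediate format: written data is decoded from Latin-1
and then encoded to UTF-8, and reading performs the reverse.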
781
782
.. _encodings-overview:

Encodings and Unicode
---------------------
787
Unicode strings are stored internally as sequences of codepoints (to be precise
as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
via ``--enable-unicode=ucs2`` or ``--enable-unicode=ucs4``, with the
former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
type. Once a Unicode object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue. Transforming a
unicode object into a sequence of bytes is called encoding and recreating the
unicode object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the codepoints 0-255 to
the bytes ``0x0``-``0xff``. This means that a unicode object that contains
codepoints above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). In that case :func:`unicode.encode` will
raise a :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError:
'latin-1' codec can't encode character u'\u1234' in position 3: ordinal not in
range(256)``.
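
The behaviour described above can be reproduced with a short sketch. The sample
strings are arbitrary, and the exception is stored in a variable named
``error`` only so that it can be inspected afterwards.

```python
# Codepoints up to U+00FF map directly to single bytes.
encoded = u"caf\xe9".encode("latin-1")
print(repr(encoded))

# U+1234 lies outside the Latin-1 range, so strict encoding fails.
text = u"abc\u1234"
error = None
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    error = exc
    print(exc.encoding, exc.start, exc.reason)

# A different error handler substitutes a replacement instead of raising.
replaced = text.encode("latin-1", "replace")
print(repr(replaced))
```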
804
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all unicode code points and how these codepoints are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
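
As an illustration, the sketch below inspects the 256-character string constant
of the ``cp1252`` codec, which is available as ``decoding_table`` in the
:mod:`encodings.cp1252` module, and compares it with Latin-1 for a single byte
value. The choice of the EURO SIGN is just an example.

```python
import encodings.cp1252

euro = u"\u20ac"  # EURO SIGN
table = encodings.cp1252.decoding_table

print(len(table))             # one entry per byte value
print(table[0x80] == euro)    # byte 0x80 is the EURO SIGN in cp1252

cp1252_bytes = euro.encode("cp1252")
latin1_char = b"\x80".decode("latin-1")  # the same byte is U+0080 in Latin-1

print(repr(cp1252_bytes))
print(repr(latin1_char))
```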
811
All of these encodings can only encode 256 of the 1114112 codepoints
defined in unicode. A simple and straightforward way to store each Unicode
code point is to use four consecutive bytes per codepoint. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though.

To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
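
The following sketch shows the byte orders and the two roles of ``U+FEFF`` for
a single character. It relies only on the ``BOM_UTF16*`` constants of this
module; the variable names are illustrative.

```python
import codecs

# Fixed byte order: four bytes per codepoint, no BOM is written.
be = u"A".encode("utf-32-be")
le = u"A".encode("utf-32-le")
print(repr(be), repr(le))

# The generic codec writes a BOM in the machine's native byte order.
with_bom = u"A".encode("utf-16")
print(with_bom.startswith(codecs.BOM_UTF16))

# On decoding, the generic codec uses the BOM to pick the byte order
# and removes it from the result.
from_be = (codecs.BOM_UTF16_BE + b"\x00A").decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + b"A\x00").decode("utf-16")

# With an explicit byte order, U+FEFF is decoded like any other character.
kept = (codecs.BOM_UTF16_BE + b"\x00A").decode("utf-16-be")
print(repr(from_be), repr(from_le), repr(kept))
```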
837
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode
characters are encoded like this (with x being payload bits, which when
concatenated give the Unicode character):
845
+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
857
The least significant bit of the Unicode character is the rightmost x bit.
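
To make the bit layout concrete, here is a small hand-written encoder that
follows the table row by row. It is only a sketch: the function name
``utf8_encode_codepoint`` is hypothetical, it is not how the built-in codec is
implemented, and it does not reject surrogates or values above ``U+10FFFF``.

```python
def utf8_encode_codepoint(cp):
    """Encode one codepoint (an integer) following the UTF-8 bit layout."""
    if cp < 0x80:
        parts = [cp]                                  # 0xxxxxxx
    elif cp < 0x800:
        parts = [0xC0 | (cp >> 6),                    # 110xxxxx
                 0x80 | (cp & 0x3F)]                  # 10xxxxxx
    elif cp < 0x10000:
        parts = [0xE0 | (cp >> 12),                   # 1110xxxx
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F)]
    else:
        parts = [0xF0 | (cp >> 18),                   # 11110xxx
                 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F)]
    return bytes(bytearray(parts))

one = utf8_encode_codepoint(0x41)       # LATIN CAPITAL LETTER A
two = utf8_encode_codepoint(0xE9)       # LATIN SMALL LETTER E WITH ACUTE
three = utf8_encode_codepoint(0x20AC)   # EURO SIGN
four = utf8_encode_codepoint(0x10348)   # GOTHIC LETTER HWAIR

# The results agree with the built-in codec for the BMP examples.
print(two == u"\xe9".encode("utf-8"), three == u"\u20ac".encode("utf-8"))
```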
859
As UTF-8 is an 8-bit encoding, no BOM is required and any ``U+FEFF`` character in
the decoded Unicode string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.
863
Without external information it's impossible to reliably determine which
encoding was used for encoding a Unicode string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig``
codec will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the
file. On decoding, ``utf-8-sig`` will skip those three bytes if they appear as
the first three bytes in the file. In UTF-8, the use of the BOM is discouraged
and should generally be avoided.
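
A brief sketch of the difference between the two codecs, using an arbitrary
two-character string:

```python
import codecs

signed = u"hi".encode("utf-8-sig")
print(repr(signed))                      # starts with the UTF-8 encoded BOM
print(signed.startswith(codecs.BOM_UTF8))

# utf-8-sig skips the signature; plain utf-8 keeps it as U+FEFF.
skipped = signed.decode("utf-8-sig")
kept = signed.decode("utf-8")

# Input without a signature is accepted by utf-8-sig as well.
plain = b"hi".decode("utf-8-sig")

# Read as iso-8859-1, the signature is the three characters listed above.
as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")
print(repr(skipped), repr(kept), repr(plain), repr(as_latin1))
```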
Georg Brandl8ec7f652007-08-15 14:28:01 +0000888
889
.. _standard-encodings:

Standard Encodings
------------------
894
Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
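
The alias rule can be checked with :func:`lookup`. The sketch below compares
the ``name`` attribute of the returned objects rather than assuming a
particular canonical spelling; the list of spellings is just a sample.

```python
import codecs

# Spellings that differ only in case, hyphen or underscore, plus an alias.
utf8_names = set(codecs.lookup(name).name
                 for name in ("utf_8", "utf-8", "UTF-8", "U8"))

latin1_names = set(codecs.lookup(name).name
                   for name in ("latin_1", "iso-8859-1", "L1"))

print(utf8_names)     # a single canonical name
print(latin1_names)   # likewise

# An unknown name raises LookupError.
try:
    codecs.lookup("no-such-encoding")
    found = True
except LookupError:
    found = False
print(found)
```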
Georg Brandl8ec7f652007-08-15 14:28:01 +0000902
Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible
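
The differences between these variants show up as soon as the same byte, or the
same character, is run through several codecs. The byte values below were
picked as examples; the codec names are taken from the table in this section.

```python
# The same byte decodes to different characters in different code sets.
currency = b"\xa4".decode("iso8859_1")    # CURRENCY SIGN
euro_iso = b"\xa4".decode("iso8859_15")   # EURO SIGN
euro_win = b"\x80".decode("cp1252")       # EURO SIGN in a control position
c_cedilla = b"\x80".decode("cp437")       # LATIN CAPITAL LETTER C WITH CEDILLA

# The IBM PC code page is ASCII compatible, the EBCDIC code page is not.
pc_a = u"A".encode("cp437")
ebcdic_a = u"A".encode("cp500")

print(repr(currency), repr(euro_iso), repr(euro_win), repr(c_cedilla))
print(repr(pc_a), repr(ebcdic_a))
```
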
916
Georg Brandl44ea77b2013-03-28 13:28:44 +0100917.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
918
Georg Brandl8ec7f652007-08-15 14:28:01 +0000919+-----------------+--------------------------------+--------------------------------+
920| Codec | Aliases | Languages |
921+=================+================================+================================+
922| ascii | 646, us-ascii | English |
923+-----------------+--------------------------------+--------------------------------+
924| big5 | big5-tw, csbig5 | Traditional Chinese |
925+-----------------+--------------------------------+--------------------------------+
926| big5hkscs | big5-hkscs, hkscs | Traditional Chinese |
927+-----------------+--------------------------------+--------------------------------+
928| cp037 | IBM037, IBM039 | English |
929+-----------------+--------------------------------+--------------------------------+
930| cp424 | EBCDIC-CP-HE, IBM424 | Hebrew |
931+-----------------+--------------------------------+--------------------------------+
932| cp437 | 437, IBM437 | English |
933+-----------------+--------------------------------+--------------------------------+
934| cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, | Western Europe |
935| | IBM500 | |
936+-----------------+--------------------------------+--------------------------------+
Amaury Forgeot d'Arc78c06bd2009-07-13 23:11:54 +0000937| cp720 | | Arabic |
938+-----------------+--------------------------------+--------------------------------+
Georg Brandl8ec7f652007-08-15 14:28:01 +0000939| cp737 | | Greek |
940+-----------------+--------------------------------+--------------------------------+
941| cp775 | IBM775 | Baltic languages |
942+-----------------+--------------------------------+--------------------------------+
943| cp850 | 850, IBM850 | Western Europe |
944+-----------------+--------------------------------+--------------------------------+
945| cp852 | 852, IBM852 | Central and Eastern Europe |
946+-----------------+--------------------------------+--------------------------------+
947| cp855 | 855, IBM855 | Bulgarian, Byelorussian, |
948| | | Macedonian, Russian, Serbian |
949+-----------------+--------------------------------+--------------------------------+
950| cp856 | | Hebrew |
951+-----------------+--------------------------------+--------------------------------+
952| cp857 | 857, IBM857 | Turkish |
953+-----------------+--------------------------------+--------------------------------+
Georg Brandlf0757a22010-05-24 21:29:07 +0000954| cp858 | 858, IBM858 | Western Europe |
955+-----------------+--------------------------------+--------------------------------+
Georg Brandl8ec7f652007-08-15 14:28:01 +0000956| cp860 | 860, IBM860 | Portuguese |
957+-----------------+--------------------------------+--------------------------------+
958| cp861 | 861, CP-IS, IBM861 | Icelandic |
959+-----------------+--------------------------------+--------------------------------+
960| cp862 | 862, IBM862 | Hebrew |
961+-----------------+--------------------------------+--------------------------------+
962| cp863 | 863, IBM863 | Canadian |
963+-----------------+--------------------------------+--------------------------------+
964| cp864 | IBM864 | Arabic |
965+-----------------+--------------------------------+--------------------------------+
966| cp865 | 865, IBM865 | Danish, Norwegian |
967+-----------------+--------------------------------+--------------------------------+
968| cp866 | 866, IBM866 | Russian |
969+-----------------+--------------------------------+--------------------------------+
970| cp869 | 869, CP-GR, IBM869 | Greek |
971+-----------------+--------------------------------+--------------------------------+
972| cp874 | | Thai |
973+-----------------+--------------------------------+--------------------------------+
974| cp875 | | Greek |
975+-----------------+--------------------------------+--------------------------------+
976| cp932 | 932, ms932, mskanji, ms-kanji | Japanese |
977+-----------------+--------------------------------+--------------------------------+
978| cp949 | 949, ms949, uhc | Korean |
979+-----------------+--------------------------------+--------------------------------+
980| cp950 | 950, ms950 | Traditional Chinese |
981+-----------------+--------------------------------+--------------------------------+
982| cp1006 | | Urdu |
983+-----------------+--------------------------------+--------------------------------+
984| cp1026 | ibm1026 | Turkish |
985+-----------------+--------------------------------+--------------------------------+
986| cp1140 | ibm1140 | Western Europe |
987+-----------------+--------------------------------+--------------------------------+
988| cp1250 | windows-1250 | Central and Eastern Europe |
989+-----------------+--------------------------------+--------------------------------+
990| cp1251 | windows-1251 | Bulgarian, Byelorussian, |
991| | | Macedonian, Russian, Serbian |
992+-----------------+--------------------------------+--------------------------------+
993| cp1252 | windows-1252 | Western Europe |
994+-----------------+--------------------------------+--------------------------------+
995| cp1253 | windows-1253 | Greek |
996+-----------------+--------------------------------+--------------------------------+
997| cp1254 | windows-1254 | Turkish |
998+-----------------+--------------------------------+--------------------------------+
999| cp1255 | windows-1255 | Hebrew |
1000+-----------------+--------------------------------+--------------------------------+
Georg Brandlac870772009-09-22 10:55:08 +00001001| cp1256 | windows-1256 | Arabic |
Georg Brandl8ec7f652007-08-15 14:28:01 +00001002+-----------------+--------------------------------+--------------------------------+
1003| cp1257 | windows-1257 | Baltic languages |
1004+-----------------+--------------------------------+--------------------------------+
1005| cp1258 | windows-1258 | Vietnamese |
1006+-----------------+--------------------------------+--------------------------------+
1007| euc_jp | eucjp, ujis, u-jis | Japanese |
1008+-----------------+--------------------------------+--------------------------------+
1009| euc_jis_2004 | jisx0213, eucjis2004 | Japanese |
1010+-----------------+--------------------------------+--------------------------------+
1011| euc_jisx0213 | eucjisx0213 | Japanese |
1012+-----------------+--------------------------------+--------------------------------+
1013| euc_kr | euckr, korean, ksc5601, | Korean |
1014| | ks_c-5601, ks_c-5601-1987, | |
1015| | ksx1001, ks_x-1001 | |
1016+-----------------+--------------------------------+--------------------------------+
1017| gb2312 | chinese, csiso58gb231280, euc- | Simplified Chinese |
1018| | cn, euccn, eucgb2312-cn, | |
1019| | gb2312-1980, gb2312-80, iso- | |
1020| | ir-58 | |
1021+-----------------+--------------------------------+--------------------------------+
1022| gbk | 936, cp936, ms936 | Unified Chinese |
1023+-----------------+--------------------------------+--------------------------------+
1024| gb18030 | gb18030-2000 | Unified Chinese |
1025+-----------------+--------------------------------+--------------------------------+
1026| hz | hzgb, hz-gb, hz-gb-2312 | Simplified Chinese |
1027+-----------------+--------------------------------+--------------------------------+
1028| iso2022_jp | csiso2022jp, iso2022jp, | Japanese |
1029| | iso-2022-jp | |
1030+-----------------+--------------------------------+--------------------------------+
1031| iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 | Japanese |
1032+-----------------+--------------------------------+--------------------------------+
1033| iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 | Japanese, Korean, Simplified |
1034| | | Chinese, Western Europe, Greek |
1035+-----------------+--------------------------------+--------------------------------+
1036| iso2022_jp_2004 | iso2022jp-2004, | Japanese |
1037| | iso-2022-jp-2004 | |
1038+-----------------+--------------------------------+--------------------------------+
1039| iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 | Japanese |
1040+-----------------+--------------------------------+--------------------------------+
1041| iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext | Japanese |
1042+-----------------+--------------------------------+--------------------------------+
1043| iso2022_kr | csiso2022kr, iso2022kr, | Korean |
1044| | iso-2022-kr | |
1045+-----------------+--------------------------------+--------------------------------+
1046| latin_1 | iso-8859-1, iso8859-1, 8859, | West Europe |
1047| | cp819, latin, latin1, L1 | |
1048+-----------------+--------------------------------+--------------------------------+
1049| iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe |
1050+-----------------+--------------------------------+--------------------------------+
1051| iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese |
1052+-----------------+--------------------------------+--------------------------------+
Georg Brandl907a7202008-02-22 12:31:45 +00001053| iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages |
Georg Brandl8ec7f652007-08-15 14:28:01 +00001054+-----------------+--------------------------------+--------------------------------+
1055| iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, |
1056| | | Macedonian, Russian, Serbian |
1057+-----------------+--------------------------------+--------------------------------+
1058| iso8859_6 | iso-8859-6, arabic | Arabic |
1059+-----------------+--------------------------------+--------------------------------+
1060| iso8859_7 | iso-8859-7, greek, greek8 | Greek |
1061+-----------------+--------------------------------+--------------------------------+
1062| iso8859_8 | iso-8859-8, hebrew | Hebrew |
1063+-----------------+--------------------------------+--------------------------------+
1064| iso8859_9 | iso-8859-9, latin5, L5 | Turkish |
1065+-----------------+--------------------------------+--------------------------------+
1066| iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages |
1067+-----------------+--------------------------------+--------------------------------+
Georg Brandl65db5872010-03-14 09:55:08 +00001068| iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages |
Georg Brandl8ec7f652007-08-15 14:28:01 +00001069+-----------------+--------------------------------+--------------------------------+
1070| iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages |
1071+-----------------+--------------------------------+--------------------------------+
Georg Brandl65db5872010-03-14 09:55:08 +00001072| iso8859_15 | iso-8859-15, latin9, L9 | Western Europe |
1073+-----------------+--------------------------------+--------------------------------+
1074| iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe |
Georg Brandl8ec7f652007-08-15 14:28:01 +00001075+-----------------+--------------------------------+--------------------------------+
1076| johab | cp1361, ms1361 | Korean |
1077+-----------------+--------------------------------+--------------------------------+
1078| koi8_r | | Russian |
1079+-----------------+--------------------------------+--------------------------------+
1080| koi8_u | | Ukrainian |
1081+-----------------+--------------------------------+--------------------------------+
1082| mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, |
1083| | | Macedonian, Russian, Serbian |
1084+-----------------+--------------------------------+--------------------------------+
1085| mac_greek | macgreek | Greek |
1086+-----------------+--------------------------------+--------------------------------+
1087| mac_iceland | maciceland | Icelandic |
1088+-----------------+--------------------------------+--------------------------------+
1089| mac_latin2 | maclatin2, maccentraleurope | Central and Eastern Europe |
1090+-----------------+--------------------------------+--------------------------------+
1091| mac_roman | macroman | Western Europe |
1092+-----------------+--------------------------------+--------------------------------+
1093| mac_turkish | macturkish | Turkish |
1094+-----------------+--------------------------------+--------------------------------+
1095| ptcp154 | csptcp154, pt154, cp154, | Kazakh |
1096| | cyrillic-asian | |
1097+-----------------+--------------------------------+--------------------------------+
1098| shift_jis | csshiftjis, shiftjis, sjis, | Japanese |
1099| | s_jis | |
1100+-----------------+--------------------------------+--------------------------------+
1101| shift_jis_2004 | shiftjis2004, sjis_2004, | Japanese |
1102| | sjis2004 | |
1103+-----------------+--------------------------------+--------------------------------+
1104| shift_jisx0213 | shiftjisx0213, sjisx0213, | Japanese |
1105| | s_jisx0213 | |
1106+-----------------+--------------------------------+--------------------------------+
Walter Dörwald6e390802007-08-17 16:41:28 +00001107| utf_32 | U32, utf32 | all languages |
1108+-----------------+--------------------------------+--------------------------------+
1109| utf_32_be | UTF-32BE | all languages |
1110+-----------------+--------------------------------+--------------------------------+
1111| utf_32_le | UTF-32LE | all languages |
1112+-----------------+--------------------------------+--------------------------------+
Georg Brandl8ec7f652007-08-15 14:28:01 +00001113| utf_16 | U16, utf16 | all languages |
1114+-----------------+--------------------------------+--------------------------------+
1115| utf_16_be | UTF-16BE | all languages (BMP only) |
1116+-----------------+--------------------------------+--------------------------------+
1117| utf_16_le | UTF-16LE | all languages (BMP only) |
1118+-----------------+--------------------------------+--------------------------------+
1119| utf_7 | U7, unicode-1-1-utf-7 | all languages |
1120+-----------------+--------------------------------+--------------------------------+
1121| utf_8 | U8, UTF, utf8 | all languages |
1122+-----------------+--------------------------------+--------------------------------+
1123| utf_8_sig | | all languages |
1124+-----------------+--------------------------------+--------------------------------+
1125
Python Specific Encodings
-------------------------
Georg Brandl8ec7f652007-08-15 14:28:01 +00001128
A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated purpose describes the encoding direction.

The following codecs provide unicode-to-str encoding [#encoding-note]_ and
str-to-unicode decoding [#decoding-note]_, similar to the Unicode text
encodings.
Georg Brandl116aa622007-08-15 14:28:22 +00001139
Serhiy Storchaka54f70922013-05-22 15:28:30 +03001140.. tabularcolumns:: |l|L|L|
1141
1142+--------------------+---------------------------+---------------------------+
1143| Codec | Aliases | Purpose |
1144+====================+===========================+===========================+
1145| idna | | Implements :rfc:`3490`, |
1146| | | see also |
1147| | | :mod:`encodings.idna` |
1148+--------------------+---------------------------+---------------------------+
1149| mbcs | dbcs | Windows only: Encode |
1150| | | operand according to the |
1151| | | ANSI codepage (CP_ACP) |
1152+--------------------+---------------------------+---------------------------+
1153| palmos | | Encoding of PalmOS 3.5 |
1154+--------------------+---------------------------+---------------------------+
1155| punycode | | Implements :rfc:`3492` |
1156+--------------------+---------------------------+---------------------------+
1157| raw_unicode_escape | | Produce a string that is |
1158| | | suitable as raw Unicode |
1159| | | literal in Python source |
1160| | | code |
1161+--------------------+---------------------------+---------------------------+
1162| rot_13 | rot13 | Returns the Caesar-cypher |
1163| | | encryption of the operand |
1164+--------------------+---------------------------+---------------------------+
1165| undefined | | Raise an exception for |
1166| | | all conversions. Can be |
1167| | | used as the system |
1168| | | encoding if no automatic |
1169| | | :term:`coercion` between |
1170| | | byte and Unicode strings |
1171| | | is desired. |
1172+--------------------+---------------------------+---------------------------+
1173| unicode_escape | | Produce a string that is |
1174| | | suitable as Unicode |
1175| | | literal in Python source |
1176| | | code |
1177+--------------------+---------------------------+---------------------------+
1178| unicode_internal | | Return the internal |
1179| | | representation of the |
1180| | | operand |
1181+--------------------+---------------------------+---------------------------+
Georg Brandl8ec7f652007-08-15 14:28:01 +00001182
.. versionadded:: 2.3
   The ``idna`` and ``punycode`` encodings.
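
A few of the codecs from the table above, applied to arbitrary sample text. The
strings are examples only; ``example`` is used as a placeholder domain label.

```python
text = u"caf\xe9 \u20ac"

# Escape codecs produce output usable inside a Python Unicode literal.
escaped = text.encode("unicode_escape")
raw_escaped = text.encode("raw_unicode_escape")
print(repr(escaped))
print(repr(raw_escaped))

# punycode works on a single label, idna on a whole domain name.
label = u"b\xfccher".encode("punycode")
domain = u"b\xfccher.example".encode("idna")
print(repr(label))
print(repr(domain))

# Decoding reverses the transformation.
print(repr(domain.decode("idna")))
```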
1185
The following codecs provide str-to-str encoding and decoding
[#decoding-note]_.
1188
.. tabularcolumns:: |l|L|L|L|

+--------------------+---------------------------+---------------------------+------------------------------+
| Codec              | Aliases                   | Purpose                   | Encoder/decoder              |
+====================+===========================+===========================+==============================+
| base64_codec       | base64, base-64           | Convert operand to MIME   | :func:`base64.encodestring`, |
|                    |                           | base64 (the result always | :func:`base64.decodestring`  |
|                    |                           | includes a trailing       |                              |
|                    |                           | ``'\n'``)                 |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| bz2_codec          | bz2                       | Compress the operand      | :func:`bz2.compress`,        |
|                    |                           | using bz2                 | :func:`bz2.decompress`       |
+--------------------+---------------------------+---------------------------+------------------------------+
| hex_codec          | hex                       | Convert operand to        | :func:`binascii.b2a_hex`,    |
|                    |                           | hexadecimal               | :func:`binascii.a2b_hex`     |
|                    |                           | representation, with two  |                              |
|                    |                           | digits per byte           |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   | :func:`quopri.encodestring`, |
|                    | quotedprintable           | quoted printable          | :func:`quopri.decodestring`  |
+--------------------+---------------------------+---------------------------+------------------------------+
| string_escape      |                           | Produce a string that is  |                              |
|                    |                           | suitable as a string      |                              |
|                    |                           | literal in Python source  |                              |
|                    |                           | code                      |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| uu_codec           | uu                        | Convert the operand using | :func:`uu.encode`,           |
|                    |                           | uuencode                  | :func:`uu.decode`            |
+--------------------+---------------------------+---------------------------+------------------------------+
| zlib_codec         | zip, zlib                 | Compress the operand      | :func:`zlib.compress`,       |
|                    |                           | using zlib                | :func:`zlib.decompress`      |
+--------------------+---------------------------+---------------------------+------------------------------+

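A minimal sketch of how some of these codecs can be used through
:func:`codecs.encode` and :func:`codecs.decode`. The input value and variable
names are made up for illustration; the full codec names are used rather than
the aliases.

```python
import codecs

data = b"hello world"

# base64_codec: MIME base64; note the trailing newline in the result.
b64 = codecs.encode(data, "base64_codec")
assert b64 == b"aGVsbG8gd29ybGQ=\n"

# hex_codec: two hexadecimal digits per input byte.
hexed = codecs.encode(data, "hex_codec")
assert hexed == b"68656c6c6f20776f726c64"

# zlib_codec: compression is reversed by decoding with the same codec.
packed = codecs.encode(data, "zlib_codec")
assert codecs.decode(packed, "zlib_codec") == data
```
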
.. [#encoding-note] str objects are also accepted as input in place of unicode
   objects. They are implicitly converted to unicode by decoding them using
   the default encoding. If this conversion fails, it may lead to encoding
   operations raising :exc:`UnicodeDecodeError`.

.. [#decoding-note] unicode objects are also accepted as input in place of str
   objects. They are implicitly converted to str by encoding them using the
   default encoding. If this conversion fails, it may lead to decoding
   operations raising :exc:`UnicodeEncodeError`.


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

.. versionadded:: 2.3

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode before
presenting them to the user.

Python supports this conversion in several ways. The ``idna`` codec converts
between Unicode and ACE: when encoding, it separates the input string into
labels based on the separator characters defined in `section 3.1`_ (1) of
:rfc:`3490` and converts each label to ACE as required; when decoding, it
separates the input byte string into labels based on the ``.`` separator and
converts any ACE labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
(:mod:`httplib` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

.. _section 3.1: http://tools.ietf.org/html/rfc3490#section-3.1

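The conversion performed by the ``idna`` codec can be sketched as follows,
using the example host name from the text above. The variable names are
illustrative only.

```python
# Example host name taken from the text above.
name = u"www.Alliancefran\xe7aise.nu"

# Encoding converts each non-ASCII label to its ACE form.
ace = name.encode("idna")
assert ace == b"www.xn--alliancefranaise-npb.nu"

# Decoding converts ACE labels back to Unicode. Nameprep folds the
# label to lower case during encoding, so the original case is not restored.
decoded = ace.decode("idna")
assert decoded.lower() == name.lower()
```
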
When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

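One possible way for an application to do this is sketched below. The helper
name ``display_hostname`` is hypothetical, and the ASCII fallback is an
assumption about what an application might want, not behaviour of the codec.

```python
def display_hostname(raw_name):
    """Return a Unicode form of a host name received from the wire."""
    try:
        # Convert any ACE labels to Unicode.
        return raw_name.decode("idna")
    except UnicodeError:
        # Assumed fallback: show the raw name with undecodable bytes replaced.
        return raw_name.decode("ascii", "replace")


shown = display_hostname(b"www.xn--alliancefranaise-npb.nu")
```
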
The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.


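A short sketch of calling these functions directly on a single label. The
label is taken from the example host name used earlier in this section; the
variable names are illustrative.

```python
from encodings import idna

label = u"Alliancefran\xe7aise"

# nameprep() normalizes the label; among other things, it folds case.
prepped = idna.nameprep(label)
assert prepped == u"alliancefran\xe7aise"

# ToASCII() produces the ACE form of a single label.
ace_label = idna.ToASCII(label)
assert ace_label == b"xn--alliancefranaise-npb"

# ToUnicode() reverses the conversion, yielding the nameprepped label.
assert idna.ToUnicode(ace_label) == prepped
```
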
:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

.. versionadded:: 2.5

This module implements a variant of the UTF-8 codec: on encoding, a UTF-8 encoded
BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder, this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data is skipped.

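The behaviour described above can be sketched as follows. The sample text and
variable names are illustrative; the stateful case uses an incremental encoder
obtained from :func:`codecs.getincrementalencoder`.

```python
import codecs

text = u"spam"

# Encoding prepends the UTF-8 encoded BOM.
encoded = text.encode("utf-8-sig")
assert encoded == codecs.BOM_UTF8 + b"spam"

# Decoding skips the BOM if present, and also accepts data without one.
assert encoded.decode("utf-8-sig") == text
assert b"spam".decode("utf-8-sig") == text

# A stateful encoder writes the BOM only once, on the first call.
encoder = codecs.getincrementalencoder("utf-8-sig")()
stream = encoder.encode(u"sp") + encoder.encode(u"am", True)
assert stream == codecs.BOM_UTF8 + b"spam"
```
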