:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry which
manages the codec and error handling lookup process.

It defines the following functions:


.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
   :ref:`codec-objects`). The functions/methods are expected to work in a
   stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

   ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

   ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.

   Possible values for *errors* are ``'strict'`` (raise an exception in case of an
   encoding error), ``'replace'`` (replace malformed data with a suitable
   replacement marker, such as ``'?'``), ``'ignore'`` (ignore malformed data and
   continue without further notice), ``'xmlcharrefreplace'`` (replace with the
   appropriate XML character reference; for encoding only) and
   ``'backslashreplace'`` (replace with backslashed escape sequences; for encoding
   only), as well as any other error handling name defined via
   :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.

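The search-function protocol can be sketched as follows. This is only an
illustration and assumes Python 3: the codec name ``xreverse`` and the helper
functions are made up for the example, and only the stateless ``encode`` and
``decode`` entries of the :class:`CodecInfo` object are filled in.

```python
import codecs

def reverse_encode(text, errors="strict"):
    # Stateless encoder: returns (output object, length consumed).
    return text[::-1].encode("utf-8", errors), len(text)

def reverse_decode(data, errors="strict"):
    # Stateless decoder: accepts any bytes-like object.
    raw = bytes(data)
    return raw.decode("utf-8", errors)[::-1], len(raw)

def search(name):
    # The registry passes the encoding name in lower case.
    if name != "xreverse":
        return None  # not ours; let other search functions try
    return codecs.CodecInfo(
        encode=reverse_encode,
        decode=reverse_decode,
        name="xreverse",  # hypothetical codec name
    )

codecs.register(search)

encoded = "abc".encode("xreverse")
decoded = encoded.decode("xreverse")
```

Returning ``None`` for every other name matters: the registry calls each
registered search function in turn until one of them returns a
:class:`CodecInfo` object.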

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.

   .. versionadded:: 2.5


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.

   .. versionadded:: 2.5


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

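A short sketch of :func:`lookup` and the accessor functions above, assuming
Python 3 and the built-in UTF-8 codec. The encoding name
``no-such-encoding`` is deliberately bogus.

```python
import codecs
import io

info = codecs.lookup("UTF8")            # aliases resolve to one CodecInfo
encode = codecs.getencoder("utf-8")     # stateless encoder function
data, consumed = encode("é")            # (output object, length consumed)

make_decoder = codecs.getincrementaldecoder("utf-8")
decoder = make_decoder(errors="strict") # factory(errors='strict')

buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)  # factory(stream, errors='strict')
writer.write("é")

try:
    codecs.lookup("no-such-encoding")
    found = True
except LookupError:
    found = False
```

Each accessor returns a function or factory, not an instance, so incremental
and stream codecs still have to be instantiated by the caller.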

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception, or return a tuple with
   a replacement for the unencodable part of the input and a position where
   encoding should continue. The encoder will encode the replacement and continue
   encoding the original input at the specified position. Negative position values
   will be treated as being relative to the end of the input string. If the
   resulting position is out of bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.

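As a minimal sketch of a custom handler (assuming Python 3), the function
below replaces every unencodable character with an underscore. The handler
name ``example.underscore`` and the function name are invented for this
illustration.

```python
import codecs

def underscore_errors(exc):
    # Only encoding errors are handled here; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start and exc.end delimit the unencodable slice of exc.object.
    replacement = "_" * (exc.end - exc.start)
    return replacement, exc.end  # resume right after the bad slice

codecs.register_error("example.underscore", underscore_errors)

encoded = "naïve café".encode("ascii", errors="example.underscore")
```

The second item of the returned tuple is the position in the original input
at which encoding resumes, so returning ``exc.end`` skips exactly the
characters that were replaced.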

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling.

To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename, mode[, encoding[, errors[, buffering]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding.

   .. note::

      The wrapped version will only accept the object format defined by the codecs,
      i.e. Unicode objects for most built-in codecs. Output is also codec-dependent
      and will usually be Unicode as well.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.


.. function:: EncodedFile(file, input[, output[, errors]])

   Return a wrapped version of file which provides transparent encoding
   translation.

   Strings written to the wrapped file are interpreted according to the given
   *input* encoding and then written to the original file as strings using the
   *output* encoding. The intermediate encoding will usually be Unicode but depends
   on the specified codecs.

   If *output* is not given, it defaults to *input*.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes :exc:`ValueError` to be raised in case an encoding error occurs.

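The translation performed by :func:`EncodedFile` can be sketched with an
in-memory file. The example assumes Python 3, where the data exchanged with
the wrapper is a bytes object; the caller works in Latin-1 while the
underlying file holds UTF-8.

```python
import codecs
import io

raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")

wrapped.write("é".encode("latin-1"))  # caller supplies Latin-1 bytes
stored = raw.getvalue()               # underlying file holds UTF-8

raw.seek(0)
reread = wrapped.read()               # translated back to Latin-1
```

The second positional argument plays the role of *input* and the third the
role of *output* as described above.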

.. function:: iterencode(iterable, encoding[, errors])

   Uses an incremental encoder to iteratively encode the input provided by
   *iterable*. This function is a generator. *errors* (as well as any other keyword
   argument) is passed through to the incremental encoder.

   .. versionadded:: 2.5


.. function:: iterdecode(iterable, encoding[, errors])

   Uses an incremental decoder to iteratively decode the input provided by
   *iterable*. This function is a generator. *errors* (as well as any other keyword
   argument) is passed through to the incremental decoder.

   .. versionadded:: 2.5

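A small sketch of both generators, assuming Python 3. The byte chunks are
deliberately split in the middle of the three-byte UTF-8 sequence for the
euro sign to show that the incremental decoder carries state between chunks.

```python
import codecs

# Chunk boundaries fall inside the euro sign (b"\xe2\x82\xac").
chunks = [b"\xe2\x82", b"\xac1", b"0"]
text = "".join(codecs.iterdecode(chunks, "utf-8"))

# Each input string is encoded as it is consumed from the iterable.
encoded = list(codecs.iterencode(["a", "é"], "utf-8"))
```

Decoding each chunk separately with a stateless decoder would fail on the
first chunk, because it ends with an incomplete byte sequence.
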
The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.

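One possible use of these constants is guessing an encoding from the first
bytes of a file. The helper ``sniff_bom`` below is a hypothetical function
written for this illustration (it is not part of the module), and the sketch
assumes Python 3.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
native = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE

def sniff_bom(data):
    """Return a codec name guessed from a leading BOM, or None."""
    # UTF-32 is tested before UTF-16 because BOM_UTF32_LE
    # starts with the same two bytes as BOM_UTF16_LE.
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None
```

A BOM is only a hint: data without one yields ``None`` here, and the caller
still has to choose an encoding by other means.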

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`encode` and
:meth:`decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+

The set of allowed values can be extended via :func:`register_error`.

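The effect of each value in the table can be observed with the built-in
ASCII and UTF-8 codecs. This sketch assumes Python 3; the sample strings are
arbitrary.

```python
text = "é€"   # neither character exists in ASCII

results = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

try:
    text.encode("ascii", errors="strict")
    strict_raised = False
except UnicodeError:
    strict_raised = True

# On decoding, 'replace' inserts U+FFFD for the malformed input.
decoded = b"caf\xe9".decode("utf-8", errors="replace")
```

Only ``'strict'``, ``'ignore'`` and ``'replace'`` are exercised for decoding
here, in line with the "only for encoding" remarks in the table.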

.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   While codecs are not restricted to use with Unicode, in a Unicode context,
   encoding converts a Unicode object to a plain string using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length consumed).
   In a Unicode context, decoding converts a plain string encoded using a
   particular character set encoding to a Unicode object.

   *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
   Python strings, buffer objects and memory mapped files are examples of objects
   providing this slot.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the :meth:`encode`/:meth:`decode` method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of the
encoding/decoding process during method calls.

The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.

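The equivalence between incremental and stateless processing can be checked
directly. The following sketch assumes Python 3 and the built-in UTF-8 codec;
the input is split inside a multi-byte character on purpose.

```python
import codecs

data = "€uro".encode("utf-8")               # the euro sign takes three bytes
pieces = [data[:1], data[1:2], data[2:]]    # split inside the euro sign

decoder = codecs.getincrementaldecoder("utf-8")()
outputs = [decoder.decode(piece) for piece in pieces[:-1]]
outputs.append(decoder.decode(pieces[-1], final=True))

# Stateless decoding of the joined input for comparison.
stateless, consumed = codecs.getdecoder("utf-8")(data)
```

The first two calls return empty strings because the decoder buffers the
incomplete sequence until the third byte arrives.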

.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: 2.5

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


.. method:: IncrementalEncoder.encode(object[, final])

   Encodes *object* (taking the current state of the encoder into account) and
   returns the resulting encoded object. If this is the last call to :meth:`encode`,
   *final* must be true (the default is false).


.. method:: IncrementalEncoder.reset()

   Reset the encoder to the initial state.


.. method:: IncrementalEncoder.getstate()

   Return the current state of the encoder which must be an integer. The
   implementation should make sure that ``0`` is the most common state. (States
   that are more complicated than integers can be converted into an integer by
   marshaling/pickling the state and encoding the bytes of the resulting string
   into an integer).

   .. versionadded:: 3.0


.. method:: IncrementalEncoder.setstate(state)

   Set the state of the encoder to *state*. *state* must be an encoder state
   returned by :meth:`getstate`.

   .. versionadded:: 3.0

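Encoder state is easiest to observe with a codec that emits something only
once, such as the BOM written by the UTF-16 encoder. The sketch assumes
Python 3; the concrete integer returned by :meth:`getstate` is an
implementation detail and is treated as opaque here.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()
first = encoder.encode("a")      # the first chunk carries the BOM
second = encoder.encode("b")     # later chunks do not

saved = encoder.getstate()       # opaque integer: BOM already written
encoder.reset()                  # initial state: BOM pending again
after_reset = encoder.encode("c")

encoder.reset()
encoder.setstate(saved)          # restore the saved state
after_restore = encoder.encode("d")
```

Saving and restoring the state this way allows output to be appended to
existing data without emitting a second BOM.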

.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


.. method:: IncrementalDecoder.decode(object[, final])

   Decodes *object* (taking the current state of the decoder into account) and
   returns the resulting decoded object. If this is the last call to :meth:`decode`,
   *final* must be true (the default is false). If *final* is true the decoder must
   decode the input completely and must flush all buffers. If this isn't possible
   (e.g. because of incomplete byte sequences at the end of the input) it must
   initiate error handling just like in the stateless case (which might raise an
   exception).


.. method:: IncrementalDecoder.reset()

   Reset the decoder to the initial state.


.. method:: IncrementalDecoder.getstate()

   Return the current state of the decoder. This must be a tuple with two items:
   the first must be the buffer containing the still undecoded input. The second
   must be an integer and can be additional state info. (The implementation should
   make sure that ``0`` is the most common additional state info.) If this
   additional state info is ``0`` it must be possible to set the decoder to the
   state which has no input buffered and ``0`` as the additional state info, so
   that feeding the previously buffered input to the decoder returns it to the
   previous state without producing any output. (Additional state info that is more
   complicated than integers can be converted into an integer by
   marshaling/pickling the info and encoding the bytes of the resulting string into
   an integer.)

   .. versionadded:: 3.0


.. method:: IncrementalDecoder.setstate(state)

   Set the state of the decoder to *state*. *state* must be a decoder state
   returned by :meth:`getstate`.

   .. versionadded:: 3.0

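The two-item decoder state can be inspected with the built-in UTF-8
incremental decoder. This is a sketch assuming Python 3; it also shows that
*final* triggers error handling when input ends inside a byte sequence.

```python
import codecs

make_decoder = codecs.getincrementaldecoder("utf-8")

decoder = make_decoder()
partial = decoder.decode(b"\xe2\x82")   # first two bytes of the euro sign
state = decoder.getstate()              # (buffered input, additional state info)

# A fresh decoder picks up exactly where the first one stopped.
clone = make_decoder()
clone.setstate(state)
completed = clone.decode(b"\xac", final=True)

try:
    make_decoder().decode(b"\xe2\x82", final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

Because the buffered bytes are part of the state, the state can be stored and
later handed to another decoder instance of the same codec.
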
The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


.. method:: StreamWriter.write(object)

   Writes the object's contents encoded to the stream.


.. method:: StreamWriter.writelines(list)

   Writes the concatenated list of strings to the stream (possibly by reusing the
   :meth:`write` method).


.. method:: StreamWriter.reset()

   Flushes and resets the codec buffers used for keeping state.

   Calling this method should ensure that the data on the output is put into a
   clean state that allows appending of new fresh data without having to rescan the
   whole stream to recover state.

In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.

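A brief sketch of a stream writer in use, assuming Python 3 and an in-memory
binary stream. It also shows the *errors* attribute being switched on a live
object and a method (``tell``) that is inherited from the underlying stream.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, errors="strict")

writer.write("abc")
writer.writelines(["d", "e"])

writer.errors = "replace"    # switch strategy on the live object
writer.write("é")            # now written as b"?" instead of raising

position = writer.tell()     # not a StreamWriter method; comes from raw
```

With ``errors`` left at ``'strict'``, the last :meth:`write` call would raise
:exc:`UnicodeEncodeError` instead.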

.. _stream-reader-objects:

StreamReader Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream[, errors])

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for reading (binary) data.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


.. method:: StreamReader.read([size[, chars[, firstline]]])

   Decodes data from the stream and returns the resulting object.

   *chars* indicates the number of characters to read from the stream. :meth:`read`
   will never return more than *chars* characters, but it might return fewer if
   there are not enough characters available.

   *size* indicates the approximate maximum number of bytes to read from the stream
   for decoding purposes. The decoder can modify this setting as appropriate. The
   default value -1 indicates to read and decode as much as possible. *size* is
   intended to prevent having to decode huge files in one step.

   *firstline* indicates that it would be sufficient to only return the first line,
   if there are decoding errors on later lines.

   The method should use a greedy read strategy, meaning that it should read as
   much data as is allowed within the definition of the encoding and the given
   size, e.g. if optional encoding endings or state markers are available on the
   stream, these should be read too.

   .. versionchanged:: 2.4
      *chars* argument added.

   .. versionchanged:: 2.4.2
      *firstline* argument added.


.. method:: StreamReader.readline([size[, keepends]])

   Read one line from the input stream and return the decoded data.

   *size*, if given, is passed as size argument to the stream's :meth:`readline`
   method.

   If *keepends* is false, line-endings will be stripped from the lines returned.

   .. versionchanged:: 2.4
      *keepends* argument added.


.. method:: StreamReader.readlines([sizehint[, keepends]])

   Read all lines available on the input stream and return them as a list of lines.

   Line-endings are implemented using the codec's decoder method and are included
   in the list entries if *keepends* is true.

   *sizehint*, if given, is passed as the *size* argument to the stream's
   :meth:`read` method.

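The reading methods can be tried on an in-memory stream. The sketch below
assumes Python 3 and the built-in UTF-8 stream reader; the sample text is
arbitrary.

```python
import codecs
import io

payload = "α\nβγ\n".encode("utf-8")

reader = codecs.getreader("utf-8")(io.BytesIO(payload))
first = reader.readline(keepends=False)   # line ending stripped
rest = reader.readlines()                 # remaining lines, endings kept

other = codecs.getreader("utf-8")(io.BytesIO(payload))
head = other.read(chars=2)                # at most two characters
tail = other.read()                       # everything that is left
```

Note that *chars* counts decoded characters, not bytes: the two characters
returned by the ``read(chars=2)`` call occupy three bytes in the stream.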

.. method:: StreamReader.reset()

   Resets the codec buffers used for keeping state.

   Note that no stream repositioning should take place. This method is primarily
   intended to be able to recover from decoding errors.

In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.


.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

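Constructing a :class:`StreamReaderWriter` from the factories returned by
:func:`lookup` might look like this. The sketch assumes Python 3 and an
in-memory binary stream.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# Reader and Writer are the factories stored on the CodecInfo object.
stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, "strict"
)

stream.write("héllo\n")   # encoded to UTF-8 on the way in
stream.seek(0)            # rewind before reading back
text = stream.read()      # decoded from UTF-8 on the way out
```

The same object serves both directions, which is convenient for files opened
in a read/write mode.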

.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamRecoder` provides a frontend-backend view of encoding data
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the input to :meth:`read` and output
   of :meth:`write`) while *Reader* and *Writer* work on the backend (reading and
   writing to the stream).

   You can use these objects to do transparent direct recodings from e.g. Latin-1
   to UTF-8 and back.

   *stream* must be a file-like object.

   *encode* and *decode* must adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   *encode* and *decode* are needed for the frontend translation, *Reader* and
   *Writer* for the backend translation. The intermediate format used is
   determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
   as the intermediate encoding.

   Error handling is done in the same way as defined for the stream readers and
   writers.

:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

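A direct construction of a :class:`StreamRecoder` could be sketched as
follows, assuming Python 3. Here the frontend uses Latin-1 and the backend
stream holds UTF-8.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend codec
utf8 = codecs.lookup("utf-8")       # backend codec

backend = io.BytesIO("é".encode("utf-8"))
recoder = codecs.StreamRecoder(
    backend,
    latin1.encode, latin1.decode,            # frontend translation
    utf8.streamreader, utf8.streamwriter,    # backend translation
    "strict",
)

frontend_bytes = recoder.read()   # UTF-8 on disk, Latin-1 to the caller
```

The text passes through Unicode as the intermediate format: the backend
reader decodes UTF-8, then the frontend encoder produces Latin-1.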

.. _encodings-overview:

Encodings and Unicode
---------------------

Unicode strings are stored internally as sequences of code points (to be precise
as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
type. Once a Unicode object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue. Transforming a
Unicode object into a sequence of bytes is called encoding and recreating the
Unicode object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the code points 0-255 to
the bytes ``0x0``-``0xff``. This means that a Unicode object that contains
code points above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise a
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
codec can't encode character u'\u1234' in position 3: ordinal not in
range(256)``.

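The Latin-1 limitation is easy to reproduce. This sketch assumes Python 3,
where the method is :meth:`str.encode` and the message shows the character
without the ``u`` prefix.

```python
# U+00E9 lies within the 0-255 range and maps to a single byte.
ok = "caf\u00e9".encode("latin-1")

message = None
position = None
try:
    "abc\u1234".encode("latin-1")   # U+1234 is above U+00FF
except UnicodeEncodeError as exc:
    message = str(exc)
    position = exc.start            # index of the offending character
```

The exception object carries the position as an attribute, so callers do not
need to parse the message text.
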
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and define how these code points
are mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

803All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
804defined in unicode. A simple and straightforward way that can store each Unicode
805code point, is to store each codepoint as two consecutive bytes. There are two
806possibilities: Store the bytes in big endian or in little endian order. These
807two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
808disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
809will always have to swap bytes on encoding and decoding. UTF-16 avoids this
810problem: Bytes will always be in natural endianness. When these bytes are read
811by a CPU with a different endianness, then bytes have to be swapped though. To
812be able to detect the endianness of a UTF-16 byte sequence, there's the so
813called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
814This character will be prepended to every UTF-16 byte sequence. The byte swapped
815version of this character (``0xFFFE``) is an illegal character that may not
816appear in a Unicode text. So when the first character in an UTF-16 byte sequence
817appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
818Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
819a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
820a word to be split. It can e.g. be used to give hints to a ligature algorithm.
821With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
822deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
823Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
824it's a device to determine the storage layout of the encoded bytes, and vanishes
825once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
826NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
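
The byte order handling can be observed directly. A short sketch, assuming
Python 3 and the BOM constants exported by :mod:`codecs`; the variable names
are illustrative.

```python
import codecs

text = "A"  # U+0041

big = text.encode("utf-16-be")     # high byte first
little = text.encode("utf-16-le")  # low byte first
with_bom = text.encode("utf-16")   # BOM followed by native byte order

# The generic codec always starts its output with one of the two BOMs.
print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# On decoding, the BOM selects the byte order and then vanishes.
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# With an explicit byte order, U+FEFF is decoded like any other character.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")
```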

There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
+-----------------------------------+----------------------------------------------+
| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
|                                   | 10xxxxxx                                     |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
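
The table can be turned into a few lines of bit arithmetic. The helper
``utf8_encode_codepoint`` below is a hypothetical name used for illustration
only. It covers the first four rows, restricted to code points up to
``U+10FFFF``, does not treat surrogate code points specially, and is compared
against the built-in codec, assuming Python 3.

```python
def utf8_encode_codepoint(cp):
    """Encode one code point by hand, following the bit layout table."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


# One sample per table row; each is compared with the built-in codec.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
results = [utf8_encode_codepoint(cp) == chr(cp).encode("utf-8")
           for cp in samples]
print(results)
```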

As UTF-8 is an 8-bit encoding, no BOM is required and any ``U+FEFF`` character in
the decoded Unicode string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a Unicode string. Each charmap encoding can
decode any random byte sequence. However, that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
``"utf-8-sig"``) for its Notepad program: before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the utf-8-sig codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, utf-8-sig will skip those three bytes if they appear as the first three
bytes in the file.
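
The effect of the signature can be sketched as follows, assuming Python 3 and
the ``BOM_UTF8`` constant from :mod:`codecs`; the sample text is illustrative.

```python
import codecs

data = "hello".encode("utf-8-sig")
print(data[:3] == codecs.BOM_UTF8)  # signature bytes 0xef 0xbb 0xbf come first

with_sig = data.decode("utf-8-sig")    # the signature is skipped
plain = data.decode("utf-8")           # plain UTF-8 keeps U+FEFF as a character
no_sig = b"hello".decode("utf-8-sig")  # the signature is optional on input

# Read as iso-8859-1, the signature shows up as three visible characters.
as_latin1 = data[:3].decode("iso-8859-1")
```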


.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13                    | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman                       | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+
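
Alias handling can be checked with :func:`codecs.lookup`. This sketch, assuming
Python 3, looks up several spellings from the table and collects the canonical
name reported by each resulting ``CodecInfo`` object. The exact canonical
strings are an implementation detail, so only their equality is compared.

```python
import codecs

latin_spellings = ["latin_1", "Latin-1", "ISO-8859-1", "L1"]
utf8_spellings = ["utf_8", "UTF-8", "U8", "utf8"]

# Spellings that differ only in case, hyphen/underscore, or alias
# resolve to the same codec.
latin_names = {codecs.lookup(name).name for name in latin_spellings}
utf8_names = {codecs.lookup(name).name for name in utf8_spellings}

print(latin_names, utf8_names)
```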

A number of codecs are specific to Python, so their codec names have no meaning
outside Python. Some of them don't convert from Unicode strings to byte strings,
but instead use the property of the Python codecs machinery that any bijective
function with one argument can be considered as an encoding.

For the codecs listed below, the result in the "encoding" direction is always a
byte string. The result of the "decoding" direction is listed as operand type in
the table.

+--------------------+---------+----------------+---------------------------+
| Codec              | Aliases | Operand type   | Purpose                   |
+====================+=========+================+===========================+
| idna               |         | Unicode string | Implements :rfc:`3490`,   |
|                    |         |                | see also                  |
|                    |         |                | :mod:`encodings.idna`     |
+--------------------+---------+----------------+---------------------------+
| mbcs               | dbcs    | Unicode string | Windows only: Encode      |
|                    |         |                | operand according to the  |
|                    |         |                | ANSI codepage (CP_ACP)    |
+--------------------+---------+----------------+---------------------------+
| palmos             |         | Unicode string | Encoding of PalmOS 3.5    |
+--------------------+---------+----------------+---------------------------+
| punycode           |         | Unicode string | Implements :rfc:`3492`    |
+--------------------+---------+----------------+---------------------------+
| raw_unicode_escape |         | Unicode string | Produce a string that is  |
|                    |         |                | suitable as raw Unicode   |
|                    |         |                | literal in Python source  |
|                    |         |                | code                      |
+--------------------+---------+----------------+---------------------------+
| undefined          |         | any            | Raise an exception for    |
|                    |         |                | all conversions. Can be   |
|                    |         |                | used as the system        |
|                    |         |                | encoding if no automatic  |
|                    |         |                | coercion between byte and |
|                    |         |                | Unicode strings is        |
|                    |         |                | desired.                  |
+--------------------+---------+----------------+---------------------------+
| unicode_escape     |         | Unicode string | Produce a string that is  |
|                    |         |                | suitable as Unicode       |
|                    |         |                | literal in Python source  |
|                    |         |                | code                      |
+--------------------+---------+----------------+---------------------------+
| unicode_internal   |         | Unicode string | Return the internal       |
|                    |         |                | representation of the     |
|                    |         |                | operand                   |
+--------------------+---------+----------------+---------------------------+

.. versionadded:: 2.3
   The ``idna`` and ``punycode`` encodings.
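
A short sketch of three of these codecs, assuming Python 3, where the encoding
direction yields ``bytes``. The sample strings are illustrative.

```python
text = "Zo\u00eb \u2603"

escaped = text.encode("unicode_escape")  # all non-ASCII becomes backslash escapes
raw = text.encode("raw_unicode_escape")  # only characters above U+00FF are escaped
puny = "b\u00fccher".encode("punycode")  # the RFC 3492 encoding of one string

print(escaped, raw, puny)

# Decoding with the same codec restores the original text.
restored = escaped.decode("unicode_escape")
```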


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

.. versionadded:: 2.3

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode before
presenting them to the user.

Python supports this conversion in several ways: the ``idna`` codec converts
between Unicode and ACE. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
(:mod:`httplib` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.
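
Both directions can be sketched with the ``idna`` codec, assuming Python 3. The
host name ``www.bücher.example`` is an illustrative value, not one taken from
the text above.

```python
host = "www.b\u00fccher.example"

ace = host.encode("idna")  # Unicode labels -> ASCII-compatible encoding
back = ace.decode("idna")  # ACE labels -> Unicode, e.g. before display

print(ace, back)
```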

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
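
These functions operate on a single label rather than on a full host name. A
minimal sketch, assuming Python 3 (where ``ToASCII`` returns ``bytes``) and an
illustrative label:

```python
from encodings import idna

prepared = idna.nameprep("B\u00dcCHER")      # case-folded and normalized label
ascii_label = idna.ToASCII("b\u00fccher")    # ACE form, including the xn-- prefix
unicode_label = idna.ToUnicode(ascii_label)  # back to the Unicode label

print(prepared, ascii_label, unicode_label)
```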


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

.. versionadded:: 2.5

This module implements a variant of the UTF-8 codec: on encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data will be skipped.
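
The once-only behaviour of the stateful encoder can be sketched with the
incremental codec classes, assuming Python 3. The chunk contents are
illustrative.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()
first = encoder.encode("ab")   # the first chunk carries the BOM
second = encoder.encode("cd")  # later chunks do not

decoder = codecs.getincrementaldecoder("utf-8-sig")()
text = decoder.decode(first) + decoder.decode(second, final=True)

print(first, second, text)
```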