| Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 1 |  | 
|  | 2 | :mod:`codecs` --- Codec registry and base classes | 
|  | 3 | ================================================= | 
|  | 4 |  | 
|  | 5 | .. module:: codecs | 
|  | 6 | :synopsis: Encode and decode data and streams. | 
|  | 7 | .. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com> | 
|  | 8 | .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com> | 
|  | 9 | .. sectionauthor:: Martin v. Löwis <martin@v.loewis.de> | 
|  | 10 |  | 
|  | 11 |  | 
|  | 12 | .. index:: | 
|  | 13 | single: Unicode | 
|  | 14 | single: Codecs | 
|  | 15 | pair: Codecs; encode | 
|  | 16 | pair: Codecs; decode | 
|  | 17 | single: streams | 
|  | 18 | pair: stackable; streams | 
|  | 19 |  | 
|  | 20 | This module defines base classes for standard Python codecs (encoders and | 
|  | 21 | decoders) and provides access to the internal Python codec registry which | 
|  | 22 | manages the codec and error handling lookup process. | 
|  | 23 |  | 
|  | 24 | It defines the following functions: | 
|  | 25 |  | 
|  | 26 |  | 
|  | 27 | .. function:: register(search_function) | 
|  | 28 |  | 
|  | 29 | Register a codec search function. Search functions are expected to take one | 
|  | 30 | argument, the encoding name in all lower case letters, and return a | 
|  | 31 | :class:`CodecInfo` object having the following attributes: | 
|  | 32 |  | 
|  | 33 | * ``name`` The name of the encoding; | 
|  | 34 |  | 
|  | 35 | * ``encoder`` The stateless encoding function; | 
|  | 36 |  | 
|  | 37 | * ``decoder`` The stateless decoding function; | 
|  | 38 |  | 
|  | 39 | * ``incrementalencoder`` An incremental encoder class or factory function; | 
|  | 40 |  | 
|  | 41 | * ``incrementaldecoder`` An incremental decoder class or factory function; | 
|  | 42 |  | 
|  | 43 | * ``streamwriter`` A stream writer class or factory function; | 
|  | 44 |  | 
|  | 45 | * ``streamreader`` A stream reader class or factory function. | 
|  | 46 |  | 
|  | 47 | The various functions or classes take the following arguments: | 
|  | 48 |  | 
|  | 49 | *encoder* and *decoder*: These must be functions or methods which have the same | 
|  | 50 | interface as the :meth:`encode`/:meth:`decode` methods of :class:`Codec` instances (see | 
|  | 51 | :ref:`codec-objects`). The functions/methods are expected to work in a stateless | 
|  | 52 | mode. | 
|  | 53 |  | 
|  | 54 | *incrementalencoder* and *incrementaldecoder*: These have to be factory | 
|  | 55 | functions providing the following interface: | 
|  | 56 |  | 
|  | 57 | ``factory(errors='strict')`` | 
|  | 58 |  | 
|  | 59 | The factory functions must return objects providing the interfaces defined by | 
|  | 60 | the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`, | 
|  | 61 | respectively. Incremental codecs can maintain state. | 
|  | 62 |  | 
|  | 63 | *streamreader* and *streamwriter*: These have to be factory functions providing | 
|  | 64 | the following interface: | 
|  | 65 |  | 
|  | 66 | ``factory(stream, errors='strict')`` | 
|  | 67 |  | 
|  | 68 | The factory functions must return objects providing the interfaces defined by | 
|  | 69 | the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively. | 
|  | 70 | Stream codecs can maintain state. | 
|  | 71 |  | 
|  | 72 | Possible values for *errors* are ``'strict'`` (raise an exception in case of an | 
|  | 73 | encoding error), ``'replace'`` (replace malformed data with a suitable | 
|  | 74 | replacement marker, such as ``'?'``), ``'ignore'`` (ignore malformed data and | 
|  | 75 | continue without further notice), ``'xmlcharrefreplace'`` (replace with the | 
|  | 76 | appropriate XML character reference (for encoding only)) and | 
|  | 77 | ``'backslashreplace'`` (replace with backslashed escape sequences (for encoding | 
|  | 78 | only)) as well as any other error handling name defined via | 
|  | 79 | :func:`register_error`. | 
|  | 80 |  | 
|  | 81 | In case a search function cannot find a given encoding, it should return | 
|  | 82 | ``None``. | 
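The search-function contract described above can be sketched as follows, assuming a current Python 3 interpreter. The names ``demo_search`` and ``demo_utf8_alias`` are hypothetical and exist only for this illustration; the function simply hands back the existing UTF-8 :class:`CodecInfo` and returns ``None`` for every other name.

```python
import codecs


def demo_search(name):
    """Hypothetical search function: alias one made-up name to UTF-8."""
    # The registry passes the encoding name in lower case.
    if name == "demo_utf8_alias":
        return codecs.lookup("utf-8")
    # Unknown encodings are signalled by returning None.
    return None


codecs.register(demo_search)

# The alias is now usable wherever an encoding name is accepted.
encoded = "h\u00e9llo".encode("demo_utf8_alias")
```

Note that registration is process-wide, so a real search function should only claim names it actually owns.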
|  | 83 |  | 
|  | 84 |  | 
|  | 85 | .. function:: lookup(encoding) | 
|  | 86 |  | 
|  | 87 | Looks up the codec info in the Python codec registry and returns a | 
|  | 88 | :class:`CodecInfo` object as defined above. | 
|  | 89 |  | 
|  | 90 | Encodings are first looked up in the registry's cache. If not found, the list of | 
|  | 91 | registered search functions is scanned. If no :class:`CodecInfo` object is | 
|  | 92 | found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object | 
|  | 93 | is stored in the cache and returned to the caller. | 
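As a small illustration of the lookup behaviour, assuming a current Python 3 interpreter (the missing codec name below is made up for the example):

```python
import codecs

# The name is normalized before the search functions are consulted.
info = codecs.lookup("UTF-8")
canonical_name = info.name

# The CodecInfo object exposes the stateless encoder and decoder.
encoded, consumed = info.encode("abc")

try:
    codecs.lookup("no-such-codec-for-demo")
except LookupError:
    missing = True
else:
    missing = False
```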
|  | 94 |  | 
|  | 95 | To simplify access to the various codecs, the module provides these additional | 
|  | 96 | functions which use :func:`lookup` for the codec lookup: | 
|  | 97 |  | 
|  | 98 |  | 
|  | 99 | .. function:: getencoder(encoding) | 
|  | 100 |  | 
|  | 101 | Look up the codec for the given encoding and return its encoder function. | 
|  | 102 |  | 
|  | 103 | Raises a :exc:`LookupError` in case the encoding cannot be found. | 
|  | 104 |  | 
|  | 105 |  | 
|  | 106 | .. function:: getdecoder(encoding) | 
|  | 107 |  | 
|  | 108 | Look up the codec for the given encoding and return its decoder function. | 
|  | 109 |  | 
|  | 110 | Raises a :exc:`LookupError` in case the encoding cannot be found. | 
|  | 111 |  | 
|  | 112 |  | 
|  | 113 | .. function:: getincrementalencoder(encoding) | 
|  | 114 |  | 
|  | 115 | Look up the codec for the given encoding and return its incremental encoder | 
|  | 116 | class or factory function. | 
|  | 117 |  | 
|  | 118 | Raises a :exc:`LookupError` in case the encoding cannot be found or the codec | 
|  | 119 | doesn't support an incremental encoder. | 
|  | 120 |  | 
|  | 121 | .. versionadded:: 2.5 | 
|  | 122 |  | 
|  | 123 |  | 
|  | 124 | .. function:: getincrementaldecoder(encoding) | 
|  | 125 |  | 
|  | 126 | Look up the codec for the given encoding and return its incremental decoder | 
|  | 127 | class or factory function. | 
|  | 128 |  | 
|  | 129 | Raises a :exc:`LookupError` in case the encoding cannot be found or the codec | 
|  | 130 | doesn't support an incremental decoder. | 
|  | 131 |  | 
|  | 132 | .. versionadded:: 2.5 | 
|  | 133 |  | 
|  | 134 |  | 
|  | 135 | .. function:: getreader(encoding) | 
|  | 136 |  | 
|  | 137 | Look up the codec for the given encoding and return its StreamReader class or | 
|  | 138 | factory function. | 
|  | 139 |  | 
|  | 140 | Raises a :exc:`LookupError` in case the encoding cannot be found. | 
|  | 141 |  | 
|  | 142 |  | 
|  | 143 | .. function:: getwriter(encoding) | 
|  | 144 |  | 
|  | 145 | Look up the codec for the given encoding and return its StreamWriter class or | 
|  | 146 | factory function. | 
|  | 147 |  | 
|  | 148 | Raises a :exc:`LookupError` in case the encoding cannot be found. | 
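The convenience functions can be combined as in the following sketch, which assumes a current Python 3 interpreter (where encoded data is ``bytes``) and uses an in-memory ``io.BytesIO`` object in place of a real file:

```python
import codecs
import io

# Stateless functions return (output, length consumed) tuples.
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")
data, chars_consumed = encode("\u00e9")
text, bytes_consumed = decode(data)

# getwriter()/getreader() return factories that wrap a binary stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("\u00e9")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```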
|  | 149 |  | 
|  | 150 |  | 
|  | 151 | .. function:: register_error(name, error_handler) | 
|  | 152 |  | 
|  | 153 | Register the error handling function *error_handler* under the name *name*. | 
|  | 154 | *error_handler* will be called during encoding and decoding in case of an error, | 
|  | 155 | when *name* is specified as the errors parameter. | 
|  | 156 |  | 
|  | 157 | For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError` | 
|  | 158 | instance, which contains information about the location of the error. The error | 
|  | 159 | handler must either raise this or a different exception or return a tuple with a | 
|  | 160 | replacement for the unencodable part of the input and a position where encoding | 
|  | 161 | should continue. The encoder will encode the replacement and continue encoding | 
|  | 162 | the original input at the specified position. Negative position values will be | 
|  | 163 | treated as being relative to the end of the input string. If the resulting | 
|  | 164 | position is out of bounds, an :exc:`IndexError` will be raised. | 
|  | 165 |  | 
|  | 166 | Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or | 
|  | 167 | :exc:`UnicodeTranslateError` will be passed to the handler and the | 
|  | 168 | replacement from the error handler will be put into the output directly. | 
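A minimal sketch of a custom handler, assuming a current Python 3 interpreter. The handler name ``demo_underscore`` and the function ``underscore_errors`` are hypothetical; the handler only deals with encoding errors and re-raises anything else.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.start and exc.end delimit the unencodable slice of exc.object.
        replacement = "_" * (exc.end - exc.start)
        # Return the replacement and the position where encoding resumes.
        return replacement, exc.end
    raise exc


codecs.register_error("demo_underscore", underscore_errors)

result = "na\u00efve caf\u00e9".encode("ascii", errors="demo_underscore")
```

Returning ``exc.end`` as the resume position skips exactly the offending characters; a different position would re-encode or drop parts of the input.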
|  | 169 |  | 
|  | 170 |  | 
|  | 171 | .. function:: lookup_error(name) | 
|  | 172 |  | 
|  | 173 | Return the error handler previously registered under the name *name*. | 
|  | 174 |  | 
|  | 175 | Raises a :exc:`LookupError` in case the handler cannot be found. | 
|  | 176 |  | 
|  | 177 |  | 
|  | 178 | .. function:: strict_errors(exception) | 
|  | 179 |  | 
|  | 180 | Implements the ``strict`` error handling. | 
|  | 181 |  | 
|  | 182 |  | 
|  | 183 | .. function:: replace_errors(exception) | 
|  | 184 |  | 
|  | 185 | Implements the ``replace`` error handling. | 
|  | 186 |  | 
|  | 187 |  | 
|  | 188 | .. function:: ignore_errors(exception) | 
|  | 189 |  | 
|  | 190 | Implements the ``ignore`` error handling. | 
|  | 191 |  | 
|  | 192 |  | 
|  | 193 | .. function:: xmlcharrefreplace_errors(exception) | 
|  | 194 |  | 
|  | 195 | Implements the ``xmlcharrefreplace`` error handling. | 
|  | 196 |  | 
|  | 197 |  | 
|  | 198 | .. function:: backslashreplace_errors(exception) | 
|  | 199 |  | 
|  | 200 | Implements the ``backslashreplace`` error handling. | 
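The built-in handlers are ordinary callables and can be invoked directly. The following sketch assumes a current Python 3 interpreter and builds a :exc:`UnicodeEncodeError` by hand purely for demonstration; each handler returns a ``(replacement, resume position)`` tuple, except ``strict_errors``, which raises the exception it was given.

```python
import codecs

# A hand-made error describing the single unencodable character at index 0.
err = UnicodeEncodeError("ascii", "\u00e9", 0, 1, "demo error")

replace_result = codecs.replace_errors(err)
ignore_result = codecs.ignore_errors(err)
xml_result = codecs.xmlcharrefreplace_errors(err)
backslash_result = codecs.backslashreplace_errors(err)

try:
    codecs.strict_errors(err)
except UnicodeEncodeError:
    strict_raised = True
else:
    strict_raised = False

# lookup_error() returns the handler registered under a given name.
looked_up_result = codecs.lookup_error("replace")(err)
```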
|  | 201 |  | 
|  | 202 | To simplify working with encoded files or streams, the module also defines these | 
|  | 203 | utility functions: | 
|  | 204 |  | 
|  | 205 |  | 
|  | 206 | .. function:: open(filename[, mode[, encoding[, errors[, buffering]]]]) | 
|  | 207 |  | 
|  | 208 | Open an encoded file using the given *mode* and return a wrapped version | 
|  | 209 | providing transparent encoding/decoding.  The default file mode is ``'r'`` | 
|  | 210 | meaning to open the file in read mode. | 
|  | 211 |  | 
|  | 212 | .. note:: | 
|  | 213 |  | 
|  | 214 | The wrapped version will only accept the object format defined by the codecs, | 
|  | 215 | i.e. Unicode objects for most built-in codecs.  Output is also codec-dependent | 
|  | 216 | and will usually be Unicode as well. | 
|  | 217 |  | 
|  | 218 | .. note:: | 
|  | 219 |  | 
|  | 220 | Files are always opened in binary mode, even if no binary mode was | 
|  | 221 | specified.  This is done to avoid data loss due to encodings using 8-bit | 
|  | 222 | values.  This means that no automatic conversion of ``'\n'`` is done | 
|  | 223 | on reading and writing. | 
|  | 224 |  | 
|  | 225 | *encoding* specifies the encoding which is to be used for the file. | 
|  | 226 |  | 
|  | 227 | *errors* may be given to define the error handling. It defaults to ``'strict'`` | 
|  | 228 | which causes a :exc:`ValueError` to be raised in case an encoding error occurs. | 
|  | 229 |  | 
|  | 230 | *buffering* has the same meaning as for the built-in :func:`open` function.  It | 
|  | 231 | defaults to line buffered. | 
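A short sketch of :func:`codecs.open`, assuming a current Python 3 interpreter and a throwaway temporary directory (the file name ``demo.txt`` is made up). It also shows the effect of the binary-mode note: the newline is stored as a single ``'\n'`` byte, with no platform-specific conversion.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "demo.txt")

    # Text written through the wrapper is encoded on the way out.
    with codecs.open(path, "w", encoding="utf-8") as handle:
        handle.write("z\u00e4h\n")

    # Reading through the wrapper decodes the stored bytes again.
    with codecs.open(path, "r", encoding="utf-8") as handle:
        text = handle.read()

    # Inspect the raw bytes to confirm the encoding and the untouched newline.
    with open(path, "rb") as raw:
        raw_bytes = raw.read()
```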
|  | 232 |  | 
|  | 233 |  | 
|  | 234 | .. function:: EncodedFile(file, input[, output[, errors]]) | 
|  | 235 |  | 
|  | 236 | Return a wrapped version of file which provides transparent encoding | 
|  | 237 | translation. | 
|  | 238 |  | 
|  | 239 | Strings written to the wrapped file are interpreted according to the given | 
|  | 240 | *input* encoding and then written to the original file as strings using the | 
|  | 241 | *output* encoding. The intermediate encoding will usually be Unicode but depends | 
|  | 242 | on the specified codecs. | 
|  | 243 |  | 
|  | 244 | If *output* is not given, it defaults to *input*. | 
|  | 245 |  | 
|  | 246 | *errors* may be given to define the error handling. It defaults to ``'strict'``, | 
|  | 247 | which causes :exc:`ValueError` to be raised in case an encoding error occurs. | 
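The translation performed by :func:`EncodedFile` can be sketched as follows, assuming a current Python 3 interpreter and in-memory ``io.BytesIO`` objects standing in for real files. The arguments are passed positionally, with Latin-1 as the *input* encoding and UTF-8 as the *output* encoding.

```python
import codecs
import io

# Data written to the wrapper is Latin-1; the underlying file receives UTF-8.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write("\u00e9".encode("latin-1"))
stored = raw.getvalue()

# Reading works the other way round: UTF-8 in the file, Latin-1 to the caller.
source = io.BytesIO("\u00e9".encode("utf-8"))
read_back = codecs.EncodedFile(source, "latin-1", "utf-8").read()
```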
|  | 248 |  | 
|  | 249 |  | 
|  | 250 | .. function:: iterencode(iterable, encoding[, errors]) | 
|  | 251 |  | 
|  | 252 | Uses an incremental encoder to iteratively encode the input provided by | 
|  | 253 | *iterable*. This function is a :term:`generator`.  *errors* (as well as any | 
|  | 254 | other keyword argument) is passed through to the incremental encoder. | 
|  | 255 |  | 
|  | 256 | .. versionadded:: 2.5 | 
|  | 257 |  | 
|  | 258 |  | 
|  | 259 | .. function:: iterdecode(iterable, encoding[, errors]) | 
|  | 260 |  | 
|  | 261 | Uses an incremental decoder to iteratively decode the input provided by | 
|  | 262 | *iterable*. This function is a :term:`generator`.  *errors* (as well as any | 
|  | 263 | other keyword argument) is passed through to the incremental decoder. | 
|  | 264 |  | 
|  | 265 | .. versionadded:: 2.5 | 
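The following sketch, assuming a current Python 3 interpreter, shows why the incremental machinery matters: the three-byte UTF-8 sequence for the euro sign is split across two chunks, which a chunk-by-chunk stateless decode could not handle.

```python
import codecs

# The euro sign (b'\xe2\x82\xac') is deliberately split across two chunks.
chunks = [b"\xe2\x82", b"\xac 5"]
text = "".join(codecs.iterdecode(chunks, "utf-8"))

# iterencode() is the mirror image and yields encoded pieces lazily.
encoded = b"".join(codecs.iterencode(["a", "\u00e9"], "utf-8"))
```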
|  | 266 |  | 
|  | 267 | The module also provides the following constants which are useful for reading | 
|  | 268 | and writing to platform-dependent files: | 
|  | 269 |  | 
|  | 270 |  | 
|  | 271 | .. data:: BOM | 
|  | 272 | BOM_BE | 
|  | 273 | BOM_LE | 
|  | 274 | BOM_UTF8 | 
|  | 275 | BOM_UTF16 | 
|  | 276 | BOM_UTF16_BE | 
|  | 277 | BOM_UTF16_LE | 
|  | 278 | BOM_UTF32 | 
|  | 279 | BOM_UTF32_BE | 
|  | 280 | BOM_UTF32_LE | 
|  | 281 |  | 
|  | 282 | These constants define various encodings of the Unicode byte order mark (BOM) | 
|  | 283 | used in UTF-16 and UTF-32 data streams to indicate the byte order used in the | 
|  | 284 | stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either | 
|  | 285 | :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's | 
|  | 286 | native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`, | 
|  | 287 | :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for | 
|  | 288 | :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32 | 
|  | 289 | encodings. | 
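As an illustration of how the constants relate to each other, assuming a current Python 3 interpreter (where they are ``bytes`` objects). The helper ``strip_utf8_signature`` is hypothetical and not part of the module.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE


def strip_utf8_signature(data):
    """Hypothetical helper: drop a leading UTF-8 signature, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_signature(codecs.BOM_UTF8 + b"hello")
```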
|  | 290 |  | 
|  | 291 |  | 
|  | 292 | .. _codec-base-classes: | 
|  | 293 |  | 
|  | 294 | Codec Base Classes | 
|  | 295 | ------------------ | 
|  | 296 |  | 
|  | 297 | The :mod:`codecs` module defines a set of base classes which define the | 
|  | 298 | interface and can also be used to easily write your own codecs for use in Python. | 
|  | 299 |  | 
|  | 300 | Each codec has to define four interfaces to make it usable as a codec in Python: | 
|  | 301 | stateless encoder, stateless decoder, stream reader and stream writer. The | 
|  | 302 | stream readers and writers typically reuse the stateless encoder/decoder to | 
|  | 303 | implement the file protocols. | 
|  | 304 |  | 
|  | 305 | The :class:`Codec` class defines the interface for stateless encoders/decoders. | 
|  | 306 |  | 
|  | 307 | To simplify and standardize error handling, the :meth:`encode` and | 
|  | 308 | :meth:`decode` methods may implement different error handling schemes by | 
|  | 309 | providing the *errors* string argument.  The following string values are defined | 
|  | 310 | and implemented by all standard Python codecs: | 
|  | 311 |  | 
|  | 312 | +-------------------------+-----------------------------------------------+ | 
|  | 313 | | Value                   | Meaning                                       | | 
|  | 314 | +=========================+===============================================+ | 
|  | 315 | | ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    | | 
|  | 316 | |                         | this is the default.                          | | 
|  | 317 | +-------------------------+-----------------------------------------------+ | 
|  | 318 | | ``'ignore'``            | Ignore the character and continue with the    | | 
|  | 319 | |                         | next.                                         | | 
|  | 320 | +-------------------------+-----------------------------------------------+ | 
|  | 321 | | ``'replace'``           | Replace with a suitable replacement           | | 
|  | 322 | |                         | character; Python will use the official       | | 
|  | 323 | |                         | U+FFFD REPLACEMENT CHARACTER for the built-in | | 
|  | 324 | |                         | Unicode codecs on decoding and '?' on         | | 
|  | 325 | |                         | encoding.                                     | | 
|  | 326 | +-------------------------+-----------------------------------------------+ | 
|  | 327 | | ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    | | 
|  | 328 | |                         | reference (only for encoding).                | | 
|  | 329 | +-------------------------+-----------------------------------------------+ | 
|  | 330 | | ``'backslashreplace'``  | Replace with backslashed escape sequences     | | 
|  | 331 | |                         | (only for encoding).                          | | 
|  | 332 | +-------------------------+-----------------------------------------------+ | 
|  | 333 |  | 
|  | 334 | The set of allowed values can be extended via :func:`register_error`. | 
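The effect of each value in the table can be observed with the following sketch, which assumes a current Python 3 interpreter and uses the module-level ``codecs.encode`` and ``codecs.decode`` helpers available there.

```python
import codecs

undecodable = b"abc\xff"        # 0xff is never valid in UTF-8
unencodable = "caf\u00e9"       # the last character has no ASCII encoding

decoded_replace = codecs.decode(undecodable, "utf-8", "replace")
decoded_ignore = codecs.decode(undecodable, "utf-8", "ignore")

encoded_replace = codecs.encode(unencodable, "ascii", "replace")
encoded_xml = codecs.encode(unencodable, "ascii", "xmlcharrefreplace")
encoded_backslash = codecs.encode(unencodable, "ascii", "backslashreplace")

try:
    codecs.encode(unencodable, "ascii", "strict")
except UnicodeError:
    strict_raised = True
else:
    strict_raised = False
```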
|  | 335 |  | 
|  | 336 |  | 
|  | 337 | .. _codec-objects: | 
|  | 338 |  | 
|  | 339 | Codec Objects | 
|  | 340 | ^^^^^^^^^^^^^ | 
|  | 341 |  | 
|  | 342 | The :class:`Codec` class defines these methods which also define the function | 
|  | 343 | interfaces of the stateless encoder and decoder: | 
|  | 344 |  | 
|  | 345 |  | 
|  | 346 | .. method:: Codec.encode(input[, errors]) | 
|  | 347 |  | 
|  | 348 | Encodes the object *input* and returns a tuple (output object, length consumed). | 
|  | 349 | While codecs are not restricted to use with Unicode, in a Unicode context, | 
|  | 350 | encoding converts a Unicode object to a plain string using a particular | 
|  | 351 | character set encoding (e.g., ``cp1252`` or ``iso-8859-1``). | 
|  | 352 |  | 
|  | 353 | *errors* defines the error handling to apply. It defaults to ``'strict'`` | 
|  | 354 | handling. | 
|  | 355 |  | 
|  | 356 | The method may not store state in the :class:`Codec` instance. Use | 
|  | 357 | :class:`StreamWriter` for codecs which have to keep state in order to make | 
|  | 358 | encoding efficient. | 
|  | 359 |  | 
|  | 360 | The encoder must be able to handle zero length input and return an empty object | 
|  | 361 | of the output object type in this situation. | 
|  | 362 |  | 
|  | 363 |  | 
|  | 364 | .. method:: Codec.decode(input[, errors]) | 
|  | 365 |  | 
|  | 366 | Decodes the object *input* and returns a tuple (output object, length consumed). | 
|  | 367 | In a Unicode context, decoding converts a plain string encoded using a | 
|  | 368 | particular character set encoding to a Unicode object. | 
|  | 369 |  | 
|  | 370 | *input* must be an object which provides the ``bf_getreadbuf`` buffer slot. | 
|  | 371 | Python strings, buffer objects and memory mapped files are examples of objects | 
|  | 372 | providing this slot. | 
|  | 373 |  | 
|  | 374 | *errors* defines the error handling to apply. It defaults to ``'strict'`` | 
|  | 375 | handling. | 
|  | 376 |  | 
|  | 377 | The method may not store state in the :class:`Codec` instance. Use | 
|  | 378 | :class:`StreamReader` for codecs which have to keep state in order to make | 
|  | 379 | decoding efficient. | 
|  | 380 |  | 
|  | 381 | The decoder must be able to handle zero length input and return an empty object | 
|  | 382 | of the output object type in this situation. | 
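The stateless interface can be exercised through the functions stored on a :class:`CodecInfo` object. This sketch assumes a current Python 3 interpreter, where the "plain string" side of a character set codec is a ``bytes`` object.

```python
import codecs

info = codecs.lookup("cp1252")

# Both directions return an (output object, length consumed) tuple.
encoded, chars_consumed = info.encode("\u20acuro")
decoded, bytes_consumed = info.decode(encoded)

# Zero-length input yields an empty object of the output type.
empty_encoded = info.encode("")
empty_decoded = info.decode(b"")
```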
|  | 383 |  | 
|  | 384 | The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide | 
|  | 385 | the basic interface for incremental encoding and decoding. Encoding/decoding the | 
|  | 386 | input isn't done with one call to the stateless encoder/decoder function, but | 
|  | 387 | with multiple calls to the :meth:`encode`/:meth:`decode` method of the | 
|  | 388 | incremental encoder/decoder. The incremental encoder/decoder keeps track of the | 
|  | 389 | encoding/decoding process between method calls. | 
|  | 390 |  | 
|  | 391 | The joined output of calls to the :meth:`encode`/:meth:`decode` method is the | 
|  | 392 | same as if all the single inputs were joined into one, and this input was | 
|  | 393 | encoded/decoded with the stateless encoder/decoder. | 
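The equivalence stated above can be checked with a short sketch, assuming a current Python 3 interpreter. UTF-16 is used because its encoder is stateful: the byte order mark must be emitted exactly once, at the start.

```python
import codecs

text = "\u03c0\u22483.14\u20ac"

# Feed the text one character at a time to an incremental encoder.
encoder = codecs.getincrementalencoder("utf-16")()
pieces = [encoder.encode(char) for char in text]
pieces.append(encoder.encode("", final=True))
joined = b"".join(pieces)

# Encode the same text in one stateless call for comparison.
one_shot = text.encode("utf-16")
```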
|  | 394 |  | 
|  | 395 |  | 
|  | 396 | .. _incremental-encoder-objects: | 
|  | 397 |  | 
|  | 398 | IncrementalEncoder Objects | 
|  | 399 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  | 400 |  | 
|  | 401 | .. versionadded:: 2.5 | 
|  | 402 |  | 
|  | 403 | The :class:`IncrementalEncoder` class is used for encoding an input in multiple | 
|  | 404 | steps. It defines the following methods which every incremental encoder must | 
|  | 405 | define in order to be compatible with the Python codec registry. | 
|  | 406 |  | 
|  | 407 |  | 
|  | 408 | .. class:: IncrementalEncoder([errors]) | 
|  | 409 |  | 
|  | 410 | Constructor for an :class:`IncrementalEncoder` instance. | 
|  | 411 |  | 
|  | 412 | All incremental encoders must provide this constructor interface. They are free | 
|  | 413 | to add additional keyword arguments, but only the ones defined here are used by | 
|  | 414 | the Python codec registry. | 
|  | 415 |  | 
|  | 416 | The :class:`IncrementalEncoder` may implement different error handling schemes | 
|  | 417 | by providing the *errors* keyword argument. These parameters are predefined: | 
|  | 418 |  | 
|  | 419 | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
|  | 420 |  | 
|  | 421 | * ``'ignore'`` Ignore the character and continue with the next. | 
|  | 422 |  | 
|  | 423 | * ``'replace'`` Replace with a suitable replacement character. | 
|  | 424 |  | 
|  | 425 | * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference. | 
|  | 426 |  | 
|  | 427 | * ``'backslashreplace'`` Replace with backslashed escape sequences. | 
|  | 428 |  | 
|  | 429 | The *errors* argument will be assigned to an attribute of the same name. | 
|  | 430 | Assigning to this attribute makes it possible to switch between different error | 
|  | 431 | handling strategies during the lifetime of the :class:`IncrementalEncoder` | 
|  | 432 | object. | 
|  | 433 |  | 
|  | 434 | The set of allowed values for the *errors* argument can be extended with | 
|  | 435 | :func:`register_error`. | 
|  | 436 |  | 
|  | 437 |  | 
|  | 438 | .. method:: IncrementalEncoder.encode(object[, final]) | 
|  | 439 |  | 
|  | 440 | Encodes *object* (taking the current state of the encoder into account) and | 
|  | 441 | returns the resulting encoded object. If this is the last call to :meth:`encode` | 
|  | 442 | *final* must be true (the default is false). | 
|  | 443 |  | 
|  | 444 |  | 
|  | 445 | .. method:: IncrementalEncoder.reset() | 
|  | 446 |  | 
|  | 447 | Reset the encoder to the initial state. | 
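A brief sketch of the incremental encoder interface, assuming a current Python 3 interpreter. It shows the *errors* attribute being reassigned between calls to switch error handling strategies.

```python
import codecs

encoder = codecs.getincrementalencoder("ascii")(errors="replace")
first = encoder.encode("\u00e9")      # replaced with '?'

# Assigning to the attribute changes the strategy for later calls.
encoder.errors = "ignore"
second = encoder.encode("\u00e9")     # silently dropped

# reset() returns the encoder to its initial state before reuse.
encoder.reset()
last = encoder.encode("ok", final=True)
```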
|  | 448 |  | 
|  | 449 |  | 
|  | 450 | .. _incremental-decoder-objects: | 
|  | 451 |  | 
|  | 452 | IncrementalDecoder Objects | 
|  | 453 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  | 454 |  | 
|  | 455 | The :class:`IncrementalDecoder` class is used for decoding an input in multiple | 
|  | 456 | steps. It defines the following methods which every incremental decoder must | 
|  | 457 | define in order to be compatible with the Python codec registry. | 
|  | 458 |  | 
|  | 459 |  | 
|  | 460 | .. class:: IncrementalDecoder([errors]) | 
|  | 461 |  | 
|  | 462 | Constructor for an :class:`IncrementalDecoder` instance. | 
|  | 463 |  | 
|  | 464 | All incremental decoders must provide this constructor interface. They are free | 
|  | 465 | to add additional keyword arguments, but only the ones defined here are used by | 
|  | 466 | the Python codec registry. | 
|  | 467 |  | 
|  | 468 | The :class:`IncrementalDecoder` may implement different error handling schemes | 
|  | 469 | by providing the *errors* keyword argument. These parameters are predefined: | 
|  | 470 |  | 
|  | 471 | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
|  | 472 |  | 
|  | 473 | * ``'ignore'`` Ignore the character and continue with the next. | 
|  | 474 |  | 
|  | 475 | * ``'replace'`` Replace with a suitable replacement character. | 
|  | 476 |  | 
|  | 477 | The *errors* argument will be assigned to an attribute of the same name. | 
|  | 478 | Assigning to this attribute makes it possible to switch between different error | 
|  | 479 | handling strategies during the lifetime of the :class:`IncrementalDecoder` | 
|  | 480 | object. | 
|  | 481 |  | 
|  | 482 | The set of allowed values for the *errors* argument can be extended with | 
|  | 483 | :func:`register_error`. | 
|  | 484 |  | 
|  | 485 |  | 
|  | 486 | .. method:: IncrementalDecoder.decode(object[, final]) | 
|  | 487 |  | 
|  | 488 | Decodes *object* (taking the current state of the decoder into account) and | 
|  | 489 | returns the resulting decoded object. If this is the last call to :meth:`decode` | 
|  | 490 | *final* must be true (the default is false). If *final* is true the decoder must | 
|  | 491 | decode the input completely and must flush all buffers. If this isn't possible | 
|  | 492 | (e.g. because of incomplete byte sequences at the end of the input) it must | 
|  | 493 | initiate error handling just like in the stateless case (which might raise an | 
|  | 494 | exception). | 
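The role of *final* can be illustrated as follows, assuming a current Python 3 interpreter. An incomplete UTF-8 sequence is buffered while *final* is false, but triggers error handling once *final* is true.

```python
import codecs

# The euro sign arrives in two pieces; the first call buffers the partial bytes.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xe2\x82")
second = decoder.decode(b"\xac", final=True)

# With final=True the decoder may not buffer, so a truncated sequence is an error.
truncated = codecs.getincrementaldecoder("utf-8")()
try:
    truncated.decode(b"\xe2\x82", final=True)
except UnicodeDecodeError:
    truncated_failed = True
else:
    truncated_failed = False
```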
|  | 495 |  | 
|  | 496 |  | 
|  | 497 | .. method:: IncrementalDecoder.reset() | 
|  | 498 |  | 
|  | 499 | Reset the decoder to the initial state. | 
|  | 500 |  | 
|  | 501 | The :class:`StreamWriter` and :class:`StreamReader` classes provide generic | 
|  | 502 | working interfaces which can be used to implement new encoding submodules very | 
|  | 503 | easily. See :mod:`encodings.utf_8` for an example of how this is done. | 
|  | 504 |  | 
|  | 505 |  | 
|  | 506 | .. _stream-writer-objects: | 
|  | 507 |  | 
|  | 508 | StreamWriter Objects | 
|  | 509 | ^^^^^^^^^^^^^^^^^^^^ | 
|  | 510 |  | 
|  | 511 | The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the | 
|  | 512 | following methods which every stream writer must define in order to be | 
|  | 513 | compatible with the Python codec registry. | 
|  | 514 |  | 
|  | 515 |  | 
|  | 516 | .. class:: StreamWriter(stream[, errors]) | 
|  | 517 |  | 
|  | 518 | Constructor for a :class:`StreamWriter` instance. | 
|  | 519 |  | 
|  | 520 | All stream writers must provide this constructor interface. They are free to add | 
|  | 521 | additional keyword arguments, but only the ones defined here are used by the | 
|  | 522 | Python codec registry. | 
|  | 523 |  | 
|  | 524 | *stream* must be a file-like object open for writing binary data. | 
|  | 525 |  | 
|  | 526 | The :class:`StreamWriter` may implement different error handling schemes by | 
|  | 527 | providing the *errors* keyword argument. These parameters are predefined: | 
|  | 528 |  | 
|  | 529 | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
|  | 530 |  | 
|  | 531 | * ``'ignore'`` Ignore the character and continue with the next. | 
|  | 532 |  | 
|  | 533 | * ``'replace'`` Replace with a suitable replacement character. | 
|  | 534 |  | 
|  | 535 | * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference. | 
|  | 536 |  | 
|  | 537 | * ``'backslashreplace'`` Replace with backslashed escape sequences. | 
|  | 538 |  | 
|  | 539 | The *errors* argument will be assigned to an attribute of the same name. | 
|  | 540 | Assigning to this attribute makes it possible to switch between different error | 
|  | 541 | handling strategies during the lifetime of the :class:`StreamWriter` object. | 
|  | 542 |  | 
|  | 543 | The set of allowed values for the *errors* argument can be extended with | 
|  | 544 | :func:`register_error`. | 
|  | 545 |  | 
|  | 546 |  | 
|  | 547 | .. method:: StreamWriter.write(object) | 
|  | 548 |  | 
|  | 549 | Writes the object's contents encoded to the stream. | 
|  | 550 |  | 
|  | 551 |  | 
|  | 552 | .. method:: StreamWriter.writelines(list) | 
|  | 553 |  | 
|  | 554 | Writes the concatenated list of strings to the stream (possibly by reusing the | 
|  | 555 | :meth:`write` method). | 
|  | 556 |  | 
|  | 557 |  | 
|  | 558 | .. method:: StreamWriter.reset() | 
|  | 559 |  | 
|  | 560 | Flushes and resets the codec buffers used for keeping state. | 
|  | 561 |  | 
|  | 562 | Calling this method should ensure that the data on the output is put into a | 
|  | 563 | clean state that allows appending of new fresh data without having to rescan the | 
|  | 564 | whole stream to recover state. | 
|  | 565 |  | 
|  | 566 | In addition to the above methods, the :class:`StreamWriter` must also inherit | 
|  | 567 | all other methods and attributes from the underlying stream. | 
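A minimal sketch of the stream writer interface, assuming a current Python 3 interpreter and an in-memory ``io.BytesIO`` object as the underlying binary stream.

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("ascii")(buffer, errors="xmlcharrefreplace")

writer.write("\u00e9")            # written as an XML character reference
writer.writelines(["a", "b"])     # the strings are concatenated and written

# The error handling strategy can be changed while the writer is in use.
writer.errors = "replace"
writer.write("\u00e9")            # now written as '?'

# Attributes that the writer does not define are taken from the stream.
position = writer.tell()
written = buffer.getvalue()
```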
|  | 568 |  | 
|  | 569 |  | 
|  | 570 | .. _stream-reader-objects: | 
|  | 571 |  | 
|  | 572 | StreamReader Objects | 
|  | 573 | ^^^^^^^^^^^^^^^^^^^^ | 
|  | 574 |  | 
|  | 575 | The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the | 
|  | 576 | following methods which every stream reader must define in order to be | 
|  | 577 | compatible with the Python codec registry. | 
|  | 578 |  | 
|  | 579 |  | 
|  | 580 | .. class:: StreamReader(stream[, errors]) | 
|  | 581 |  | 
|  | 582 | Constructor for a :class:`StreamReader` instance. | 
|  | 583 |  | 
|  | 584 | All stream readers must provide this constructor interface. They are free to add | 
|  | 585 | additional keyword arguments, but only the ones defined here are used by the | 
|  | 586 | Python codec registry. | 
|  | 587 |  | 
|  | 588 | *stream* must be a file-like object open for reading (binary) data. | 
|  | 589 |  | 
|  | 590 | The :class:`StreamReader` may implement different error handling schemes by | 
|  | 591 | providing the *errors* keyword argument. These parameters are predefined: | 
|  | 592 |  | 
|  | 593 | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
|  | 594 |  | 
|  | 595 | * ``'ignore'`` Ignore the character and continue with the next. | 
|  | 596 |  | 
|  | 597 | * ``'replace'`` Replace with a suitable replacement character. | 
|  | 598 |  | 
|  | 599 | The *errors* argument will be assigned to an attribute of the same name. | 
|  | 600 | Assigning to this attribute makes it possible to switch between different error | 
|  | 601 | handling strategies during the lifetime of the :class:`StreamReader` object. | 
|  | 602 |  | 
|  | 603 | The set of allowed values for the *errors* argument can be extended with | 
|  | 604 | :func:`register_error`. | 
|  | 605 |  | 
|  | 606 |  | 
|  | 607 | .. method:: StreamReader.read([size[, chars[, firstline]]]) | 
|  | 608 |  | 
|  | 609 | Decodes data from the stream and returns the resulting object. | 
|  | 610 |  | 
|  | 611 | *chars* indicates the number of characters to read from the stream. :meth:`read` | 
|  | 612 | will never return more than *chars* characters, but it might return fewer if | 
|  | 613 | there are not enough characters available. | 
|  | 614 |  | 
|  | 615 | *size* indicates the approximate maximum number of bytes to read from the stream | 
|  | 616 | for decoding purposes. The decoder can modify this setting as appropriate. The | 
|  | 617 | default value -1 indicates to read and decode as much as possible.  *size* is | 
|  | 618 | intended to prevent having to decode huge files in one step. | 
|  | 619 |  | 
|  | 620 | *firstline* indicates that it would be sufficient to only return the first line, | 
|  | 621 | if there are decoding errors on later lines. | 
|  | 622 |  | 
|  | 623 | The method should use a greedy read strategy, meaning that it should read as much | 
|  | 624 | data as is allowed within the definition of the encoding and the given size; | 
|  | 625 | e.g. if optional encoding endings or state markers are available on the stream, | 
|  | 626 | these should be read too. | 
|  | 627 |  | 
|  | 628 | .. versionchanged:: 2.4 | 
|  | 629 | *chars* argument added. | 
|  | 630 |  | 
|  | 631 | .. versionchanged:: 2.4.2 | 
|  | 632 | *firstline* argument added. | 
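The effect of *chars* can be seen in a small sketch. It assumes a Python 3 interpreter and uses an in-memory stream with made-up sample text; the behaviour of a particular codec's reader may differ in how many bytes it pulls from the stream internally.

```python
import codecs
import io

# Hypothetical sample text containing two non-ASCII characters.
stream = io.BytesIO("h\u00e9llo w\u00f6rld".encode("utf-8"))
reader = codecs.getreader("utf-8")(stream)

# Ask for at most three decoded characters (not bytes).
first = reader.read(chars=3)

# A plain read() returns whatever is left, including buffered characters.
rest = reader.read()
```

Note that *chars* counts decoded characters, so the three characters returned here correspond to four bytes of the underlying stream.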
|  | 633 |  | 
|  | 634 |  | 
|  | 635 | .. method:: StreamReader.readline([size[, keepends]]) | 
|  | 636 |  | 
|  | 637 | Read one line from the input stream and return the decoded data. | 
|  | 638 |  | 
|  | 639 | *size*, if given, is passed as size argument to the stream's :meth:`readline` | 
|  | 640 | method. | 
|  | 641 |  | 
|  | 642 | If *keepends* is false, line endings will be stripped from the lines returned. | 
|  | 643 |  | 
|  | 644 | .. versionchanged:: 2.4 | 
|  | 645 | *keepends* argument added. | 
|  | 646 |  | 
|  | 647 |  | 
|  | 648 | .. method:: StreamReader.readlines([sizehint[, keepends]]) | 
|  | 649 |  | 
|  | 650 | Read all lines available on the input stream and return them as a list of lines. | 
|  | 651 |  | 
|  | 652 | Line-endings are implemented using the codec's decoder method and are included | 
|  | 653 | in the list entries if *keepends* is true. | 
|  | 654 |  | 
|  | 655 | *sizehint*, if given, is passed as the *size* argument to the stream's | 
|  | 656 | :meth:`read` method. | 
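A short sketch of how *keepends* changes the results of :meth:`readline` and :meth:`readlines`, assuming a Python 3 interpreter and a made-up three-line sample held in memory:

```python
import codecs
import io

data = "one\ntwo\nthree\n".encode("utf-8")  # hypothetical sample input
Reader = codecs.getreader("utf-8")

# readline() keeps the line ending unless keepends is false.
with_end = Reader(io.BytesIO(data)).readline()
without_end = Reader(io.BytesIO(data)).readline(keepends=False)

# readlines() applies the same rule to every line it returns.
lines = Reader(io.BytesIO(data)).readlines()
stripped = Reader(io.BytesIO(data)).readlines(keepends=False)
```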
|  | 657 |  | 
|  | 658 |  | 
|  | 659 | .. method:: StreamReader.reset() | 
|  | 660 |  | 
|  | 661 | Resets the codec buffers used for keeping state. | 
|  | 662 |  | 
|  | 663 | Note that no stream repositioning should take place.  This method is primarily | 
|  | 664 | intended to be able to recover from decoding errors. | 
|  | 665 |  | 
|  | 666 | In addition to the above methods, the :class:`StreamReader` must also inherit | 
|  | 667 | all other methods and attributes from the underlying stream. | 
|  | 668 |  | 
|  | 669 | The next two base classes are included for convenience. They are not needed by | 
|  | 670 | the codec registry, but may prove useful in practice. | 
|  | 671 |  | 
|  | 672 |  | 
|  | 673 | .. _stream-reader-writer: | 
|  | 674 |  | 
|  | 675 | StreamReaderWriter Objects | 
|  | 676 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  | 677 |  | 
|  | 678 | The :class:`StreamReaderWriter` allows wrapping streams which work in both read | 
|  | 679 | and write modes. | 
|  | 680 |  | 
|  | 681 | The design is such that one can use the factory functions returned by the | 
|  | 682 | :func:`lookup` function to construct the instance. | 
|  | 683 |  | 
|  | 684 |  | 
|  | 685 | .. class:: StreamReaderWriter(stream, Reader, Writer, errors) | 
|  | 686 |  | 
|  | 687 | Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like | 
|  | 688 | object. *Reader* and *Writer* must be factory functions or classes providing the | 
|  | 689 | :class:`StreamReader` and :class:`StreamWriter` interface, respectively. Error handling | 
|  | 690 | is done in the same way as defined for the stream readers and writers. | 
|  | 691 |  | 
|  | 692 | :class:`StreamReaderWriter` instances define the combined interfaces of | 
|  | 693 | :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other | 
|  | 694 | methods and attributes from the underlying stream. | 
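One possible way to build such an object from the result of :func:`lookup` is sketched below. It assumes a Python 3 interpreter and uses an in-memory stream as a stand-in for a file opened for both reading and writing; the sample text is arbitrary.

```python
import codecs
import io

info = codecs.lookup("utf-8")
backend = io.BytesIO()  # stand-in for a binary file opened read/write

rw = codecs.StreamReaderWriter(
    backend, info.streamreader, info.streamwriter, "strict"
)

# Writing goes through the stream writer, which encodes to UTF-8.
rw.write("gr\u00fc\u00df dich")
encoded = backend.getvalue()

# Rewind the underlying stream, then read back through the stream reader.
backend.seek(0)
decoded = rw.read()
```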
|  | 695 |  | 
|  | 696 |  | 
|  | 697 | .. _stream-recoder-objects: | 
|  | 698 |  | 
|  | 699 | StreamRecoder Objects | 
|  | 700 | ^^^^^^^^^^^^^^^^^^^^^ | 
|  | 701 |  | 
|  | 702 | The :class:`StreamRecoder` provides a frontend-backend view of encoding data | 
|  | 703 | which is sometimes useful when dealing with different encoding environments. | 
|  | 704 |  | 
|  | 705 | The design is such that one can use the factory functions returned by the | 
|  | 706 | :func:`lookup` function to construct the instance. | 
|  | 707 |  | 
|  | 708 |  | 
|  | 709 | .. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors) | 
|  | 710 |  | 
|  | 711 | Creates a :class:`StreamRecoder` instance which implements a two-way conversion: | 
|  | 712 | *encode* and *decode* work on the frontend (the input to :meth:`read` and output | 
|  | 713 | of :meth:`write`) while *Reader* and *Writer* work on the backend (reading and | 
|  | 714 | writing to the stream). | 
|  | 715 |  | 
|  | 716 | You can use these objects to do transparent direct recodings from e.g. Latin-1 | 
|  | 717 | to UTF-8 and back. | 
|  | 718 |  | 
|  | 719 | *stream* must be a file-like object. | 
|  | 720 |  | 
|  | 721 | *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*, | 
|  | 722 | *Writer* must be factory functions or classes providing objects of the | 
|  | 723 | :class:`StreamReader` and :class:`StreamWriter` interface respectively. | 
|  | 724 |  | 
|  | 725 | *encode* and *decode* are needed for the frontend translation, *Reader* and | 
|  | 726 | *Writer* for the backend translation.  The intermediate format used is | 
|  | 727 | determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode | 
|  | 728 | as the intermediate encoding. | 
|  | 729 |  | 
|  | 730 | Error handling is done in the same way as defined for the stream readers and | 
|  | 731 | writers. | 
|  | 732 |  | 
|  | 733 | :class:`StreamRecoder` instances define the combined interfaces of | 
|  | 734 | :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other | 
|  | 735 | methods and attributes from the underlying stream. | 
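The Latin-1 to UTF-8 case mentioned above might look roughly like the following sketch. It assumes a Python 3 interpreter, where the frontend data are ``bytes`` in Latin-1 and the backend stream holds UTF-8; the in-memory stream and sample bytes are made up for the example.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")  # frontend: what the caller sees
utf8 = codecs.lookup("utf-8")      # backend: what the stream stores

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend,
    latin1.encode, latin1.decode,          # frontend translation
    utf8.streamreader, utf8.streamwriter,  # backend translation
    "strict",
)

# Latin-1 bytes written at the frontend end up as UTF-8 in the stream.
recoder.write(b"caf\xe9")
stored = backend.getvalue()

# Reading converts the stored UTF-8 back into Latin-1 bytes.
backend.seek(0)
front = recoder.read()
```

The helper :func:`EncodedFile` wraps the same construction for the common case of recoding a file.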
|  | 736 |  | 
|  | 737 |  | 
|  | 738 | .. _encodings-overview: | 
|  | 739 |  | 
|  | 740 | Encodings and Unicode | 
|  | 741 | --------------------- | 
|  | 742 |  | 
|  | 743 | Unicode strings are stored internally as sequences of codepoints (to be precise | 
|  | 744 | as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either | 
|  | 745 | via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the | 
|  | 746 | former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data | 
|  | 747 | type. Once a Unicode object is used outside of CPU and memory, CPU endianness | 
|  | 748 | and how these arrays are stored as bytes become an issue.  Transforming a | 
|  | 749 | unicode object into a sequence of bytes is called encoding and recreating the | 
|  | 750 | unicode object from the sequence of bytes is known as decoding.  There are many | 
|  | 751 | different methods for how this transformation can be done (these methods are | 
|  | 752 | also called encodings). The simplest method is to map the codepoints 0-255 to | 
|  | 753 | the bytes ``0x0``-``0xff``. This means that a unicode object that contains | 
|  | 754 | codepoints above ``U+00FF`` can't be encoded with this method (which is called | 
|  | 755 | ``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise a | 
|  | 756 | :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1' | 
|  | 757 | codec can't encode character u'\u1234' in position 3: ordinal not in | 
|  | 758 | range(256)``. | 
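The limitation can be reproduced with a few lines. This sketch assumes a Python 3 interpreter, where ``str.encode`` plays the role of :func:`unicode.encode`; the sample strings are arbitrary.

```python
# Code points up to U+00FF map directly to single bytes.
encoded = "ab\u00ff".encode("latin-1")

# Anything above U+00FF cannot be represented and raises an error.
try:
    "abc\u1234".encode("latin-1")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start          # index of the offending character
    is_value_error = isinstance(exc, ValueError)
```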
|  | 759 |  | 
|  | 760 | There's another group of encodings (the so-called charmap encodings) that choose | 
|  | 761 | a different subset of all Unicode code points and how these code points are | 
|  | 762 | mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open | 
|  | 763 | e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on | 
|  | 764 | Windows). There's a string constant with 256 characters that shows you which | 
|  | 765 | character is mapped to which byte value. | 
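To see how two charmap encodings disagree about the same byte, consider the sketch below. It assumes a Python 3 interpreter and that the ``encodings.cp1252`` module exposes its mapping string under the name ``decoding_table``, as current CPython versions do.

```python
import encodings.cp1252

byte = b"\x80"

# cp1252 assigns EURO SIGN to 0x80; latin-1 maps it to the control
# character U+0080.
as_cp1252 = byte.decode("cp1252")
as_latin1 = byte.decode("latin-1")

# The mapping string has one entry per byte value.
table = encodings.cp1252.decoding_table
table_size = len(table)
entry = table[0x80]
```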
|  | 766 |  | 
|  | 767 | All of these encodings can only encode 256 of the 65536 (or 1114112) code points | 
|  | 768 | defined in Unicode. A simple and straightforward way to store each Unicode | 
|  | 769 | code point is to store it as two consecutive bytes. There are two | 
|  | 770 | possibilities: Store the bytes in big endian or in little endian order. These | 
|  | 771 | two encodings are called UTF-16-BE and UTF-16-LE respectively. Their | 
|  | 772 | disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you | 
|  | 773 | will always have to swap bytes on encoding and decoding. UTF-16 avoids this | 
|  | 774 | problem: Bytes will always be in natural endianness. When these bytes are read | 
|  | 775 | by a CPU with a different endianness, then bytes have to be swapped though. To | 
|  | 776 | be able to detect the endianness of a UTF-16 byte sequence, there's the | 
|  | 777 | so-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``. | 
|  | 778 | This character will be prepended to every UTF-16 byte sequence. The byte swapped | 
|  | 779 | version of this character (``0xFFFE``) is an illegal character that may not | 
|  | 780 | appear in a Unicode text. So when the first character in a UTF-16 byte sequence | 
|  | 781 | appears to be a ``U+FFFE``, the bytes have to be swapped on decoding. | 
|  | 782 | Unfortunately, up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as | 
|  | 783 | a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow | 
|  | 784 | a word to be split. It can e.g. be used to give hints to a ligature algorithm. | 
|  | 785 | With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been | 
|  | 786 | deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless | 
|  | 787 | Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM | 
|  | 788 | it's a device to determine the storage layout of the encoded bytes, and vanishes | 
|  | 789 | once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH | 
|  | 790 | NO-BREAK SPACE`` it's a normal character that will be decoded like any other. | 
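The two roles of ``U+FEFF`` can be observed with the UTF-16 codecs. The sketch below assumes a Python 3 interpreter and uses the ``BOM_UTF16_BE`` and ``BOM_UTF16_LE`` constants from this module; the one-letter sample is arbitrary.

```python
import codecs

be = "A".encode("utf-16-be")   # high byte first
le = "A".encode("utf-16-le")   # low byte first

# The generic codec prepends a BOM in the machine's native byte order.
with_bom = "A".encode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# On decoding, the BOM selects the byte order and then vanishes.
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")

# With an explicit byte order, U+FEFF is kept as an ordinary character.
kept = (codecs.BOM_UTF16_LE + le).decode("utf-16-le")
```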
|  | 791 |  | 
|  | 792 | There's another encoding that is able to encode the full range of Unicode | 
|  | 793 | characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues | 
|  | 794 | with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two | 
|  | 795 | parts: Marker bits (the most significant bits) and payload bits. The marker bits | 
|  | 796 | are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are | 
|  | 797 | encoded like this (with x being payload bits, which when concatenated give the | 
|  | 798 | Unicode character): | 
|  | 799 |  | 
|  | 800 | +-----------------------------------+----------------------------------------------+ | 
|  | 801 | | Range                             | Encoding                                     | | 
|  | 802 | +===================================+==============================================+ | 
|  | 803 | | ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     | | 
|  | 804 | +-----------------------------------+----------------------------------------------+ | 
|  | 805 | | ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            | | 
|  | 806 | +-----------------------------------+----------------------------------------------+ | 
|  | 807 | | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   | | 
|  | 808 | +-----------------------------------+----------------------------------------------+ | 
|  | 809 | | ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          | | 
|  | 810 | +-----------------------------------+----------------------------------------------+ | 
|  | 811 | | ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | | 
|  | 812 | +-----------------------------------+----------------------------------------------+ | 
|  | 813 | | ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | | 
|  | 814 | |                                   | 10xxxxxx                                     | | 
|  | 815 | +-----------------------------------+----------------------------------------------+ | 
|  | 816 |  | 
|  | 817 | The least significant bit of the Unicode character is the rightmost x bit. | 
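The bit layout in the table can be turned into a small encoder. The function below is an illustrative sketch only (``utf8_bytes`` is a made-up name, not part of this module) and assumes a Python 3 interpreter. It follows the table literally, including the five- and six-byte forms, whereas the built-in codec only accepts code points up to ``U+10FFFF``.

```python
def utf8_bytes(code_point):
    """Encode one code point following the bit layout in the table."""
    if code_point < 0x80:
        return bytes([code_point])              # 0xxxxxxx
    # (exclusive upper bound, lead-byte marker bits, continuation bytes)
    for limit, marker, trailing in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if code_point < limit:
            break
    else:
        raise ValueError("code point out of range")
    out = []
    for _ in range(trailing):
        out.append(0x80 | (code_point & 0x3F))  # 10xxxxxx, lowest bits first
        code_point >>= 6
    out.append(marker | code_point)             # lead byte with remaining bits
    return bytes(reversed(out))


# Compare with the built-in codec for a few sample code points.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
matches = [utf8_bytes(cp) == chr(cp).encode("utf-8") for cp in samples]
```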
|  | 818 |  | 
|  | 819 | As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in | 
|  | 820 | the decoded Unicode string (even if it's the first character) is treated as a | 
|  | 821 | ``ZERO WIDTH NO-BREAK SPACE``. | 
|  | 822 |  | 
|  | 823 | Without external information it's impossible to reliably determine which | 
|  | 824 | encoding was used for encoding a Unicode string. Each charmap encoding can | 
|  | 825 | decode any random byte sequence. However that's not possible with UTF-8, as | 
|  | 826 | UTF-8 byte sequences have a structure that doesn't allow arbitrary byte | 
| Walter Dörwald | 73f83d2 | 2007-09-01 18:34:05 +0000 | [diff] [blame] | 827 | sequences. To increase the reliability with which a UTF-8 encoding can be | 
| Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 828 | detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls | 
|  | 829 | ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters | 
|  | 830 | is written to the file, a UTF-8 encoded BOM (which looks like this as a byte | 
|  | 831 | sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable | 
|  | 832 | that any charmap encoded file starts with these byte values (which would e.g. | 
|  | 833 | map to | 
|  | 834 |  | 
|  | 835 | | LATIN SMALL LETTER I WITH DIAERESIS | 
|  | 836 | | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK | 
|  | 837 | | INVERTED QUESTION MARK | 
|  | 838 |  | 
|  | 839 | in iso-8859-1), this increases the probability that a utf-8-sig encoding can be | 
|  | 840 | correctly guessed from the byte sequence. So here the BOM is not used to be able | 
|  | 841 | to determine the byte order used for generating the byte sequence, but as a | 
|  | 842 | signature that helps in guessing the encoding. On encoding the utf-8-sig codec | 
|  | 843 | will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On | 
|  | 844 | decoding utf-8-sig will skip those three bytes if they appear as the first three | 
|  | 845 | bytes in the file. | 
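A brief sketch of the difference between the two codecs, assuming a Python 3 interpreter and using the ``BOM_UTF8`` constant from this module:

```python
import codecs

with_sig = "abc".encode("utf-8-sig")   # signature followed by the text
plain = "abc".encode("utf-8")

# utf-8-sig strips a leading signature if present ...
decoded_sig = with_sig.decode("utf-8-sig")
# ... and decodes plain UTF-8 unchanged when it is absent.
no_sig = plain.decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as a U+FEFF character.
decoded_plain = with_sig.decode("utf-8")
```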
|  | 846 |  | 
|  | 847 |  | 
|  | 848 | .. _standard-encodings: | 
|  | 849 |  | 
|  | 850 | Standard Encodings | 
|  | 851 | ------------------ | 
|  | 852 |  | 
|  | 853 | Python comes with a number of codecs built-in, either implemented as C functions | 
|  | 854 | or with dictionaries as mapping tables. The following table lists the codecs by | 
|  | 855 | name, together with a few common aliases, and the languages for which the | 
|  | 856 | encoding is likely used. Neither the list of aliases nor the list of languages | 
|  | 857 | is meant to be exhaustive. Notice that spelling alternatives that only differ in | 
|  | 858 | case or use a hyphen instead of an underscore are also valid aliases. | 
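Alias handling can be checked with :func:`lookup`. The sketch below assumes a Python 3 interpreter; the list of spellings is just a sample, and the canonical name reported by the codec is deliberately not spelled out because it may differ between versions.

```python
import codecs

# Several spellings that should all resolve to the same codec.
spellings = ["latin_1", "latin-1", "LATIN-1", "iso-8859-1", "L1", "cp819"]

names = {codecs.lookup(spelling).name for spelling in spellings}
all_same = len(names) == 1
```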
|  | 859 |  | 
|  | 860 | Many of the character sets support the same languages. They vary in individual | 
|  | 861 | characters (e.g. whether the EURO SIGN is supported or not), and in the | 
|  | 862 | assignment of characters to code positions. For the European languages in | 
|  | 863 | particular, the following variants typically exist: | 
|  | 864 |  | 
|  | 865 | * an ISO 8859 codeset | 
|  | 866 |  | 
|  | 867 | * a Microsoft Windows code page, which is typically derived from an 8859 codeset, | 
|  | 868 | but replaces control characters with additional graphic characters | 
|  | 869 |  | 
|  | 870 | * an IBM EBCDIC code page | 
|  | 871 |  | 
|  | 872 | * an IBM PC code page, which is ASCII compatible | 
|  | 873 |  | 
|  | 874 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 875 | | Codec           | Aliases                        | Languages                      | | 
|  | 876 | +=================+================================+================================+ | 
|  | 877 | | ascii           | 646, us-ascii                  | English                        | | 
|  | 878 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 879 | | big5            | big5-tw, csbig5                | Traditional Chinese            | | 
|  | 880 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 881 | | big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            | | 
|  | 882 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 883 | | cp037           | IBM037, IBM039                 | English                        | | 
|  | 884 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 885 | | cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         | | 
|  | 886 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 887 | | cp437           | 437, IBM437                    | English                        | | 
|  | 888 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 889 | | cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 | | 
|  | 890 | |                 | IBM500                         |                                | | 
|  | 891 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 892 | | cp737           |                                | Greek                          | | 
|  | 893 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 894 | | cp775           | IBM775                         | Baltic languages               | | 
|  | 895 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 896 | | cp850           | 850, IBM850                    | Western Europe                 | | 
|  | 897 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 898 | | cp852           | 852, IBM852                    | Central and Eastern Europe     | | 
|  | 899 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 900 | | cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       | | 
|  | 901 | |                 |                                | Macedonian, Russian, Serbian   | | 
|  | 902 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 903 | | cp856           |                                | Hebrew                         | | 
|  | 904 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 905 | | cp857           | 857, IBM857                    | Turkish                        | | 
|  | 906 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 907 | | cp860           | 860, IBM860                    | Portuguese                     | | 
|  | 908 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 909 | | cp861           | 861, CP-IS, IBM861             | Icelandic                      | | 
|  | 910 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 911 | | cp862           | 862, IBM862                    | Hebrew                         | | 
|  | 912 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 913 | | cp863           | 863, IBM863                    | Canadian                       | | 
|  | 914 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 915 | | cp864           | IBM864                         | Arabic                         | | 
|  | 916 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 917 | | cp865           | 865, IBM865                    | Danish, Norwegian              | | 
|  | 918 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 919 | | cp866           | 866, IBM866                    | Russian                        | | 
|  | 920 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 921 | | cp869           | 869, CP-GR, IBM869             | Greek                          | | 
|  | 922 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 923 | | cp874           |                                | Thai                           | | 
|  | 924 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 925 | | cp875           |                                | Greek                          | | 
|  | 926 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 927 | | cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       | | 
|  | 928 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 929 | | cp949           | 949, ms949, uhc                | Korean                         | | 
|  | 930 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 931 | | cp950           | 950, ms950                     | Traditional Chinese            | | 
|  | 932 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 933 | | cp1006          |                                | Urdu                           | | 
|  | 934 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 935 | | cp1026          | ibm1026                        | Turkish                        | | 
|  | 936 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 937 | | cp1140          | ibm1140                        | Western Europe                 | | 
|  | 938 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 939 | | cp1250          | windows-1250                   | Central and Eastern Europe     | | 
|  | 940 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 941 | | cp1251          | windows-1251                   | Bulgarian, Byelorussian,       | | 
|  | 942 | |                 |                                | Macedonian, Russian, Serbian   | | 
|  | 943 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 944 | | cp1252          | windows-1252                   | Western Europe                 | | 
|  | 945 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 946 | | cp1253          | windows-1253                   | Greek                          | | 
|  | 947 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 948 | | cp1254          | windows-1254                   | Turkish                        | | 
|  | 949 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 950 | | cp1255          | windows-1255                   | Hebrew                         | | 
|  | 951 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 952 | | cp1256          | windows-1256                   | Arabic                         | | 
|  | 953 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 954 | | cp1257          | windows-1257                   | Baltic languages               | | 
|  | 955 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 956 | | cp1258          | windows-1258                   | Vietnamese                     | | 
|  | 957 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 958 | | euc_jp          | eucjp, ujis, u-jis             | Japanese                       | | 
|  | 959 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 960 | | euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       | | 
|  | 961 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 962 | | euc_jisx0213    | eucjisx0213                    | Japanese                       | | 
|  | 963 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 964 | | euc_kr          | euckr, korean, ksc5601,        | Korean                         | | 
|  | 965 | |                 | ks_c-5601, ks_c-5601-1987,     |                                | | 
|  | 966 | |                 | ksx1001, ks_x-1001             |                                | | 
|  | 967 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 968 | | gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             | | 
|  | 969 | |                 | cn, euccn, eucgb2312-cn,       |                                | | 
|  | 970 | |                 | gb2312-1980, gb2312-80, iso-   |                                | | 
|  | 971 | |                 | ir-58                          |                                | | 
|  | 972 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 973 | | gbk             | 936, cp936, ms936              | Unified Chinese                | | 
|  | 974 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 975 | | gb18030         | gb18030-2000                   | Unified Chinese                | | 
|  | 976 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 977 | | hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             | | 
|  | 978 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 979 | | iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       | | 
|  | 980 | |                 | iso-2022-jp                    |                                | | 
|  | 981 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 982 | | iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       | | 
|  | 983 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 984 | | iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   | | 
|  | 985 | |                 |                                | Chinese, Western Europe, Greek | | 
|  | 986 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 987 | | iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       | | 
|  | 988 | |                 | iso-2022-jp-2004               |                                | | 
|  | 989 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 990 | | iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       | | 
|  | 991 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 992 | | iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       | | 
|  | 993 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 994 | | iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         | | 
|  | 995 | |                 | iso-2022-kr                    |                                | | 
|  | 996 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 997 | | latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 | | 
|  | 998 | |                 | cp819, latin, latin1, L1       |                                | | 
|  | 999 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1000 | | iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     | | 
|  | 1001 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1002 | | iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             | | 
|  | 1003 | +-----------------+--------------------------------+--------------------------------+ | 
| Georg Brandl | 907a720 | 2008-02-22 12:31:45 +0000 | [diff] [blame^] | 1004 | | iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               | | 
| Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 1005 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1006 | | iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       | | 
|  | 1007 | |                 |                                | Macedonian, Russian, Serbian   | | 
|  | 1008 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1009 | | iso8859_6       | iso-8859-6, arabic             | Arabic                         | | 
|  | 1010 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1011 | | iso8859_7       | iso-8859-7, greek, greek8      | Greek                          | | 
|  | 1012 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1013 | | iso8859_8       | iso-8859-8, hebrew             | Hebrew                         | | 
|  | 1014 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1015 | | iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        | | 
|  | 1016 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1017 | | iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               | | 
|  | 1018 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1019 | | iso8859_13      | iso-8859-13                    | Baltic languages               | | 
|  | 1020 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1021 | | iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               | | 
|  | 1022 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1023 | | iso8859_15      | iso-8859-15                    | Western Europe                 | | 
|  | 1024 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1025 | | johab           | cp1361, ms1361                 | Korean                         | | 
|  | 1026 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1027 | | koi8_r          |                                | Russian                        | | 
|  | 1028 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1029 | | koi8_u          |                                | Ukrainian                      | | 
|  | 1030 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1031 | | mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       | | 
|  | 1032 | |                 |                                | Macedonian, Russian, Serbian   | | 
|  | 1033 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1034 | | mac_greek       | macgreek                       | Greek                          | | 
|  | 1035 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1036 | | mac_iceland     | maciceland                     | Icelandic                      | | 
|  | 1037 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1038 | | mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     | | 
|  | 1039 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1040 | | mac_roman       | macroman                       | Western Europe                 | | 
|  | 1041 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1042 | | mac_turkish     | macturkish                     | Turkish                        | | 
|  | 1043 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1044 | | ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         | | 
|  | 1045 | |                 | cyrillic-asian                 |                                | | 
|  | 1046 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1047 | | shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       | | 
|  | 1048 | |                 | s_jis                          |                                | | 
|  | 1049 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1050 | | shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       | | 
|  | 1051 | |                 | sjis2004                       |                                | | 
|  | 1052 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1053 | | shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       | | 
|  | 1054 | |                 | s_jisx0213                     |                                | | 
|  | 1055 | +-----------------+--------------------------------+--------------------------------+ | 
| Walter Dörwald | 6e39080 | 2007-08-17 16:41:28 +0000 | [diff] [blame] | 1056 | | utf_32          | U32, utf32                     | all languages                  | | 
|  | 1057 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1058 | | utf_32_be       | UTF-32BE                       | all languages                  | | 
|  | 1059 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1060 | | utf_32_le       | UTF-32LE                       | all languages                  | | 
|  | 1061 | +-----------------+--------------------------------+--------------------------------+ | 
| Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 1062 | | utf_16          | U16, utf16                     | all languages                  | | 
|  | 1063 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1064 | | utf_16_be       | UTF-16BE                       | all languages (BMP only)       | | 
|  | 1065 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1066 | | utf_16_le       | UTF-16LE                       | all languages (BMP only)       | | 
|  | 1067 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1068 | | utf_7           | U7, unicode-1-1-utf-7          | all languages                  | | 
|  | 1069 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1070 | | utf_8           | U8, UTF, utf8                  | all languages                  | | 
|  | 1071 | +-----------------+--------------------------------+--------------------------------+ | 
|  | 1072 | | utf_8_sig       |                                | all languages                  | | 
|  | 1073 | +-----------------+--------------------------------+--------------------------------+ | 
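As a brief illustration of the table above, the following sketch looks up one codec by its canonical name and by a listed alias, and compares the three UTF-16 variants. The sample string is arbitrary, and the ``u""``/``b""`` literal prefixes assume Python 2.6 or later; this is an example, not a complete description of the lookup rules.

```python
import codecs

# An alias from the table resolves to the same codec as the canonical name.
canonical = codecs.lookup("iso8859_9")
by_alias = codecs.lookup("latin5")
same_codec = canonical.name == by_alias.name

# utf_16 prepends a byte order mark; utf_16_be and utf_16_le do not.
text = u"a"
with_bom = text.encode("utf_16")
big_endian = text.encode("utf_16_be")
little_endian = text.encode("utf_16_le")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```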
|  | 1074 |  | 
|  | 1075 | A number of codecs are specific to Python, so their codec names have no meaning | 
|  | 1076 | outside Python. Some of them don't convert from Unicode strings to byte strings, | 
|  | 1077 | but instead use the property of the Python codecs machinery that any bijective | 
|  | 1078 | function with one argument can be considered as an encoding. | 
|  | 1079 |  | 
|  | 1080 | For the codecs listed below, the result in the "encoding" direction is always a | 
|  | 1081 | byte string. The result of the "decoding" direction is listed as operand type in | 
|  | 1082 | the table. | 
|  | 1083 |  | 
|  | 1084 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1085 | | Codec              | Aliases                   | Operand type   | Purpose                   | | 
|  | 1086 | +====================+===========================+================+===========================+ | 
|  | 1087 | | base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   | | 
|  | 1088 | |                    |                           |                | base64                    | | 
|  | 1089 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1090 | | bz2_codec          | bz2                       | byte string    | Compress the operand      | | 
|  | 1091 | |                    |                           |                | using bz2                 | | 
|  | 1092 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1093 | | hex_codec          | hex                       | byte string    | Convert operand to        | | 
|  | 1094 | |                    |                           |                | hexadecimal               | | 
|  | 1095 | |                    |                           |                | representation, with two  | | 
|  | 1096 | |                    |                           |                | digits per byte           | | 
|  | 1097 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1098 | | idna               |                           | Unicode string | Implements :rfc:`3490`,   | | 
|  | 1099 | |                    |                           |                | see also                  | | 
|  | 1100 | |                    |                           |                | :mod:`encodings.idna`     | | 
|  | 1101 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1102 | | mbcs               | dbcs                      | Unicode string | Windows only: Encode      | | 
|  | 1103 | |                    |                           |                | operand according to the  | | 
|  | 1104 | |                    |                           |                | ANSI codepage (CP_ACP)    | | 
|  | 1105 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1106 | | palmos             |                           | Unicode string | Encoding of PalmOS 3.5    | | 
|  | 1107 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1108 | | punycode           |                           | Unicode string | Implements :rfc:`3492`    | | 
|  | 1109 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1110 | | quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   | | 
|  | 1111 | |                    | quotedprintable           |                | quoted printable          | | 
|  | 1112 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1113 | | raw_unicode_escape |                           | Unicode string | Produce a string that is  | | 
|  | 1114 | |                    |                           |                | suitable as raw Unicode   | | 
|  | 1115 | |                    |                           |                | literal in Python source  | | 
|  | 1116 | |                    |                           |                | code                      | | 
|  | 1117 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1118 | | rot_13             | rot13                     | Unicode string | Returns the Caesar-cypher | | 
|  | 1119 | |                    |                           |                | encryption of the operand | | 
|  | 1120 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1121 | | string_escape      |                           | byte string    | Produce a string that is  | | 
|  | 1122 | |                    |                           |                | suitable as string        | | 
|  | 1123 | |                    |                           |                | literal in Python source  | | 
|  | 1124 | |                    |                           |                | code                      | | 
|  | 1125 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1126 | | undefined          |                           | any            | Raise an exception for    | | 
|  | 1127 | |                    |                           |                | all conversions. Can be   | | 
|  | 1128 | |                    |                           |                | used as the system        | | 
|  | 1129 | |                    |                           |                | encoding if no automatic  | | 
| Georg Brandl | 584265b | 2007-12-02 14:58:50 +0000 | [diff] [blame] | 1130 | |                    |                           |                | :term:`coercion` between  | | 
|  | 1131 | |                    |                           |                | byte and Unicode strings  | | 
|  | 1132 | |                    |                           |                | is desired.               | | 
| Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 1133 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1134 | | unicode_escape     |                           | Unicode string | Produce a string that is  | | 
|  | 1135 | |                    |                           |                | suitable as Unicode       | | 
|  | 1136 | |                    |                           |                | literal in Python source  | | 
|  | 1137 | |                    |                           |                | code                      | | 
|  | 1138 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1139 | | unicode_internal   |                           | Unicode string | Return the internal       | | 
|  | 1140 | |                    |                           |                | representation of the     | | 
|  | 1141 | |                    |                           |                | operand                   | | 
|  | 1142 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1143 | | uu_codec           | uu                        | byte string    | Convert the operand using | | 
|  | 1144 | |                    |                           |                | uuencode                  | | 
|  | 1145 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1146 | | zlib_codec         | zip, zlib                 | byte string    | Compress the operand      | | 
|  | 1147 | |                    |                           |                | using zlib                | | 
|  | 1148 | +--------------------+---------------------------+----------------+---------------------------+ | 
|  | 1149 |  | 
|  | 1150 | .. versionadded:: 2.3 | 
|  | 1151 | The ``idna`` and ``punycode`` encodings. | 
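A short sketch of the byte-string codecs from the table, using the generic ``codecs.encode`` and ``codecs.decode`` functions. The input value is arbitrary, and the ``b""`` literal assumes Python 2.6 or later.

```python
import codecs

data = b"hello"

# Each call names one of the codecs from the table above.
as_base64 = codecs.encode(data, "base64_codec")
as_hex = codecs.encode(data, "hex_codec")
compressed = codecs.encode(data, "zlib_codec")

# Decoding with the same codec reverses the transformation.
restored = codecs.decode(compressed, "zlib_codec")
from_hex = codecs.decode(as_hex, "hex_codec")
```

The base64 result ends with a newline, as is usual for MIME base64 output.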
|  | 1152 |  | 
|  | 1153 |  | 
|  | 1154 | :mod:`encodings.idna` --- Internationalized Domain Names in Applications | 
|  | 1155 | ------------------------------------------------------------------------ | 
|  | 1156 |  | 
|  | 1157 | .. module:: encodings.idna | 
|  | 1158 | :synopsis: Internationalized Domain Names implementation | 
|  | 1159 | .. moduleauthor:: Martin v. Löwis | 
|  | 1160 |  | 
|  | 1161 | .. versionadded:: 2.3 | 
|  | 1162 |  | 
|  | 1163 | This module implements :rfc:`3490` (Internationalized Domain Names in | 
|  | 1164 | Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for | 
|  | 1165 | Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding | 
|  | 1166 | and :mod:`stringprep`. | 
|  | 1167 |  | 
|  | 1168 | These RFCs together define a protocol to support non-ASCII characters in domain | 
|  | 1169 | names. A domain name containing non-ASCII characters (such as | 
|  | 1170 | ``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding | 
|  | 1171 | (ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain | 
|  | 1172 | name is then used in all places where arbitrary characters are not allowed by | 
|  | 1173 | the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so | 
|  | 1174 | on. This conversion is carried out in the application, if possible invisibly to | 
|  | 1175 | the user: the application should transparently convert Unicode domain labels to | 
|  | 1176 | IDNA on the wire, and convert ACE labels back to Unicode before presenting them | 
|  | 1177 | to the user. | 
|  | 1178 |  | 
|  | 1179 | Python supports this conversion in several ways: the ``idna`` codec converts | 
|  | 1180 | between Unicode and ACE. Furthermore, the :mod:`socket` module | 
|  | 1181 | transparently converts Unicode host names to ACE, so that applications need not | 
|  | 1182 | be concerned about converting host names themselves when they pass them to the | 
|  | 1183 | socket module. On top of that, modules that have host names as function | 
|  | 1184 | parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names | 
|  | 1185 | (:mod:`httplib` then also transparently sends an IDNA host name in the | 
|  | 1186 | :mailheader:`Host` field if it sends that field at all). | 
|  | 1187 |  | 
|  | 1188 | When receiving host names from the wire (such as in reverse name lookup), no | 
|  | 1189 | automatic conversion to Unicode is performed: applications wishing to present | 
|  | 1190 | such host names to the user should decode them to Unicode. | 
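The conversion in both directions can be sketched with the ``idna`` codec, using the example domain name from the text above. The non-ASCII character is written as an escape so that the snippet does not depend on the source file encoding; the variable names are illustrative only.

```python
# Domain name from the text above, with the non-ASCII character escaped.
name = u"www.Alliancefran\xe7aise.nu"

# Unicode -> ACE, for example before a DNS query.
ace = name.encode("idna")

# ACE -> Unicode, for example before showing a received host name to the user.
shown = ace.decode("idna")
```

The round trip need not reproduce the original spelling: nameprep folds case, so ``shown`` is the lower-case form of ``name``.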
|  | 1191 |  | 
|  | 1192 | The module :mod:`encodings.idna` also implements the nameprep procedure, which | 
|  | 1193 | performs certain normalizations on host names, to achieve case-insensitivity of | 
|  | 1194 | international domain names, and to unify similar characters. The nameprep | 
|  | 1195 | functions can be used directly if desired. | 
|  | 1196 |  | 
|  | 1197 |  | 
|  | 1198 | .. function:: nameprep(label) | 
|  | 1199 |  | 
|  | 1200 | Return the nameprepped version of *label*. The implementation currently assumes | 
|  | 1201 | query strings, so ``AllowUnassigned`` is true. | 
|  | 1202 |  | 
|  | 1203 |  | 
|  | 1204 | .. function:: ToASCII(label) | 
|  | 1205 |  | 
|  | 1206 | Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is | 
|  | 1207 | assumed to be false. | 
|  | 1208 |  | 
|  | 1209 |  | 
|  | 1210 | .. function:: ToUnicode(label) | 
|  | 1211 |  | 
|  | 1212 | Convert a label to Unicode, as specified in :rfc:`3490`. | 
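A minimal sketch of calling the three functions directly on a single label (one component of a domain name, without dots). The label is taken from the example domain name used earlier in this section.

```python
from encodings import idna

label = u"Alliancefran\xe7aise"

prepped = idna.nameprep(label)      # case-folded, normalized label
ace_label = idna.ToASCII(label)     # ACE form of the single label
back = idna.ToUnicode(ace_label)    # Unicode form recovered from the ACE label
```

These functions operate on one label at a time; for complete host names, the ``idna`` codec handles the splitting into labels.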
|  | 1213 |  | 
|  | 1214 |  | 
|  | 1215 | :mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature | 
|  | 1216 | ------------------------------------------------------------- | 
|  | 1217 |  | 
|  | 1218 | .. module:: encodings.utf_8_sig | 
|  | 1219 | :synopsis: UTF-8 codec with BOM signature | 
|  | 1220 | .. moduleauthor:: Walter Dörwald | 
|  | 1221 |  | 
|  | 1222 | .. versionadded:: 2.5 | 
|  | 1223 |  | 
|  | 1224 | This module implements a variant of the UTF-8 codec: on encoding, a UTF-8 encoded | 
|  | 1225 | BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder, this | 
|  | 1226 | is done only once (on the first write to the byte stream). On decoding, an | 
|  | 1227 | optional UTF-8 encoded BOM at the start of the data is skipped. | 
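The behaviour described above can be sketched as follows. The sample strings are arbitrary, and the incremental encoder obtained through ``codecs.getincrementalencoder`` stands in for the stateful encoder.

```python
import codecs

text = u"abc"

# Stateless encoding prepends the UTF-8 encoded BOM.
encoded = text.encode("utf_8_sig")

# Decoding skips a leading BOM if present and accepts data without one.
decoded_with_bom = encoded.decode("utf_8_sig")
decoded_without_bom = b"abc".decode("utf_8_sig")

# The incremental (stateful) encoder writes the BOM only once.
encoder = codecs.getincrementalencoder("utf_8_sig")()
first_chunk = encoder.encode(u"a")
second_chunk = encoder.encode(u"b")
```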
|  | 1228 |  |