:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process.

It defines the following functions:


.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
   :ref:`codec-objects`). The functions/methods are expected to work in a
   stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

      ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

      ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.

   Possible values for *errors* are

   * ``'strict'``: raise an exception in case of an encoding or decoding error
   * ``'replace'``: replace malformed data with a suitable replacement marker,
     such as ``'?'`` or ``'\ufffd'``
   * ``'ignore'``: ignore malformed data and continue without further notice
   * ``'xmlcharrefreplace'``: replace with the appropriate XML character
     reference (for encoding only)
   * ``'backslashreplace'``: replace with backslashed escape sequences (for
     encoding only)
   * ``'surrogateescape'``: replace with surrogate U+DCxx, see :pep:`383`

   as well as any other error handling name defined via :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.

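A minimal sketch of a search function, as described for :func:`register` above. The codec name ``demoalias`` and the function ``find_demo_codec`` are hypothetical and exist only for this illustration; the sketch assumes it is acceptable to borrow every component from the built-in UTF-8 codec.

```python
import codecs


def find_demo_codec(encoding_name):
    """Hypothetical search function: resolves only the made-up name 'demoalias'."""
    if encoding_name != "demoalias":
        # Unknown encoding: return None so that other search functions are tried.
        return None
    utf8 = codecs.lookup("utf-8")
    # Assumption for this sketch: reuse all components of the UTF-8 codec.
    return codecs.CodecInfo(
        name="demoalias",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamwriter=utf8.streamwriter,
        streamreader=utf8.streamreader,
    )


codecs.register(find_demo_codec)

# The search function receives the name in lower case.
demo_info = codecs.lookup("DemoAlias")
demo_bytes = "h\u00e9llo".encode("demoalias")
```

Once registered, the made-up name can be used wherever an encoding name is accepted, for example in ``str.encode``.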

.. function:: lookup(encoding)

   Look up the codec info in the Python codec registry and return a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

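The lookup behaviour described above can be sketched as follows. The name ``no-such-encoding-demo`` is a hypothetical encoding that is assumed not to be registered.

```python
import codecs

info = codecs.lookup("utf-8")

# The stateless functions return a tuple (output object, length consumed).
encoded, consumed = info.encode("abc")

try:
    codecs.lookup("no-such-encoding-demo")  # hypothetical, unregistered name
    lookup_failed = False
except LookupError:
    lookup_failed = True
```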
To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

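A short sketch of the helper functions above, assuming an in-memory ``io.BytesIO`` object stands in for a real binary stream.

```python
import codecs
import io

encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# Both functions return (output object, length consumed).
raw, chars_consumed = encode("na\u00efve")
text, bytes_consumed = decode(raw)

# The stream writer and reader factories take the underlying binary stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("na\u00efve")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```

The encoder reports the number of characters it consumed, while the decoder reports the number of bytes, so the two counts differ for non-ASCII text.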

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the *errors* parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception, or return a tuple with
   a replacement for the unencodable part of the input and a position where
   encoding should continue. The encoder will encode the replacement and continue
   encoding the original input at the specified position. Negative position values
   will be treated as being relative to the end of the input string. If the
   resulting position is out of bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.

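A minimal sketch of a custom error handler for encoding. The handler name ``demo_underscore`` and the function ``underscore_errors`` are hypothetical; the sketch assumes that replacing each unencodable character with an underscore is the desired policy.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: substitute '_' for every unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        replacement = "_" * (exc.end - exc.start)
        # Return the replacement and the position where encoding resumes.
        return replacement, exc.end
    # This sketch does not handle decoding or translating errors.
    raise exc


codecs.register_error("demo_underscore", underscore_errors)

result = "caf\u00e9 \u2603".encode("ascii", errors="demo_underscore")
```

The handler reads the failing span from the ``start`` and ``end`` attributes of the exception, which is the location information mentioned above.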

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling: each encoding or decoding error
   raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling: malformed data is replaced with a
   suitable replacement character such as ``'?'`` in bytestrings and
   ``'\ufffd'`` in Unicode strings.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling (for encoding only): the
   unencodable character is replaced by a backslashed escape sequence.

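The effect of the predefined handlers can be compared by encoding the same text to ASCII with each of them. This is an illustrative sketch; the handler name ``no_such_handler_demo`` is hypothetical and assumed not to be registered.

```python
import codecs

text = "\u20ac5"  # EURO SIGN followed by the digit 5

replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# lookup_error() returns the handler registered under a name.
handler = codecs.lookup_error("replace")

try:
    codecs.lookup_error("no_such_handler_demo")  # hypothetical, unregistered
    handler_missing = False
except LookupError:
    handler_missing = True
```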
To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename[, mode[, encoding[, errors[, buffering]]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding.  The default file mode is ``'r'``,
   meaning to open the file in read mode.

   .. note::

      The wrapped version's methods will accept and return strings only.  Bytes
      arguments will be rejected.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified.  This is done to avoid data loss due to encodings using 8-bit
      values.  This means that no automatic conversion of ``b'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function.  It
   defaults to line buffered.

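A small sketch of writing and reading an encoded file with :func:`open` from this module. The file name ``demo.txt`` is hypothetical, and a temporary directory is assumed so that the example cleans up after itself.

```python
import codecs
import os
import tempfile

text = "first line\nsecond line: \u00e4\u00f6\u00fc\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    with codecs.open(path, "w", encoding="utf-8") as writer:
        writer.write(text)

    with codecs.open(path, "r", encoding="utf-8") as reader:
        read_back = reader.read()

    # Inspect the raw bytes to confirm that "\n" was stored unchanged.
    with open(path, "rb") as raw_file:
        raw_bytes = raw_file.read()
```

Because the file is opened in binary mode underneath, the bytes on disk are exactly the encoded form of the text, without any newline conversion.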

.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a wrapped version of *file* which provides transparent encoding
   translation.

   Bytes written to the wrapped file are interpreted according to the given
   *data_encoding* and then written to the original file as bytes using the
   *file_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.

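The translation performed by :func:`EncodedFile` can be sketched with in-memory streams. The read direction is included on the assumption that it mirrors the write direction described above.

```python
import codecs
import io

raw_file = io.BytesIO()
wrapped = codecs.EncodedFile(raw_file, data_encoding="latin-1", file_encoding="utf-8")

# The caller hands over Latin-1 bytes; the underlying file receives UTF-8.
wrapped.write("\u00e9".encode("latin-1"))
stored = raw_file.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 for the caller.
reading = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8")
returned = reading.read()
```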

.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.  *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.  *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.

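A brief sketch of both generators. The input chunks are made up for illustration; the second part deliberately splits a multi-byte UTF-8 sequence across two chunks.

```python
import codecs

chunks = ["Gr\u00fc\u00dfe, ", "Welt"]
encoded = b"".join(codecs.iterencode(chunks, "utf-8"))

# A three-byte sequence split across two chunks is still decoded correctly,
# because the incremental decoder keeps the incomplete bytes between calls.
euro = "\u20ac".encode("utf-8")
pieces = [euro[:1], euro[1:]]
decoded = "".join(codecs.iterdecode(pieces, "utf-8"))
```

Decoding each chunk separately with a stateless decoder would fail on the first piece, which is the case the incremental decoder is meant to cover.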

The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.

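One possible use of these constants is to guess the encoding of a byte string from its leading BOM. The helper ``detect_bom`` below is hypothetical and not part of the module; it is a sketch that only recognizes the marks listed above.

```python
import codecs
import sys

# BOM_UTF16 matches the byte order of the platform running the code.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE


def detect_bom(data):
    """Hypothetical helper: return (codec name, BOM length), or (None, 0)."""
    # UTF-32 is tested before UTF-16 because BOM_UTF32_LE starts with BOM_UTF16_LE.
    candidates = [
        ("utf-8", codecs.BOM_UTF8),
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    ]
    for name, bom in candidates:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF8 + "hi".encode("utf-8")
name, bom_length = detect_bom(sample)
decoded = sample[bom_length:].decode(name)
```

The order of the candidates matters: checking the UTF-16 marks first would misread little-endian UTF-32 data.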

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in
Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`encode` and
:meth:`decode` methods may implement different error handling schemes by
providing the *errors* string argument.  The following string values are defined
and implemented by all standard Python codecs:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | Replace byte with surrogate U+DCxx, as defined|
|                         | in :pep:`383`.                                |
+-------------------------+-----------------------------------------------+

In addition, the following error handlers are specific to a single codec:

+-------------------+---------+-------------------------------------------+
| Value             | Codec   | Meaning                                   |
+===================+=========+===========================================+
|``'surrogatepass'``| utf-8   | Allow encoding and decoding of surrogate  |
|                   |         | codes in UTF-8.                           |
+-------------------+---------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

The set of allowed values can be extended via :func:`register_error`.

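The two surrogate-related handlers from the tables above can be sketched as follows. The sample byte string is made up for illustration.

```python
raw = b"abc\xff"  # 0xff can never occur in valid UTF-8

# 'surrogateescape' maps the undecodable byte to a surrogate in U+DC80..U+DCFF
# on decoding and turns it back into the original byte on encoding.
text = raw.decode("utf-8", "surrogateescape")
restored = text.encode("utf-8", "surrogateescape")

# 'surrogatepass' lets the UTF-8 codec encode and decode a lone surrogate.
lone_surrogate = "\ud800"
passed = lone_surrogate.encode("utf-8", "surrogatepass")
round_trip = passed.decode("utf-8", "surrogatepass")

try:
    lone_surrogate.encode("utf-8")  # default 'strict' handling
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

With ``'surrogateescape'`` the original bytes survive a decode and encode round trip even though they are not valid UTF-8.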

.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   Encoding converts a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed).  Decoding converts a bytes object encoded using a particular
   character set encoding to a string object.

   *input* must be a bytes object or one which provides the read-only character
   buffer interface -- for example, buffer objects and memory mapped files.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

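A minimal sketch of a stateless codec class. ``DemoLatin1Codec`` is a hypothetical name; the sketch assumes that delegating to the ``latin_1_encode`` and ``latin_1_decode`` helpers, which the standard library's own Latin-1 codec module uses, is sufficient for illustration.

```python
import codecs


class DemoLatin1Codec(codecs.Codec):
    """Hypothetical stateless codec that delegates to the Latin-1 helpers."""

    def encode(self, input, errors="strict"):
        # Returns (bytes object, number of characters consumed).
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        # Returns (string object, number of bytes consumed).
        return codecs.latin_1_decode(input, errors)


codec = DemoLatin1Codec()
encoded, chars_consumed = codec.encode("caf\u00e9")
decoded, bytes_consumed = codec.decode(encoded)

# Zero length input yields an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```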
The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the :meth:`encode`/:meth:`decode` method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of the
encoding/decoding process between method calls.

The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state.


.. method:: IncrementalEncoder.getstate()

   Return the current state of the encoder, which must be an integer. The
   implementation should make sure that ``0`` is the most common state. (States
   that are more complicated than integers can be converted into an integer by
   marshaling/pickling the state and encoding the bytes of the resulting string
   into an integer.)


.. method:: IncrementalEncoder.setstate(state)

   Set the state of the encoder to *state*. *state* must be an encoder state
   returned by :meth:`getstate`.

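A short sketch of incremental encoding with the UTF-16 codec, chosen here because its encoder has visible state: the byte order mark is written only once. The state values themselves are treated as opaque in this sketch.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

initial_state = encoder.getstate()
first = encoder.encode("a")               # starts with the byte order mark
second = encoder.encode("b", final=True)  # no second byte order mark
joined = first + second

# Restoring the saved state makes the encoder emit the mark again.
encoder.setstate(initial_state)
again = encoder.encode("a")
```

The joined output equals what the stateless encoder produces for the whole string, as described above for incremental codecs.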
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 478 |  | 
 | 479 | .. _incremental-decoder-objects: | 
 | 480 |  | 
 | 481 | IncrementalDecoder Objects | 
 | 482 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
 | 483 |  | 
 | 484 | The :class:`IncrementalDecoder` class is used for decoding an input in multiple | 
 | 485 | steps. It defines the following methods which every incremental decoder must | 
 | 486 | define in order to be compatible with the Python codec registry. | 
 | 487 |  | 
 | 488 |  | 
 | 489 | .. class:: IncrementalDecoder([errors]) | 
 | 490 |  | 
 | 491 |    Constructor for an :class:`IncrementalDecoder` instance. | 
 | 492 |  | 
 | 493 |    All incremental decoders must provide this constructor interface. They are free | 
 | 494 |    to add additional keyword arguments, but only the ones defined here are used by | 
 | 495 |    the Python codec registry. | 
 | 496 |  | 
 | 497 |    The :class:`IncrementalDecoder` may implement different error handling schemes | 
 | 498 |    by providing the *errors* keyword argument. These parameters are predefined: | 
 | 499 |  | 
 | 500 |    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
 | 501 |  | 
 | 502 |    * ``'ignore'`` Ignore the character and continue with the next. | 
 | 503 |  | 
 | 504 |    * ``'replace'`` Replace with a suitable replacement character. | 
 | 505 |  | 
 | 506 |    The *errors* argument will be assigned to an attribute of the same name. | 
 | 507 |    Assigning to this attribute makes it possible to switch between different error | 
| Benjamin Peterson | 3e4f055 | 2008-09-02 00:31:15 +0000 | [diff] [blame] | 508 |    handling strategies during the lifetime of the :class:`IncrementalDecoder` | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 509 |    object. | 
 | 510 |  | 
 | 511 |    The set of allowed values for the *errors* argument can be extended with | 
 | 512 |    :func:`register_error`. | 
 | 513 |  | 
 | 514 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 515 |    .. method:: decode(object[, final]) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 516 |  | 
      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true, the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input), it must initiate error handling just like in the
      stateless case (which might raise an exception).
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 524 |  | 
 | 525 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 526 |    .. method:: reset() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 527 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 528 |       Reset the decoder to the initial state. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 529 |  | 
 | 530 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 531 |    .. method:: getstate() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 532 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 533 |       Return the current state of the decoder. This must be a tuple with two | 
 | 534 |       items, the first must be the buffer containing the still undecoded | 
 | 535 |       input. The second must be an integer and can be additional state | 
 | 536 |       info. (The implementation should make sure that ``0`` is the most common | 
 | 537 |       additional state info.) If this additional state info is ``0`` it must be | 
 | 538 |       possible to set the decoder to the state which has no input buffered and | 
 | 539 |       ``0`` as the additional state info, so that feeding the previously | 
 | 540 |       buffered input to the decoder returns it to the previous state without | 
 | 541 |       producing any output. (Additional state info that is more complicated than | 
 | 542 |       integers can be converted into an integer by marshaling/pickling the info | 
 | 543 |       and encoding the bytes of the resulting string into an integer.) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 544 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 545 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 546 |    .. method:: setstate(state) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 547 |  | 
      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.
 | 550 |  | 
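As a minimal sketch of the decoder interface described above, the following
feeds a three-byte UTF-8 sequence to an incremental decoder in two chunks and
then restores a saved state. It assumes the built-in UTF-8 codec; the variable
names are illustrative only.

```python
import codecs

# Obtain the incremental decoder class for UTF-8 from the codec registry.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

data = "\u20ac".encode("utf-8")        # EURO SIGN, three bytes in UTF-8

first = decoder.decode(data[:2])       # incomplete sequence: nothing is returned yet
state = decoder.getstate()             # (still undecoded input, additional state info)
second = decoder.decode(data[2:], final=True)

# A saved state can be restored later and decoding resumed from there.
decoder.reset()
decoder.setstate(state)
resumed = decoder.decode(data[2:], final=True)

# With final=True an incomplete sequence triggers error handling.
decoder.reset()
try:
    decoder.decode(data[:1], final=True)
    truncated_error = None
except UnicodeDecodeError as exc:
    truncated_error = exc.reason
```

For this codec the buffered bytes are the first item of the state tuple, so
``state[0]`` equals ``data[:2]``.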
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 551 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 552 | The :class:`StreamWriter` and :class:`StreamReader` classes provide generic | 
 | 553 | working interfaces which can be used to implement new encoding submodules very | 
 | 554 | easily. See :mod:`encodings.utf_8` for an example of how this is done. | 
 | 555 |  | 
 | 556 |  | 
 | 557 | .. _stream-writer-objects: | 
 | 558 |  | 
 | 559 | StreamWriter Objects | 
 | 560 | ^^^^^^^^^^^^^^^^^^^^ | 
 | 561 |  | 
 | 562 | The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the | 
 | 563 | following methods which every stream writer must define in order to be | 
 | 564 | compatible with the Python codec registry. | 
 | 565 |  | 
 | 566 |  | 
 | 567 | .. class:: StreamWriter(stream[, errors]) | 
 | 568 |  | 
 | 569 |    Constructor for a :class:`StreamWriter` instance. | 
 | 570 |  | 
 | 571 |    All stream writers must provide this constructor interface. They are free to add | 
 | 572 |    additional keyword arguments, but only the ones defined here are used by the | 
 | 573 |    Python codec registry. | 
 | 574 |  | 
 | 575 |    *stream* must be a file-like object open for writing binary data. | 
 | 576 |  | 
 | 577 |    The :class:`StreamWriter` may implement different error handling schemes by | 
 | 578 |    providing the *errors* keyword argument. These parameters are predefined: | 
 | 579 |  | 
 | 580 |    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
 | 581 |  | 
 | 582 |    * ``'ignore'`` Ignore the character and continue with the next. | 
 | 583 |  | 
   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
 | 587 |  | 
 | 588 |    * ``'backslashreplace'`` Replace with backslashed escape sequences. | 
 | 589 |  | 
 | 590 |    The *errors* argument will be assigned to an attribute of the same name. | 
 | 591 |    Assigning to this attribute makes it possible to switch between different error | 
 | 592 |    handling strategies during the lifetime of the :class:`StreamWriter` object. | 
 | 593 |  | 
 | 594 |    The set of allowed values for the *errors* argument can be extended with | 
 | 595 |    :func:`register_error`. | 
 | 596 |  | 
 | 597 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 598 |    .. method:: write(object) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 599 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 600 |       Writes the object's contents encoded to the stream. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 601 |  | 
 | 602 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 603 |    .. method:: writelines(list) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 604 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 605 |       Writes the concatenated list of strings to the stream (possibly by reusing | 
 | 606 |       the :meth:`write` method). | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 607 |  | 
 | 608 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 609 |    .. method:: reset() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 610 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 611 |       Flushes and resets the codec buffers used for keeping state. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 612 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 613 |       Calling this method should ensure that the data on the output is put into | 
 | 614 |       a clean state that allows appending of new fresh data without having to | 
 | 615 |       rescan the whole stream to recover state. | 
 | 616 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 617 |  | 
 | 618 | In addition to the above methods, the :class:`StreamWriter` must also inherit | 
 | 619 | all other methods and attributes from the underlying stream. | 
 | 620 |  | 
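The following sketch shows one way the :class:`StreamWriter` interface might be
used, including switching the error handling strategy by assigning to the
``errors`` attribute. It assumes the built-in ASCII codec and an in-memory
:class:`io.BytesIO` object standing in for a binary file.

```python
import codecs
import io

raw = io.BytesIO()                                   # file-like object for binary data
writer = codecs.getwriter("ascii")(raw, errors="xmlcharrefreplace")

writer.write("caf\u00e9 ")                           # U+00E9 is not ASCII
writer.errors = "backslashreplace"                   # switch strategy mid-stream
writer.writelines(["na\u00efve", "\n"])              # concatenated, then written
writer.reset()

encoded = raw.getvalue()
```

The first write uses an XML character reference for the unencodable character,
the second a backslashed escape sequence.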
 | 621 |  | 
 | 622 | .. _stream-reader-objects: | 
 | 623 |  | 
 | 624 | StreamReader Objects | 
 | 625 | ^^^^^^^^^^^^^^^^^^^^ | 
 | 626 |  | 
 | 627 | The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the | 
 | 628 | following methods which every stream reader must define in order to be | 
 | 629 | compatible with the Python codec registry. | 
 | 630 |  | 
 | 631 |  | 
 | 632 | .. class:: StreamReader(stream[, errors]) | 
 | 633 |  | 
 | 634 |    Constructor for a :class:`StreamReader` instance. | 
 | 635 |  | 
 | 636 |    All stream readers must provide this constructor interface. They are free to add | 
 | 637 |    additional keyword arguments, but only the ones defined here are used by the | 
 | 638 |    Python codec registry. | 
 | 639 |  | 
 | 640 |    *stream* must be a file-like object open for reading (binary) data. | 
 | 641 |  | 
   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:
 | 644 |  | 
 | 645 |    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. | 
 | 646 |  | 
 | 647 |    * ``'ignore'`` Ignore the character and continue with the next. | 
 | 648 |  | 
 | 649 |    * ``'replace'`` Replace with a suitable replacement character. | 
 | 650 |  | 
 | 651 |    The *errors* argument will be assigned to an attribute of the same name. | 
 | 652 |    Assigning to this attribute makes it possible to switch between different error | 
 | 653 |    handling strategies during the lifetime of the :class:`StreamReader` object. | 
 | 654 |  | 
 | 655 |    The set of allowed values for the *errors* argument can be extended with | 
 | 656 |    :func:`register_error`. | 
 | 657 |  | 
 | 658 |  | 
   .. method:: read([size[, chars[, firstline]]])
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 660 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 661 |       Decodes data from the stream and returns the resulting object. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 662 |  | 
      *chars* indicates the number of characters to read from the
      stream. :meth:`read` will never return more than *chars* characters, but
      it might return fewer if there are not enough characters available.
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 666 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 667 |       *size* indicates the approximate maximum number of bytes to read from the | 
 | 668 |       stream for decoding purposes. The decoder can modify this setting as | 
 | 669 |       appropriate. The default value -1 indicates to read and decode as much as | 
 | 670 |       possible.  *size* is intended to prevent having to decode huge files in | 
 | 671 |       one step. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 672 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 673 |       *firstline* indicates that it would be sufficient to only return the first | 
 | 674 |       line, if there are decoding errors on later lines. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 675 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 676 |       The method should use a greedy read strategy meaning that it should read | 
 | 677 |       as much data as is allowed within the definition of the encoding and the | 
 | 678 |       given size, e.g.  if optional encoding endings or state markers are | 
 | 679 |       available on the stream, these should be read too. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 680 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 681 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 682 |    .. method:: readline([size[, keepends]]) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 683 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 684 |       Read one line from the input stream and return the decoded data. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 685 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 686 |       *size*, if given, is passed as size argument to the stream's | 
 | 687 |       :meth:`readline` method. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 688 |  | 
      If *keepends* is false, line endings will be stripped from the lines
      returned.
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 691 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 692 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 693 |    .. method:: readlines([sizehint[, keepends]]) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 694 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 695 |       Read all lines available on the input stream and return them as a list of | 
 | 696 |       lines. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 697 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 698 |       Line-endings are implemented using the codec's decoder method and are | 
 | 699 |       included in the list entries if *keepends* is true. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 700 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 701 |       *sizehint*, if given, is passed as the *size* argument to the stream's | 
 | 702 |       :meth:`read` method. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 703 |  | 
 | 704 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 705 |    .. method:: reset() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 706 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 707 |       Resets the codec buffers used for keeping state. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 708 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 709 |       Note that no stream repositioning should take place.  This method is | 
 | 710 |       primarily intended to be able to recover from decoding errors. | 
 | 711 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 712 |  | 
 | 713 | In addition to the above methods, the :class:`StreamReader` must also inherit | 
 | 714 | all other methods and attributes from the underlying stream. | 
 | 715 |  | 
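A short sketch of the :class:`StreamReader` methods, assuming the built-in UTF-8
codec and an in-memory binary stream (the sample text is made up for
illustration):

```python
import codecs
import io

raw = io.BytesIO("first line\nsecond line: caf\u00e9\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

first_line = reader.readline()                 # line ending is kept by default
stripped = reader.readline(keepends=False)     # line ending is removed
rest = reader.read()                           # nothing is left on the stream
```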
The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.
 | 718 |  | 
 | 719 |  | 
 | 720 | .. _stream-reader-writer: | 
 | 721 |  | 
 | 722 | StreamReaderWriter Objects | 
 | 723 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
 | 724 |  | 
 | 725 | The :class:`StreamReaderWriter` allows wrapping streams which work in both read | 
 | 726 | and write modes. | 
 | 727 |  | 
 | 728 | The design is such that one can use the factory functions returned by the | 
 | 729 | :func:`lookup` function to construct the instance. | 
 | 730 |  | 
 | 731 |  | 
 | 732 | .. class:: StreamReaderWriter(stream, Reader, Writer, errors) | 
 | 733 |  | 
   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.
 | 738 |  | 
 | 739 | :class:`StreamReaderWriter` instances define the combined interfaces of | 
 | 740 | :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other | 
 | 741 | methods and attributes from the underlying stream. | 
 | 742 |  | 
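As an illustration, a :class:`StreamReaderWriter` could be built from the
factories that :func:`lookup` returns. This is a sketch under the assumption
that an in-memory :class:`io.BytesIO` object is an acceptable stand-in for a
stream opened for both reading and writing.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# Reader and Writer are taken from the CodecInfo object returned by lookup().
stream = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter, "strict")

stream.write("\u03b1\u03b2\u03b3\n")   # text in, UTF-8 bytes stored in raw
stream.seek(0)                         # seek() is passed on to the underlying stream
text = stream.read()                   # UTF-8 bytes decoded back to text
stored = raw.getvalue()
```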
 | 743 |  | 
 | 744 | .. _stream-recoder-objects: | 
 | 745 |  | 
 | 746 | StreamRecoder Objects | 
 | 747 | ^^^^^^^^^^^^^^^^^^^^^ | 
 | 748 |  | 
The :class:`StreamRecoder` provides a frontend-backend view of encoding data
which is sometimes useful when dealing with different encoding environments.
 | 751 |  | 
 | 752 | The design is such that one can use the factory functions returned by the | 
 | 753 | :func:`lookup` function to construct the instance. | 
 | 754 |  | 
 | 755 |  | 
 | 756 | .. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors) | 
 | 757 |  | 
   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data returned by :meth:`read`
   and passed to :meth:`write`) while *Reader* and *Writer* work on the backend
   (reading from and writing to the stream).
 | 762 |  | 
 | 763 |    You can use these objects to do transparent direct recodings from e.g. Latin-1 | 
 | 764 |    to UTF-8 and back. | 
 | 765 |  | 
 | 766 |    *stream* must be a file-like object. | 
 | 767 |  | 
 | 768 |    *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*, | 
 | 769 |    *Writer* must be factory functions or classes providing objects of the | 
 | 770 |    :class:`StreamReader` and :class:`StreamWriter` interface respectively. | 
 | 771 |  | 
 | 772 |    *encode* and *decode* are needed for the frontend translation, *Reader* and | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 773 |    *Writer* for the backend translation. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 774 |  | 
 | 775 |    Error handling is done in the same way as defined for the stream readers and | 
 | 776 |    writers. | 
 | 777 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 778 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 779 | :class:`StreamRecoder` instances define the combined interfaces of | 
 | 780 | :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other | 
 | 781 | methods and attributes from the underlying stream. | 
 | 782 |  | 
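The Latin-1 to UTF-8 recoding mentioned above might look like the following
sketch. It assumes that the frontend (what the caller reads and writes) is
Latin-1 and that the backend stream stores UTF-8; the variable names are
illustrative.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")      # frontend: bytes seen by the caller
utf8 = codecs.lookup("utf-8")          # backend: bytes stored in the stream

raw = io.BytesIO()
recoder = codecs.StreamRecoder(
    raw, latin1.encode, latin1.decode, utf8.streamreader, utf8.streamwriter, "strict"
)

recoder.write(b"caf\xe9")              # Latin-1 bytes are written ...
stored = raw.getvalue()                # ... and end up as UTF-8 in the stream

# Reading through a fresh recoder converts the stored UTF-8 back to Latin-1.
raw.seek(0)
round_trip = codecs.StreamRecoder(
    raw, latin1.encode, latin1.decode, utf8.streamreader, utf8.streamwriter, "strict"
).read()
```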
 | 783 |  | 
 | 784 | .. _encodings-overview: | 
 | 785 |  | 
 | 786 | Encodings and Unicode | 
 | 787 | --------------------- | 
 | 788 |  | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 789 | Strings are stored internally as sequences of codepoints (to be precise | 
| Georg Brandl | 60203b4 | 2010-10-06 10:11:56 +0000 | [diff] [blame] | 790 | as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either | 
| Georg Brandl | 52d168a | 2008-01-07 18:10:24 +0000 | [diff] [blame] | 791 | via :option:`--without-wide-unicode` or :option:`--with-wide-unicode`, with the | 
| Georg Brandl | 60203b4 | 2010-10-06 10:11:56 +0000 | [diff] [blame] | 792 | former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 793 | type. Once a string object is used outside of CPU and memory, CPU endianness | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 794 | and how these arrays are stored as bytes become an issue.  Transforming a | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 795 | string object into a sequence of bytes is called encoding and recreating the | 
 | 796 | string object from the sequence of bytes is known as decoding.  There are many | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 797 | different methods for how this transformation can be done (these methods are | 
 | 798 | also called encodings). The simplest method is to map the codepoints 0-255 to | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 799 | the bytes ``0x0``-``0xff``. This means that a string object that contains | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 800 | codepoints above ``U+00FF`` can't be encoded with this method (which is called | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 801 | ``'latin-1'`` or ``'iso-8859-1'``). :func:`str.encode` will raise a | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 802 | :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1' | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 803 | codec can't encode character '\u1234' in position 3: ordinal not in | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 804 | range(256)``. | 
 | 805 |  | 
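The behaviour of ``'latin-1'`` described above can be checked with a small
sketch like this one (the sample strings are arbitrary):

```python
# Codepoints up to U+00FF map directly to the byte with the same value.
encoded = "caf\u00e9".encode("latin-1")

# A codepoint above U+00FF cannot be encoded with this method.
text = "abc\u1234"
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    message = str(exc)
    position = exc.start               # index of the offending character
```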
 | 806 | There's another group of encodings (the so called charmap encodings) that choose | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 807 | a different subset of all Unicode code points and how these codepoints are | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 808 | mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open | 
 | 809 | e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on | 
 | 810 | Windows). There's a string constant with 256 characters that shows you which | 
 | 811 | character is mapped to which byte value. | 
 | 812 |  | 
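To illustrate how charmap encodings assign different characters to the same
byte values, the following sketch decodes the same bytes with three codecs that
ship with Python:

```python
# In cp1252 the EURO SIGN occupies byte 0x80 ...
euro_byte = "\u20ac".encode("cp1252")

# ... whereas latin-1 maps the same byte to the control character U+0080.
latin1_char = b"\x80".decode("latin-1")

# Byte 0xE9 is "e with acute" in cp1252 but GREEK CAPITAL LETTER THETA in cp437.
cp1252_char = b"\xe9".decode("cp1252")
cp437_char = b"\xe9".decode("cp437")
```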
All of these encodings can only encode 256 of the 65536 (or 1114112) codepoints
defined in Unicode. A simple and straightforward way that can store each Unicode
code point is to store each codepoint as two consecutive bytes (codepoints above
``U+FFFF`` do not fit into two bytes and are stored as a surrogate pair, i.e.
four bytes). There are two possibilities: Store the bytes in big endian or in
little endian order. These two encodings are called UTF-16-BE and UTF-16-LE
respectively. Their disadvantage is that if e.g. you use UTF-16-BE on a little
endian machine you will always have to swap bytes on encoding and decoding.
UTF-16 avoids this problem: Bytes will always be in natural endianness. When
these bytes are read by a CPU with a different endianness, then bytes have to be
swapped though. To be able to detect the endianness of a UTF-16 byte sequence,
there's the so called BOM (the "Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character will be prepended to every UTF-16 byte sequence. The
byte swapped version of this character (``0xFFFE``) is an illegal character that
may not appear in a Unicode text. So when the first character in a UTF-16 byte
sequence appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.

Unfortunately up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
 | 837 |  | 
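The role of the BOM can be observed with the built-in UTF-16 codecs. This is a
small sketch; which BOM ``'utf-16'`` writes depends on the byte order of the
machine running it, so the example only checks that one of the two is present.

```python
import codecs

big = "A".encode("utf-16-be")          # explicit byte order, no BOM
little = "A".encode("utf-16-le")
with_bom = "A".encode("utf-16")        # BOM in native byte order, then the data

has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# The 'utf-16' decoder uses the BOM to pick the byte order and then drops it.
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# With an explicit byte order, U+FEFF is decoded like any other character.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")
```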
There's another encoding that is able to encode the full range of Unicode
 | 839 | characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues | 
 | 840 | with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two | 
 | 841 | parts: Marker bits (the most significant bits) and payload bits. The marker bits | 
 | 842 | are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are | 
 | 843 | encoded like this (with x being payload bits, which when concatenated give the | 
 | 844 | Unicode character): | 
 | 845 |  | 
 | 846 | +-----------------------------------+----------------------------------------------+ | 
 | 847 | | Range                             | Encoding                                     | | 
 | 848 | +===================================+==============================================+ | 
 | 849 | | ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     | | 
 | 850 | +-----------------------------------+----------------------------------------------+ | 
 | 851 | | ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            | | 
 | 852 | +-----------------------------------+----------------------------------------------+ | 
 | 853 | | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   | | 
 | 854 | +-----------------------------------+----------------------------------------------+ | 
 | 855 | | ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          | | 
 | 856 | +-----------------------------------+----------------------------------------------+ | 
 | 857 | | ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | | 
 | 858 | +-----------------------------------+----------------------------------------------+ | 
 | 859 | | ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | | 
 | 860 | |                                   | 10xxxxxx                                     | | 
 | 861 | +-----------------------------------+----------------------------------------------+ | 
 | 862 |  | 
 | 863 | The least significant bit of the Unicode character is the rightmost x bit. | 
 | 864 |  | 
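The marker and payload bits in the table can be made visible by printing the
encoded bytes in binary. ``utf8_bits`` is a hypothetical helper written for this
illustration, not part of the :mod:`codecs` module.

```python
def utf8_bits(char):
    """Return the UTF-8 encoding of *char* as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in char.encode("utf-8"))


one_byte = utf8_bits("A")              # U+0041
two_bytes = utf8_bits("\u00e9")        # U+00E9
three_bytes = utf8_bits("\u20ac")      # U+20AC
four_bytes = utf8_bits("\U0001d11e")   # U+1D11E
```

Each first byte starts with as many 1 bits as the sequence has bytes (none for
a single byte), and every continuation byte starts with ``10``.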
 | 865 | As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 866 | the decoded string (even if it's the first character) is treated as a ``ZERO | 
 | 867 | WIDTH NO-BREAK SPACE``. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 868 |  | 
 | 869 | Without external information it's impossible to reliably determine which | 
| Georg Brandl | 30c78d6 | 2008-05-11 14:52:00 +0000 | [diff] [blame] | 870 | encoding was used for encoding a string. Each charmap encoding can | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 871 | decode any random byte sequence. However that's not possible with UTF-8, as | 
 | 872 | UTF-8 byte sequences have a structure that doesn't allow arbitrary byte | 
| Thomas Wouters | 89d996e | 2007-09-08 17:39:28 +0000 | [diff] [blame] | 873 | sequences. To increase the reliability with which a UTF-8 encoding can be | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 874 | detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls | 
 | 875 | ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters | 
 | 876 | is written to the file, a UTF-8 encoded BOM (which looks like this as a byte | 
 | 877 | sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable | 
 | 878 | that any charmap encoded file starts with these byte values (which would e.g. | 
 | 879 | map to | 
 | 880 |  | 
 | 881 |    | LATIN SMALL LETTER I WITH DIAERESIS | 
 | 882 |    | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK | 
 | 883 |    | INVERTED QUESTION MARK | 
 | 884 |  | 
 | 885 | in iso-8859-1), this increases the probability that a utf-8-sig encoding can be | 
 | 886 | correctly guessed from the byte sequence. So here the BOM is not used to be able | 
 | 887 | to determine the byte order used for generating the byte sequence, but as a | 
 | 888 | signature that helps in guessing the encoding. On encoding the utf-8-sig codec | 
 | 889 | will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On | 
 | 890 | decoding utf-8-sig will skip those three bytes if they appear as the first three | 
 | 891 | bytes in the file. | 
 | 892 |  | 
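The difference between ``'utf-8'`` and ``'utf-8-sig'`` described above can be
seen in a short sketch:

```python
import codecs

encoded = "hello".encode("utf-8-sig")          # signature, then the UTF-8 data
has_signature = encoded[:3] == codecs.BOM_UTF8

plain = encoded.decode("utf-8")                # U+FEFF stays in the string
sig = encoded.decode("utf-8-sig")              # signature is skipped
no_bom = b"hello".decode("utf-8-sig")          # a missing signature is fine too

# The same three bytes interpreted as iso-8859-1.
as_latin1 = codecs.BOM_UTF8.decode("latin-1")
```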
 | 893 |  | 
 | 894 | .. _standard-encodings: | 
 | 895 |  | 
 | 896 | Standard Encodings | 
 | 897 | ------------------ | 
 | 898 |  | 
 | 899 | Python comes with a number of codecs built-in, either implemented as C functions | 
 | 900 | or with dictionaries as mapping tables. The following table lists the codecs by | 
 | 901 | name, together with a few common aliases, and the languages for which the | 
 | 902 | encoding is likely used. Neither the list of aliases nor the list of languages | 
 | 903 | is meant to be exhaustive. Notice that spelling alternatives that only differ in | 
| Georg Brandl | a6053b4 | 2009-09-01 08:11:14 +0000 | [diff] [blame] | 904 | case or use a hyphen instead of an underscore are also valid aliases; therefore, | 
 | 905 | e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 906 |  | 
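Alias handling can be checked through :func:`lookup`, which returns the same
codec for all spellings of a name. The aliases below are only a small sample
chosen for illustration.

```python
import codecs

# Spellings that differ only in case or in hyphen versus underscore,
# plus two registered aliases, all resolve to the same codec.
utf8_names = {codecs.lookup(alias).name for alias in ("utf-8", "UTF_8", "utf8", "U8")}

ascii_names = {codecs.lookup(alias).name for alias in ("ascii", "646", "us-ascii")}
```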
 | 907 | Many of the character sets support the same languages. They vary in individual | 
 | 908 | characters (e.g. whether the EURO SIGN is supported or not), and in the | 
 | 909 | assignment of characters to code positions. For the European languages in | 
 | 910 | particular, the following variants typically exist: | 
 | 911 |  | 
 | 912 | * an ISO 8859 codeset | 
 | 913 |  | 
* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters
 | 916 |  | 
 | 917 | * an IBM EBCDIC code page | 
 | 918 |  | 
 | 919 | * an IBM PC code page, which is ASCII compatible | 
 | 920 |  | 
+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

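A minimal sketch of how the alias column can be checked at run time: each
alias should resolve to the same registered codec as the name in the first
column. The helper ``same_codec`` and the chosen pairs are illustrative, not
part of the module.

```python
import codecs


def same_codec(alias, canonical):
    """Return True if both spellings resolve to the same registered codec."""
    return codecs.lookup(alias).name == codecs.lookup(canonical).name


# (alias, codec) pairs taken from the table above.
pairs = [
    ("latin1", "latin_1"),
    ("U8", "utf_8"),
    ("sjis", "shift_jis"),
    ("windows-1252", "cp1252"),
]
results = {alias: same_codec(alias, canonical) for alias, canonical in pairs}
print(results)

# Either spelling can be passed wherever an encoding name is expected.
alias_bytes = "caf\u00e9".encode("latin1")
canonical_bytes = "caf\u00e9".encode("latin_1")
```

:func:`codecs.lookup` raises :exc:`LookupError` for a name that no search
function recognizes, which makes it a convenient way to validate an encoding
name supplied by a user.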
.. XXX fix here, should be in above table

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| idna               |         | Implements :rfc:`3490`,   |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`     |
+--------------------+---------+---------------------------+
| mbcs               | dbcs    | Windows only: Encode      |
|                    |         | operand according to the  |
|                    |         | ANSI codepage (CP_ACP)    |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5    |
+--------------------+---------+---------------------------+
| punycode           |         | Implements :rfc:`3492`    |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Produce a string that is  |
|                    |         | suitable as raw Unicode   |
|                    |         | literal in Python source  |
|                    |         | code                      |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions. Can be   |
|                    |         | used as the system        |
|                    |         | encoding if no automatic  |
|                    |         | coercion between byte and |
|                    |         | Unicode strings is        |
|                    |         | desired.                  |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Produce a string that is  |
|                    |         | suitable as Unicode       |
|                    |         | literal in Python source  |
|                    |         | code                      |
+--------------------+---------+---------------------------+
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand                   |
+--------------------+---------+---------------------------+


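The following sketch exercises three of the special-purpose codecs from the
table above. It deliberately leaves out ``mbcs`` (Windows only) and the codecs
whose availability may depend on the Python version in use; the sample strings
are assumptions made for the example.

```python
text = "b\u00fccher"  # the German word "bücher"

# punycode: the ASCII-only transfer encoding defined by RFC 3492.
puny = text.encode("punycode")
round_trip = puny.decode("punycode")

# unicode_escape: non-ASCII characters become Python escape sequences.
escaped = text.encode("unicode_escape")

# raw_unicode_escape: only characters above U+00FF are escaped.
raw_escaped = "\u20ac".encode("raw_unicode_escape")

print(puny, escaped, raw_escaped)
```

Note that ``punycode`` works on a single label and adds no ``xn--`` prefix;
that prefix is applied by the ``idna`` codec.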
:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.

Python supports this conversion in several ways. The ``idna`` codec performs
the conversion between Unicode and ACE. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

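A short sketch of both directions, using the domain name from the text above
with the ``idna`` codec. The variable names are illustrative.

```python
# Unicode host name -> ACE form, suitable for DNS queries or Host headers.
name = "www.Alliancefran\u00e7aise.nu"
ace = name.encode("idna")

# ACE form received from the wire -> Unicode, for presentation to the user.
shown = ace.decode("idna")

print(ace)
print(shown)
```

The round trip does not restore the original capitalization, because nameprep
case-folds non-ASCII labels before they are encoded.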
The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.


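The three functions operate on a single label rather than a full host name.
A minimal sketch, assuming the sample label ``Bücher``:

```python
from encodings import idna

label = "B\u00fccher"

# nameprep case-folds and normalizes the label.
prepped = idna.nameprep(label)

# ToASCII applies nameprep, then punycode, and adds the ACE prefix.
ascii_label = idna.ToASCII(label)

# ToUnicode reverses the conversion for a label carrying the ACE prefix.
unicode_label = idna.ToUnicode(ascii_label)

print(prepped, ascii_label, unicode_label)
```

To convert a complete host name, the ``idna`` codec is usually more
convenient, since it splits the name into labels itself.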
:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

Encodes the operand according to the ANSI codepage (CP_ACP). This codec only
supports the ``'strict'`` and ``'replace'`` error handlers for encoding, and
the ``'strict'`` and ``'ignore'`` error handlers for decoding.

Availability: Windows only.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data is skipped.

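The behaviour described above can be sketched as follows, using the
incremental encoder from :mod:`codecs` to stand in for the stateful encoder.
The variable names are illustrative.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Stateless encoding always prepends the BOM.
encoded = "abc".encode("utf_8_sig")

# Decoding accepts data with or without a leading BOM.
with_bom = (BOM + b"abc").decode("utf_8_sig")
without_bom = b"abc".decode("utf_8_sig")

# A stateful (incremental) encoder writes the BOM only once.
encoder = codecs.getincrementalencoder("utf_8_sig")()
first = encoder.encode("a")
second = encoder.encode("b")

print(encoded, first, second)
```

Decoding the same BOM-prefixed bytes with plain ``utf_8`` would instead keep
the BOM as a leading ``'\ufeff'`` character in the result.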