\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec and error handling lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a \class{CodecInfo} object having the following attributes:

\begin{itemize}
  \item \code{name} The name of the encoding;
  \item \code{encode} The stateless encoding function;
  \item \code{decode} The stateless decoding function;
  \item \code{incrementalencoder} An incremental encoder class or factory function;
  \item \code{incrementaldecoder} An incremental decoder class or factory function;
  \item \code{streamwriter} A stream writer class or factory function;
  \item \code{streamreader} A stream reader class or factory function.
\end{itemize}

The various functions or classes take the following arguments:

  \var{encode} and \var{decode}: These must be functions or methods
  which have the same interface as the
  \method{encode()}/\method{decode()} methods of Codec instances (see
  Codec Interface). The functions/methods are expected to work in a
  stateless mode.

  \var{incrementalencoder} and \var{incrementaldecoder}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{IncrementalEncoder} and
  \class{IncrementalDecoder}, respectively. Incremental codecs can maintain
  state.

  \var{streamreader} and \var{streamwriter}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for errors are \code{'strict'} (raise an exception
  in case of an encoding error), \code{'replace'} (replace malformed
  data with a suitable replacement marker, such as \character{?}),
  \code{'ignore'} (ignore malformed data and continue without further
  notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
  character reference (for encoding only)) and \code{'backslashreplace'}
  (replace with backslashed escape sequences (for encoding only)) as
  well as any other error handling name defined via
  \function{register_error()}.

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
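As a minimal sketch of the protocol described above, the following search
function answers for a single made-up encoding name, \code{examplelatin}, by
reusing the functions and classes of the standard Latin-1 codec. The encoding
name and the function \code{search_example} are illustrative assumptions, and
the example is written for a current Python 3 interpreter, where encoded data
are \class{bytes} objects.

```python
import codecs


def search_example(name):
    # Hypothetical alias: answer only for the made-up name "examplelatin".
    if name != "examplelatin":
        return None  # tell the registry that this function has no match

    base = codecs.lookup("latin-1")
    return codecs.CodecInfo(
        name="examplelatin",
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(search_example)

# The new name can now be used wherever an encoding name is accepted.
encoded = "caf\xe9".encode("examplelatin")
```

Returning \code{None} for every other name lets the registry continue with
the remaining search functions.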

\begin{funcdesc}{lookup}{encoding}
Looks up the codec info in the Python codec registry and returns a
\class{CodecInfo} object as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no \class{CodecInfo}
object is found, a \exception{LookupError} is raised. Otherwise, the
\class{CodecInfo} object is stored in the cache and returned to the caller.
\end{funcdesc}

To simplify access to the various codecs, the module provides these
additional functions which use \function{lookup()} for the codec
lookup:

\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getincrementalencoder}{encoding}
Look up the codec for the given encoding and return its incremental encoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getincrementaldecoder}{encoding}
Look up the codec for the given encoding and return its incremental decoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental decoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its StreamReader
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its StreamWriter
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}
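The lookup helpers above might be used as in the following sketch. It assumes
a current Python 3 interpreter; the encoding name
\code{no-such-encoding-example} is deliberately invented to trigger the
\exception{LookupError}.

```python
import codecs
import io

# Stateless functions return a tuple (output object, length consumed).
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")
data, consumed = encode("h\xe9")
text, used = decode(data)

# Stream classes are instantiated with an open binary stream.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
from_stream = reader.read()

# An unknown encoding name is reported as LookupError.
try:
    codecs.getencoder("no-such-encoding-example")
    found = True
except LookupError:
    found = False
```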

\begin{funcdesc}{register_error}{name, error_handler}
Register the error handling function \var{error_handler} under the
name \var{name}. \var{error_handler} will be called during encoding
and decoding in case of an error, when \var{name} is specified as the
errors parameter.

For encoding, \var{error_handler} will be called with a
\exception{UnicodeEncodeError} instance, which contains information about
the location of the error. The error handler must either raise this or
a different exception or return a tuple with a replacement for the
unencodable part of the input and a position where encoding should
continue. The encoder will encode the replacement and continue encoding
the original input at the specified position. Negative position values
will be treated as being relative to the end of the input string. If the
resulting position is out of bounds, an \exception{IndexError} will be raised.

Decoding and translating work similarly, except that
\exception{UnicodeDecodeError} or \exception{UnicodeTranslateError} will be
passed to the handler and that the replacement from the error handler will be
put into the output directly.
\end{funcdesc}
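A minimal sketch of an encoding error handler following this contract is
shown below. The handler name \code{example.underscore} and the function
\code{underscore_errors} are invented for illustration; the example assumes a
current Python 3 interpreter.

```python
import codecs


def underscore_errors(exc):
    # Hypothetical handler: substitute "_" for each unencodable character.
    if isinstance(exc, UnicodeEncodeError):
        replacement = "_" * (exc.end - exc.start)
        # Return the replacement and the position where encoding resumes.
        return replacement, exc.end
    # Decoding and translating are not handled by this sketch.
    raise exc


codecs.register_error("example.underscore", underscore_errors)

result = "a\u20acb".encode("ascii", "example.underscore")
```

The handler resumes at \code{exc.end}, so the encoder continues directly
after the offending character.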

\begin{funcdesc}{lookup_error}{name}
Return the error handler previously registered under the name \var{name}.

Raises a \exception{LookupError} in case the handler cannot be found.
\end{funcdesc}

\begin{funcdesc}{strict_errors}{exception}
Implements the \code{strict} error handling.
\end{funcdesc}

\begin{funcdesc}{replace_errors}{exception}
Implements the \code{replace} error handling.
\end{funcdesc}

\begin{funcdesc}{ignore_errors}{exception}
Implements the \code{ignore} error handling.
\end{funcdesc}

\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
Implements the \code{xmlcharrefreplace} error handling.
\end{funcdesc}

\begin{funcdesc}{backslashreplace_errors}{exception}
Implements the \code{backslashreplace} error handling.
\end{funcdesc}
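The predefined handlers can also be called directly, which may help when
writing a handler that delegates to one of them. The sketch below builds a
\exception{UnicodeEncodeError} by hand for illustration and assumes a current
Python 3 interpreter.

```python
import codecs

# A hand-built exception describing one unencodable character at index 1.
exc = UnicodeEncodeError("ascii", "a\u20acb", 1, 2, "ordinal not in range(128)")

# Each handler returns (replacement, position at which to resume).
ignored = codecs.ignore_errors(exc)
replaced = codecs.replace_errors(exc)
as_xml = codecs.xmlcharrefreplace_errors(exc)
escaped = codecs.backslashreplace_errors(exc)

# The strict handler re-raises the exception it is given.
try:
    codecs.strict_errors(exc)
    raised = False
except UnicodeEncodeError:
    raised = True
```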

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\note{The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.}

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'} which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}
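The following sketch writes Unicode text through \function{codecs.open()} and
reads it back. The temporary directory and the file name \code{example.txt}
are assumptions made for the example, which targets a current Python 3
interpreter; newer releases may emit a deprecation warning for this function.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Text written to the wrapper is encoded before it reaches the file.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("gr\xfc\xdf\n")

# Bytes read from the file are decoded before they are returned.
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# The file itself contains the UTF-8 encoded bytes.
with open(path, "rb") as f:
    raw = f.read()
```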

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of file which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
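As an illustration of the translation described above, the sketch below
recodes between UTF-8 on the caller's side and Latin-1 in the underlying
stream. In-memory \class{io.BytesIO} objects stand in for real files, and the
example assumes a current Python 3 interpreter.

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes reach the underlying stream.
target = io.BytesIO()
wrapped = codecs.EncodedFile(target, "utf-8", "latin-1")
wrapped.write(b"gr\xc3\xbc\xc3\x9f")
stored = target.getvalue()

# Reading: Latin-1 bytes in the stream come back as UTF-8 bytes.
source = io.BytesIO(b"gr\xfc\xdf")
recoded = codecs.EncodedFile(source, "utf-8", "latin-1").read()
```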

\begin{funcdesc}{iterencode}{iterable, encoding\optional{, errors}}
Uses an incremental encoder to iteratively encode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{iterdecode}{iterable, encoding\optional{, errors}}
Uses an incremental decoder to iteratively decode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental decoder.
\versionadded{2.5}
\end{funcdesc}
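Both generators can be exercised with small in-memory inputs, as in this
sketch for a current Python 3 interpreter. Note that the second input chunk
for \function{iterdecode()} starts in the middle of a multi-byte sequence.

```python
import codecs

# Encode a sequence of text chunks one at a time.
encoded = b"".join(codecs.iterencode(["h", "\xe9", "llo"], "utf-8"))

# Decode byte chunks that split the two-byte sequence for U+00E9.
decoded = "".join(codecs.iterdecode([b"h\xc3", b"\xa9llo"], "utf-8"))
```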

The module also provides the following constants which are useful
for reading and writing to platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM_UTF8}
\dataline{BOM_UTF16}
\dataline{BOM_UTF16_BE}
\dataline{BOM_UTF16_LE}
\dataline{BOM_UTF32}
\dataline{BOM_UTF32_BE}
\dataline{BOM_UTF32_LE}
These constants define various encodings of the Unicode byte order mark
(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
used in the stream or file and in UTF-8 as a Unicode signature.
\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
\constant{BOM_UTF16_LE} depending on the platform's native byte order,
\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
The others represent the BOM in UTF-8 and UTF-32 encodings.
\end{datadesc}
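One possible use of these constants is to strip a UTF-8 signature from raw
data before decoding it. The helper \code{strip_utf8_signature} below is a
hypothetical name introduced for this sketch, which assumes a current
Python 3 interpreter.

```python
import codecs
import sys


def strip_utf8_signature(data):
    # Hypothetical helper: drop a leading UTF-8 signature if one is present.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_signature(codecs.BOM_UTF8 + b"abc")

# BOM_UTF16 follows the native byte order of the platform.
if sys.byteorder == "little":
    native = codecs.BOM_UTF16_LE
else:
    native = codecs.BOM_UTF16_BE
```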


\subsection{Codec Base Classes \label{codec-base-classes}}

The \module{codecs} module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use
in Python.

Each codec has to define four interfaces to make it usable as a codec in
Python: stateless encoder, stateless decoder, stream reader and stream
writer. The stream reader and writers typically reuse the stateless
encoder/decoder to implement the file protocols.

The \class{Codec} class defines the interface for stateless
encoders/decoders.

To simplify and standardize error handling, the \method{encode()} and
\method{decode()} methods may implement different error handling
schemes by providing the \var{errors} string argument. The following
string values are defined and implemented by all standard Python
codecs:

\begin{tableii}{l|l}{code}{Value}{Meaning}
  \lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
                    this is the default.}
  \lineii{'ignore'}{Ignore the character and continue with the next.}
  \lineii{'replace'}{Replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the built-in Unicode codecs on
                     decoding and '?' on encoding.}
  \lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
                     character reference (only for encoding).}
  \lineii{'backslashreplace'}{Replace with backslashed escape sequences
                     (only for encoding).}
\end{tableii}

The set of allowed values can be extended via \function{register_error()}.
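The effect of each value in the table can be observed with the built-in ASCII
codec, as in the following sketch for a current Python 3 interpreter.

```python
import codecs

text = "a\u20ac"  # the euro sign cannot be encoded in ASCII

ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
as_xml = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# On decoding, 'replace' inserts U+FFFD REPLACEMENT CHARACTER.
decoded = b"a\xff".decode("ascii", "replace")

# 'strict' raises UnicodeError (here, the subclass UnicodeEncodeError).
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeError:
    strict_failed = True
```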


\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}{encode}{input\optional{, errors}}
  Encodes the object \var{input} and returns a tuple (output object,
  length consumed). While codecs are not restricted to use with Unicode, in
  a Unicode context, encoding converts a Unicode object to a plain string
  using a particular character set encoding (e.g., \code{cp1252} or
  \code{iso-8859-1}).

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamWriter} for codecs which have to keep state in order to
  make encoding efficient.

  The encoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}{decode}{input\optional{, errors}}
  Decodes the object \var{input} and returns a tuple (output object,
  length consumed). In a Unicode context, decoding converts a plain string
  encoded using a particular character set encoding to a Unicode object.

  \var{input} must be an object which provides the \code{bf_getreadbuf}
  buffer slot. Python strings, buffer objects and memory mapped files
  are examples of objects providing this slot.

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamReader} for codecs which have to keep state in order to
  make decoding efficient.

  The decoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}
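A stateless codec following this interface could look like the sketch below.
The class \code{UpperAsciiCodec} is a made-up example that stores text as
upper-case ASCII; it is not part of the module. The code assumes a current
Python 3 interpreter, where the plain strings mentioned above correspond to
\class{bytes} objects.

```python
import codecs


class UpperAsciiCodec(codecs.Codec):
    """Hypothetical codec: text is stored as upper-case ASCII bytes."""

    def encode(self, input, errors="strict"):
        # Return (output object, length consumed); no state is kept.
        return input.upper().encode("ascii", errors), len(input)

    def decode(self, input, errors="strict"):
        data = bytes(input)  # accept any object exposing a read buffer
        return data.decode("ascii", errors).lower(), len(data)


codec = UpperAsciiCodec()
encoded = codec.encode("abc")
decoded = codec.decode(b"ABC")

# Zero length input yields an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```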

The \class{IncrementalEncoder} and \class{IncrementalDecoder} classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function,
but with multiple calls to the \method{encode}/\method{decode} method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the \method{encode}/\method{decode} method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


\subsubsection{IncrementalEncoder Objects \label{incremental-encoder-objects}}

\versionadded{2.5}

The \class{IncrementalEncoder} class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalEncoder}{\optional{errors}}
  Constructor for an \class{IncrementalEncoder} instance.

  All incremental encoders must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  The \class{IncrementalEncoder} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
    \item \code{'xmlcharrefreplace'} Replace with the appropriate XML
                     character reference.
    \item \code{'backslashreplace'} Replace with backslashed escape sequences.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{IncrementalEncoder} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{encode}{object\optional{, final}}
  Encodes \var{object} (taking the current state of the encoder into account)
  and returns the resulting encoded object. If this is the last call to
  \method{encode}, \var{final} must be true (the default is false).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Reset the encoder to the initial state.
\end{methoddesc}
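The state kept by an incremental encoder can be seen with the UTF-16 codec,
which emits a byte order mark only at the start of the output. The sketch
below assumes a current Python 3 interpreter.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# The first call emits the BOM; later calls continue without it.
first = encoder.encode("a")
second = encoder.encode("b", final=True)
joined = first + second

# After reset() the encoder starts over, so the BOM is emitted again.
encoder.reset()
again = encoder.encode("a", final=True)
```

The joined output equals the result of encoding \code{"ab"} in one step with
the stateless encoder.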


\subsubsection{IncrementalDecoder Objects \label{incremental-decoder-objects}}

The \class{IncrementalDecoder} class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalDecoder}{\optional{errors}}
  Constructor for an \class{IncrementalDecoder} instance.

  All incremental decoders must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  The \class{IncrementalDecoder} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{IncrementalDecoder} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{decode}{object\optional{, final}}
  Decodes \var{object} (taking the current state of the decoder into account)
  and returns the resulting decoded object. If this is the last call to
  \method{decode}, \var{final} must be true (the default is false).
  If \var{final} is true the decoder must decode the input completely and must
  flush all buffers. If this isn't possible (e.g. because of incomplete byte
  sequences at the end of the input) it must initiate error handling just like
  in the stateless case (which might raise an exception).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Reset the decoder to the initial state.
\end{methoddesc}
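The role of \var{final} can be illustrated with a UTF-8 sequence that is
split across two calls, as in this sketch for a current Python 3 interpreter.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# The euro sign is three bytes in UTF-8; the first call sees only two
# of them, so the decoder buffers them and returns nothing yet.
first = decoder.decode(b"\xe2\x82")
second = decoder.decode(b"\xac", final=True)

# With final=True an incomplete sequence triggers error handling.
decoder.reset()
try:
    decoder.decode(b"\xe2\x82", final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```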


The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
  Constructor for a \class{StreamWriter} instance.

  All stream writers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for writing binary
  data.

  The \class{StreamWriter} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
    \item \code{'xmlcharrefreplace'} Replace with the appropriate XML
                     character reference.
    \item \code{'backslashreplace'} Replace with backslashed escape sequences.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{StreamWriter} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{write}{object}
  Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
  Writes the concatenated list of strings to the stream (possibly by
  reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Flushes and resets the codec buffers used for keeping state.

  Calling this method should ensure that the data on the output is put
  into a clean state that allows appending of new fresh data without
  having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
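A stream writer might be used as in the following sketch, where an in-memory
\class{io.BytesIO} object stands in for a file opened for writing binary
data. The example assumes a current Python 3 interpreter.

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)

writer.write("gr\xfc")
writer.writelines(["\xdf", "\n"])
utf8_bytes = buffer.getvalue()

# The errors attribute can be changed during the writer's lifetime.
ascii_buffer = io.BytesIO()
ascii_writer = codecs.getwriter("ascii")(ascii_buffer, errors="strict")
ascii_writer.errors = "replace"
ascii_writer.write("\u20ac")
ascii_bytes = ascii_buffer.getvalue()
```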


\subsubsection{StreamReader Objects \label{stream-reader-objects}}

The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamReader}{stream\optional{, errors}}
  Constructor for a \class{StreamReader} instance.

  All stream readers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for reading (binary)
  data.

  The \class{StreamReader} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{StreamReader} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{read}{\optional{size\optional{, chars\optional{, firstline}}}}
  Decodes data from the stream and returns the resulting object.

  \var{chars} indicates the number of characters to read from the
  stream. \function{read()} will never return more than \var{chars}
  characters, but it might return fewer if there are not enough
  characters available.

  \var{size} indicates the approximate maximum number of bytes to read
  from the stream for decoding purposes. The decoder can modify this
  setting as appropriate. The default value -1 indicates to read and
  decode as much as possible. \var{size} is intended to prevent having
  to decode huge files in one step.

  \var{firstline} indicates that it would be sufficient to only return
  the first line, if there are decoding errors on later lines.

  The method should use a greedy read strategy meaning that it should
  read as much data as is allowed within the definition of the encoding
  and the given size, e.g. if optional encoding endings or state
  markers are available on the stream, these should be read too.

  \versionchanged[\var{chars} argument added]{2.4}
  \versionchanged[\var{firstline} argument added]{2.4.2}
\end{methoddesc}

\begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
  Read one line from the input stream and return the
  decoded data.

  \var{size}, if given, is passed as the \var{size} argument to the
  stream's \method{readline()} method.

  If \var{keepends} is false, line endings will be stripped from the
  lines returned.

  \versionchanged[\var{keepends} argument added]{2.4}
\end{methoddesc}

\begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
  Read all lines available on the input stream and return them as a list
  of lines.

  Line endings are implemented using the codec's decoder method and are
  included in the list entries if \var{keepends} is true.

  \var{sizehint}, if given, is passed as the \var{size} argument to the
  stream's \method{read()} method.
\end{methoddesc}
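
The effect of \var{keepends} on \method{readline()} and
\method{readlines()} can be sketched as follows. This is a minimal
example assuming Python 3 syntax and an in-memory stream; the three
sample lines are invented for the illustration.

```python
import codecs
import io

data = "first\nsecond\nthird\n".encode("utf-8")

reader = codecs.getreader("utf-8")(io.BytesIO(data))
with_ending = reader.readline()                   # line ending kept by default
without_ending = reader.readline(keepends=False)  # line ending stripped
remaining = reader.readlines()                    # list of the lines left over
```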

\begin{methoddesc}{reset}{}
  Resets the codec buffers used for keeping state.

  Note that no stream repositioning should take place. This method is
  primarily intended to be able to recover from decoding errors.
\end{methoddesc}

In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.

The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.


\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}

The \class{StreamReaderWriter} allows wrapping streams which work in
both read and write modes.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
  Creates a \class{StreamReaderWriter} instance.
  \var{stream} must be a file-like object.
  \var{Reader} and \var{Writer} must be factory functions or classes
  providing the \class{StreamReader} and \class{StreamWriter} interface
  respectively.
  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
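
A possible way to build such an instance from the factories returned by
\function{lookup()} is sketched below. The example assumes Python 3
syntax, uses an \class{io.BytesIO} object as a stand-in for a real
read/write file, and picks UTF-8 and the sample text arbitrarily.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()  # stands in for any file-like object opened for read/write

stream = codecs.StreamReaderWriter(raw, info.streamreader,
                                   info.streamwriter, "strict")
stream.write("grüß dich")       # encoded to UTF-8 on the way in
encoded = raw.getvalue()

stream.seek(0)                  # rewind the underlying stream
decoded = stream.read()         # decoded from UTF-8 on the way out
```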


\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}

The \class{StreamRecoder} provides a frontend-backend view of
encoding data, which is sometimes useful when dealing with different
encoding environments.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamRecoder}{stream, encode, decode,
                                 Reader, Writer, errors}
  Creates a \class{StreamRecoder} instance which implements a two-way
  conversion: \var{encode} and \var{decode} work on the frontend (the
  data visible to code calling \method{read()} and \method{write()})
  while \var{Reader} and \var{Writer} work on the backend (reading and
  writing to the stream).

  You can use these objects to do transparent direct recodings from
  e.g.\ Latin-1 to UTF-8 and back.

  \var{stream} must be a file-like object.

  \var{encode} and \var{decode} must adhere to the \class{Codec}
  interface. \var{Reader} and \var{Writer} must be factory functions or
  classes providing objects of the \class{StreamReader} and
  \class{StreamWriter} interface respectively.

  \var{encode} and \var{decode} are needed for the frontend
  translation, \var{Reader} and \var{Writer} for the backend
  translation. The intermediate format used is determined by the two
  sets of codecs, e.g.\ the Unicode codecs will use Unicode as the
  intermediate encoding.

  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
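
The frontend/backend split may be easier to see in a small sketch. Here
the stream stores Latin-1 (backend) while the caller reads and writes
UTF-8 bytes (frontend). The example assumes Python 3 syntax and
in-memory streams; the choice of encodings and the sample word are
illustrative only.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")      # frontend: what the caller sees
latin1 = codecs.lookup("latin-1")  # backend: what the stream stores

# Reading: Latin-1 bytes on the stream come back as UTF-8 bytes.
source = io.BytesIO("café".encode("latin-1"))
recoder = codecs.StreamRecoder(source, utf8.encode, utf8.decode,
                               latin1.streamreader, latin1.streamwriter,
                               "strict")
as_utf8 = recoder.read()

# Writing: UTF-8 bytes handed to write() are stored as Latin-1.
sink = io.BytesIO()
recoder = codecs.StreamRecoder(sink, utf8.encode, utf8.decode,
                               latin1.streamreader, latin1.streamwriter,
                               "strict")
recoder.write("café".encode("utf-8"))
stored = sink.getvalue()
```

In both directions Unicode is the intermediate format, because both
sets of codecs are Unicode codecs.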

\subsection{Encodings and Unicode\label{encodings-overview}}

Unicode strings are stored internally as sequences of codepoints (to
be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
compiled (either via \longprogramopt{enable-unicode=ucs2} or
\longprogramopt{enable-unicode=ucs4}, with the former being the default)
\ctype{Py_UNICODE} is either a 16-bit or
32-bit data type. Once a Unicode object is used outside of CPU and
memory, CPU endianness and how these arrays are stored as bytes become
an issue. Transforming a Unicode object into a sequence of bytes is
called encoding and recreating the Unicode object from the sequence of
bytes is known as decoding. There are many different methods for how this
transformation can be done (these methods are also called encodings).
The simplest method is to map the codepoints 0-255 to the bytes
\code{0x0}-\code{0xff}. This means that a Unicode object that contains
codepoints above \code{U+00FF} can't be encoded with this method (which
is called \code{'latin-1'} or \code{'iso-8859-1'}).
\function{unicode.encode()} will raise a \exception{UnicodeEncodeError}
that looks like this: \samp{UnicodeEncodeError: 'latin-1' codec can't
encode character u'\e u1234' in position 3: ordinal not in range(256)}.
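
The failure described above can be reproduced with a few lines. The
sketch assumes Python 3 syntax, where the method lives on \class{str}
rather than \class{unicode}; the sample strings are arbitrary.

```python
text = "abc\u1234"   # U+1234 lies outside the Latin-1 range

try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # Keep the details that also appear in the error message.
    failure = (exc.encoding, exc.start, exc.reason)

# Code points up to U+00FF map one-to-one onto byte values.
encoded = "\u00e9\u00ff".encode("latin-1")
```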

There's another group of encodings (the so-called charmap encodings)
that choose a different subset of all Unicode code points and how
these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
To see how this is done, simply open e.g.\ \file{encodings/cp1252.py}
(which is an encoding that is used primarily on Windows).
There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
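
Instead of opening the file, the table can also be inspected from
Python. The sketch below assumes Python 3 syntax and that the string
constant is named \code{decoding_table}, as it is in current CPython
sources; older releases may expose the mapping under a different name.

```python
import encodings.cp1252

table = encodings.cp1252.decoding_table  # 256-character string constant
euro = table[0x80]                       # character stored as byte 0x80

decoded = b"\x80".decode("cp1252")
encoded = "\u20ac".encode("cp1252")
```

Byte \code{0x80} is a control character in iso-8859-1, but cp1252 uses
that position for the EURO SIGN.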

All of these encodings can only encode 256 of the 65536 (or 1114112)
codepoints defined in Unicode. A simple and straightforward way that
can store each Unicode code point is to store each codepoint as two
consecutive bytes. There are two possibilities: Store the bytes in big
endian or in little endian order. These two encodings are called
UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
e.g.\ you use UTF-16-BE on a little endian machine you will always have
to swap bytes on encoding and decoding. UTF-16 avoids this problem:
Bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped
though. To be able to detect the endianness of a UTF-16 byte sequence,
there's the so-called BOM (the ``Byte Order Mark''). This is the Unicode
character \code{U+FEFF}. This character will be prepended to every UTF-16
byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
an illegal character that may not appear in a Unicode text. So when
the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE}
the bytes have to be swapped on decoding. Unfortunately, up to Unicode
4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
NO-BREAK SPACE}: A character that has no width and doesn't allow a
word to be split. It can e.g.\ be used to give hints to a ligature
algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
this role). Nevertheless Unicode software still must be able to handle
\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
layout of the encoded bytes, and vanishes once the byte sequence has
been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
it's a normal character that will be decoded like any other.
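
The two roles of \code{U+FEFF} can be observed with the UTF-16 codecs.
The following sketch assumes Python 3 syntax; it relies only on the
\code{BOM_UTF16_BE} and \code{BOM_UTF16_LE} constants of the
\module{codecs} module and a one-character sample string.

```python
import codecs

be = "a".encode("utf-16-be")   # explicit byte order, no BOM written
le = "a".encode("utf-16-le")
native = "a".encode("utf-16")  # BOM first, then native byte order

# The BOM tells the generic decoder which order to use and then vanishes.
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")

# The explicit-endian codecs do not interpret U+FEFF; it is decoded
# as an ordinary character.
kept = (codecs.BOM_UTF16_BE + be).decode("utf-16-be")
```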

There's another encoding that is able to encode the full range of
Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
there are no issues with byte order in UTF-8. Each byte in a UTF-8
byte sequence consists of two parts: Marker bits (the most significant
bits) and payload bits. The marker bits are a sequence of zero to six
1 bits followed by a 0 bit. Unicode characters are encoded like this
(with x being payload bits, which when concatenated give the Unicode
character):

\begin{tableii}{l|l}{textrm}{Range}{Encoding}
\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
\end{tableii}

The least significant bit of the Unicode character is the rightmost x
bit.
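
The bit layout in the table can be turned into code fairly directly.
The function below is a hand-written sketch for illustration (its name
\code{utf8_bytes} is made up and it is not how the built-in codec is
implemented); it assumes Python 3 syntax. The five- and six-byte forms
follow the table, although current definitions of UTF-8 stop at
\code{U+10FFFF} and therefore never need more than four bytes.

```python
def utf8_bytes(code_point):
    """Encode one code point following the bit layout in the table above."""
    if code_point < 0x80:
        return bytes([code_point])          # 0xxxxxxx
    # (exclusive upper bound, total bytes, marker bits of the first byte)
    layouts = (
        (0x800, 2, 0xC0),        # 110xxxxx
        (0x10000, 3, 0xE0),      # 1110xxxx
        (0x200000, 4, 0xF0),     # 11110xxx
        (0x4000000, 5, 0xF8),    # 111110xx
        (0x80000000, 6, 0xFC),   # 1111110x
    )
    for bound, length, marker in layouts:
        if code_point < bound:
            break
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(length - 1):
        # Each continuation byte is 10xxxxxx and carries six payload bits.
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # What is left fits behind the marker bits of the first byte.
    return bytes([marker | code_point] + tail[::-1])


# Compare against the built-in codec for one sample from each of the
# first four rows of the table.
samples = "A\u00e9\u20ac\U0001F600"
matches = [utf8_bytes(ord(ch)) == ch.encode("utf-8") for ch in samples]
```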

As UTF-8 is an 8-bit encoding no BOM is required and any \code{U+FEFF}
character in the decoded Unicode string (even if it's the first
character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.

Without external information it's impossible to reliably determine
which encoding was used for encoding a Unicode string. Each charmap
encoding can decode any random byte sequence. However that's not
possible with UTF-8, as UTF-8 byte sequences have a structure that
doesn't allow arbitrary byte sequences. To increase the reliability
with which a UTF-8 encoding can be detected, Microsoft invented a
variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
program: Before any of the Unicode characters is written to the file,
a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
charmap encoded file starts with these byte values (which would e.g.\ map to

   LATIN SMALL LETTER I WITH DIAERESIS \\
   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
   INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig
encoding can be correctly guessed from the byte sequence. So here the
BOM is not used to be able to determine the byte order used for
generating the byte sequence, but as a signature that helps in
guessing the encoding. On encoding, the utf-8-sig codec will write
\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
On decoding, utf-8-sig will skip those three bytes if they appear as the
first three bytes in the file.
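
The difference between the plain UTF-8 codec and the signature variant
shows up in a short round trip. The sketch assumes Python 3 syntax and
an arbitrary three-letter sample.

```python
import codecs

data = "abc".encode("utf-8-sig")

with_sig = data.decode("utf-8-sig")   # signature skipped
plain = data.decode("utf-8")          # signature decoded as U+FEFF
no_sig = b"abc".decode("utf-8-sig")   # the signature is optional on input
```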


\subsection{Standard Encodings\label{standard-encodings}}

Python comes with a number of codecs built-in, either implemented as C
functions or with dictionaries as mapping tables. The following table
lists the codecs by name, together with a few common aliases, and the
languages for which the encoding is likely to be used. Neither the list of
aliases nor the list of languages is meant to be exhaustive. Notice
that spelling alternatives that only differ in case or use a hyphen
instead of an underscore are also valid aliases.
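
The alias handling can be checked with \function{lookup()}. In this
sketch (Python 3 syntax assumed) several spellings taken from the
\code{latin_1} row of the table below are resolved and the names of the
codecs found are collected.

```python
import codecs

spellings = ["latin_1", "Latin-1", "ISO-8859-1", "iso8859-1", "L1"]

# Every spelling should resolve to the same codec, so the set of
# canonical names ends up with a single entry.
canonical = {codecs.lookup(name).name for name in spellings}
```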

Many of the character sets support the same languages. They vary in
individual characters (e.g.\ whether the EURO SIGN is supported or
not), and in the assignment of characters to code positions. For the
European languages in particular, the following variants typically
exist:

\begin{itemize}
\item an ISO 8859 codeset
\item a Microsoft Windows code page, which is typically derived from
      an 8859 codeset, but replaces control characters with additional
      graphic characters
\item an IBM EBCDIC code page
\item an IBM PC code page, which is \ASCII{} compatible
\end{itemize}

\begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}

\lineiii{ascii}
        {646, us-ascii}
        {English}

\lineiii{big5}
        {big5-tw, csbig5}
        {Traditional Chinese}

\lineiii{big5hkscs}
        {big5-hkscs, hkscs}
        {Traditional Chinese}

\lineiii{cp037}
        {IBM037, IBM039}
        {English}

\lineiii{cp424}
        {EBCDIC-CP-HE, IBM424}
        {Hebrew}

\lineiii{cp437}
        {437, IBM437}
        {English}

\lineiii{cp500}
        {EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
        {Western Europe}

\lineiii{cp737}
        {}
        {Greek}

\lineiii{cp775}
        {IBM775}
        {Baltic languages}

\lineiii{cp850}
        {850, IBM850}
        {Western Europe}

\lineiii{cp852}
        {852, IBM852}
        {Central and Eastern Europe}

\lineiii{cp855}
        {855, IBM855}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{cp856}
        {}
        {Hebrew}

\lineiii{cp857}
        {857, IBM857}
        {Turkish}

\lineiii{cp860}
        {860, IBM860}
        {Portuguese}

\lineiii{cp861}
        {861, CP-IS, IBM861}
        {Icelandic}

\lineiii{cp862}
        {862, IBM862}
        {Hebrew}

\lineiii{cp863}
        {863, IBM863}
        {Canadian}

\lineiii{cp864}
        {IBM864}
        {Arabic}

\lineiii{cp865}
        {865, IBM865}
        {Danish, Norwegian}

\lineiii{cp866}
        {866, IBM866}
        {Russian}

\lineiii{cp869}
        {869, CP-GR, IBM869}
        {Greek}

\lineiii{cp874}
        {}
        {Thai}

\lineiii{cp875}
        {}
        {Greek}

\lineiii{cp932}
        {932, ms932, mskanji, ms-kanji}
        {Japanese}

\lineiii{cp949}
        {949, ms949, uhc}
        {Korean}

\lineiii{cp950}
        {950, ms950}
        {Traditional Chinese}

\lineiii{cp1006}
        {}
        {Urdu}

\lineiii{cp1026}
        {ibm1026}
        {Turkish}

\lineiii{cp1140}
        {ibm1140}
        {Western Europe}

\lineiii{cp1250}
        {windows-1250}
        {Central and Eastern Europe}

\lineiii{cp1251}
        {windows-1251}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{cp1252}
        {windows-1252}
        {Western Europe}

\lineiii{cp1253}
        {windows-1253}
        {Greek}

\lineiii{cp1254}
        {windows-1254}
        {Turkish}

\lineiii{cp1255}
        {windows-1255}
        {Hebrew}

\lineiii{cp1256}
        {windows-1256}
        {Arabic}

\lineiii{cp1257}
        {windows-1257}
        {Baltic languages}

\lineiii{cp1258}
        {windows-1258}
        {Vietnamese}

\lineiii{euc_jp}
        {eucjp, ujis, u-jis}
        {Japanese}

\lineiii{euc_jis_2004}
        {jisx0213, eucjis2004}
        {Japanese}

\lineiii{euc_jisx0213}
        {eucjisx0213}
        {Japanese}

\lineiii{euc_kr}
        {euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
        {Korean}

\lineiii{gb2312}
        {chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
         gb2312-80, iso-ir-58}
        {Simplified Chinese}

\lineiii{gbk}
        {936, cp936, ms936}
        {Unified Chinese}

\lineiii{gb18030}
        {gb18030-2000}
        {Unified Chinese}

\lineiii{hz}
        {hzgb, hz-gb, hz-gb-2312}
        {Simplified Chinese}

\lineiii{iso2022_jp}
        {csiso2022jp, iso2022jp, iso-2022-jp}
        {Japanese}

\lineiii{iso2022_jp_1}
        {iso2022jp-1, iso-2022-jp-1}
        {Japanese}

\lineiii{iso2022_jp_2}
        {iso2022jp-2, iso-2022-jp-2}
        {Japanese, Korean, Simplified Chinese, Western Europe, Greek}

\lineiii{iso2022_jp_2004}
        {iso2022jp-2004, iso-2022-jp-2004}
        {Japanese}

\lineiii{iso2022_jp_3}
        {iso2022jp-3, iso-2022-jp-3}
        {Japanese}

\lineiii{iso2022_jp_ext}
        {iso2022jp-ext, iso-2022-jp-ext}
        {Japanese}

\lineiii{iso2022_kr}
        {csiso2022kr, iso2022kr, iso-2022-kr}
        {Korean}

\lineiii{latin_1}
        {iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
        {Western Europe}

\lineiii{iso8859_2}
        {iso-8859-2, latin2, L2}
        {Central and Eastern Europe}

\lineiii{iso8859_3}
        {iso-8859-3, latin3, L3}
        {Esperanto, Maltese}

\lineiii{iso8859_4}
        {iso-8859-4, latin4, L4}
        {Baltic languages}

\lineiii{iso8859_5}
        {iso-8859-5, cyrillic}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{iso8859_6}
        {iso-8859-6, arabic}
        {Arabic}

\lineiii{iso8859_7}
        {iso-8859-7, greek, greek8}
        {Greek}

\lineiii{iso8859_8}
        {iso-8859-8, hebrew}
        {Hebrew}

\lineiii{iso8859_9}
        {iso-8859-9, latin5, L5}
        {Turkish}

\lineiii{iso8859_10}
        {iso-8859-10, latin6, L6}
        {Nordic languages}

\lineiii{iso8859_13}
        {iso-8859-13}
        {Baltic languages}

\lineiii{iso8859_14}
        {iso-8859-14, latin8, L8}
        {Celtic languages}

\lineiii{iso8859_15}
        {iso-8859-15}
        {Western Europe}

\lineiii{johab}
        {cp1361, ms1361}
        {Korean}

\lineiii{koi8_r}
        {}
        {Russian}

\lineiii{koi8_u}
        {}
        {Ukrainian}

\lineiii{mac_cyrillic}
        {maccyrillic}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{mac_greek}
        {macgreek}
        {Greek}

\lineiii{mac_iceland}
        {maciceland}
        {Icelandic}

\lineiii{mac_latin2}
        {maclatin2, maccentraleurope}
        {Central and Eastern Europe}

\lineiii{mac_roman}
        {macroman}
        {Western Europe}

\lineiii{mac_turkish}
        {macturkish}
        {Turkish}

\lineiii{ptcp154}
        {csptcp154, pt154, cp154, cyrillic-asian}
        {Kazakh}

\lineiii{shift_jis}
        {csshiftjis, shiftjis, sjis, s_jis}
        {Japanese}

\lineiii{shift_jis_2004}
        {shiftjis2004, sjis_2004, sjis2004}
        {Japanese}

\lineiii{shift_jisx0213}
        {shiftjisx0213, sjisx0213, s_jisx0213}
        {Japanese}

\lineiii{utf_16}
        {U16, utf16}
        {all languages}

\lineiii{utf_16_be}
        {UTF-16BE}
        {all languages (BMP only)}

\lineiii{utf_16_le}
        {UTF-16LE}
        {all languages (BMP only)}

\lineiii{utf_7}
        {U7, unicode-1-1-utf-7}
        {all languages}

\lineiii{utf_8}
        {U8, UTF, utf8}
        {all languages}

\lineiii{utf_8_sig}
        {}
        {all languages}

\end{longtableiii}

A number of codecs are specific to Python, so their codec names have
no meaning outside Python. Some of them don't convert from Unicode
strings to byte strings, but instead use the property of the Python
codecs machinery that any bijective function with one argument can be
considered as an encoding.

For the codecs listed below, the result in the ``encoding'' direction
is always a byte string. The result of the ``decoding'' direction is
listed as operand type in the table.

\begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}

\lineiv{base64_codec}
         {base64, base-64}
         {byte string}
         {Convert operand to MIME base64}

\lineiv{bz2_codec}
         {bz2}
         {byte string}
         {Compress the operand using bz2}

\lineiv{hex_codec}
         {hex}
         {byte string}
         {Convert operand to hexadecimal representation, with two
          digits per byte}

\lineiv{idna}
         {}
         {Unicode string}
         {Implements \rfc{3490}.
          \versionadded{2.3}
          See also \refmodule{encodings.idna}}

\lineiv{mbcs}
         {dbcs}
         {Unicode string}
         {Windows only: Encode operand according to the ANSI codepage (CP_ACP)}

\lineiv{palmos}
         {}
         {Unicode string}
         {Encoding of PalmOS 3.5}

\lineiv{punycode}
         {}
         {Unicode string}
         {Implements \rfc{3492}.
          \versionadded{2.3}}

\lineiv{quopri_codec}
         {quopri, quoted-printable, quotedprintable}
         {byte string}
         {Convert operand to MIME quoted printable}

\lineiv{raw_unicode_escape}
         {}
         {Unicode string}
         {Produce a string that is suitable as raw Unicode literal in
          Python source code}

\lineiv{rot_13}
         {rot13}
         {Unicode string}
         {Returns the Caesar-cypher encryption of the operand}

\lineiv{string_escape}
         {}
         {byte string}
         {Produce a string that is suitable as string literal in
          Python source code}

\lineiv{undefined}
         {}
         {any}
         {Raise an exception for all conversions. Can be used as the
          system encoding if no automatic coercion between byte and
          Unicode strings is desired.}

\lineiv{unicode_escape}
         {}
         {Unicode string}
         {Produce a string that is suitable as Unicode literal in
          Python source code}

\lineiv{unicode_internal}
         {}
         {Unicode string}
         {Return the internal representation of the operand}

\lineiv{uu_codec}
         {uu}
         {byte string}
         {Convert the operand using uuencode}

\lineiv{zlib_codec}
         {zip, zlib}
         {byte string}
         {Compress the operand using gzip}

\end{tableiv}
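
A few of these Python-specific codecs are exercised in the sketch
below. It assumes Python 3, where such codecs are reached through
\function{codecs.encode()} and \function{codecs.decode()} rather than
through string methods, and where some table entries (for example
\code{string_escape}) are no longer available; the sample inputs are
arbitrary.

```python
import codecs

hexed = codecs.encode(b"hello", "hex_codec")      # two hex digits per byte
unhexed = codecs.decode(hexed, "hex_codec")

b64 = codecs.encode(b"hello", "base64_codec")     # MIME base64, newline-terminated
rot = codecs.encode("Hello", "rot_13")            # text in, text out
```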

\subsection{\module{encodings.idna} ---
            Internationalized Domain Names in Applications}

\declaremodule{standard}{encodings.idna}
\modulesynopsis{Internationalized Domain Names implementation}
% XXX The next line triggers a formatting bug, so it's commented out
% until that can be fixed.
%\moduleauthor{Martin v. L\"owis}

\versionadded{2.3}

This module implements \rfc{3490} (Internationalized Domain Names in
Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the
\code{punycode} encoding and \refmodule{stringprep}.

These RFCs together define a protocol to support non-\ASCII{} characters
in domain names. A domain name containing non-\ASCII{} characters (such
as ``www.Alliancefran\c caise.nu'') is converted into an
\ASCII-compatible encoding (ACE, such as
``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
is then used in all places where arbitrary characters are not allowed
by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
on. This conversion is carried out in the application, if possible
invisibly to the user: The application should transparently convert
Unicode domain labels to IDNA on the wire, and convert ACE labels back
to Unicode before presenting them to the user.

Python supports this conversion in several ways: The \code{idna} codec
converts between Unicode and the ACE. Furthermore, the
\refmodule{socket} module transparently converts Unicode host names to
ACE, so that applications need not be concerned about converting host
names themselves when they pass them to the socket module. On top of
that, modules that have host names as function parameters, such as
\refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
(\refmodule{httplib} then also transparently sends an IDNA hostname in
the \mailheader{Host} field if it sends that field at all).
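
Using the domain name from the text above, the \code{idna} codec can be
tried out as follows. The sketch assumes Python 3 syntax, where
encoding yields a \class{bytes} object and decoding yields a
\class{str}.

```python
name = "www.Alliancefran\u00e7aise.nu"

ace = name.encode("idna")    # ASCII-compatible form for use on the wire
back = ace.decode("idna")    # Unicode form for presentation to the user
```

The round trip does not restore the capital letter, because the
non-\ASCII{} label is case-folded before it is converted.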

When receiving host names from the wire (such as in reverse name
lookup), no automatic conversion to Unicode is performed: Applications
wishing to present such host names to the user should decode them to
Unicode.

The module \module{encodings.idna} also implements the nameprep
procedure, which performs certain normalizations on host names, to
achieve case-insensitivity of international domain names, and to unify
similar characters. The nameprep functions can be used directly if
desired.

\begin{funcdesc}{nameprep}{label}
Return the nameprepped version of \var{label}. The implementation
currently assumes query strings, so \code{AllowUnassigned} is
true.
\end{funcdesc}

\begin{funcdesc}{ToASCII}{label}
Convert a label to \ASCII, as specified in \rfc{3490}.
\code{UseSTD3ASCIIRules} is assumed to be false.
\end{funcdesc}

\begin{funcdesc}{ToUnicode}{label}
Convert a label to Unicode, as specified in \rfc{3490}.
\end{funcdesc}
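
The three functions operate on a single label, not on a full domain
name. A minimal sketch of calling them directly, assuming Python 3
(where \function{ToASCII()} returns a \class{bytes} object) and reusing
one label from the example domain above:

```python
from encodings import idna

prepped = idna.nameprep("Alliancefran\u00e7aise")  # case-folded, normalized
ace_label = idna.ToASCII(prepped)                   # ACE form of the label
unicode_label = idna.ToUnicode(ace_label)           # and back again
```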

\subsection{\module{encodings.utf_8_sig} ---
            UTF-8 codec with BOM signature}
\declaremodule{standard}{encodings.utf-8-sig}   % XXX utf_8_sig gives TeX errors
\modulesynopsis{UTF-8 codec with BOM signature}
\moduleauthor{Walter D\"orwald}{}

\versionadded{2.5}

This module implements a variant of the UTF-8 codec: On encoding, a
UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
the stateful encoder this is only done once (on the first write to the
byte stream). On decoding, an optional UTF-8 encoded BOM at the start
of the data will be skipped.
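
The once-only behaviour of the stateful encoder can be observed with
the incremental codec classes. The sketch assumes Python 3 syntax; the
two text chunks are arbitrary.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()
first = encoder.encode("spam")    # BOM is emitted with the first chunk
second = encoder.encode("eggs")   # later chunks carry no BOM

decoder = codecs.getincrementaldecoder("utf-8-sig")()
text = decoder.decode(first) + decoder.decode(second, final=True)
```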