\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec and error handling lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a \class{CodecInfo} object having the following attributes:

\begin{itemize}
  \item \code{name} The name of the encoding;
  \item \code{encoder} The stateless encoding function;
  \item \code{decoder} The stateless decoding function;
  \item \code{incrementalencoder} An incremental encoder class or factory function;
  \item \code{incrementaldecoder} An incremental decoder class or factory function;
  \item \code{streamwriter} A stream writer class or factory function;
  \item \code{streamreader} A stream reader class or factory function.
\end{itemize}

The various functions or classes take the following arguments:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the
  \method{encode()}/\method{decode()} methods of Codec instances (see
  Codec Interface). The functions/methods are expected to work in a
  stateless mode.

  \var{incrementalencoder} and \var{incrementaldecoder}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{IncrementalEncoder} and
  \class{IncrementalDecoder}, respectively. Incremental codecs can maintain
  state.

  \var{streamreader} and \var{streamwriter}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an exception
  in case of an encoding error), \code{'replace'} (replace malformed
  data with a suitable replacement marker, such as \character{?}),
  \code{'ignore'} (ignore malformed data and continue without further
  notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
  character reference (for encoding only)) and \code{'backslashreplace'}
  (replace with backslashed escape sequences (for encoding only)) as
  well as any other error handling name defined via
  \function{register_error()}.

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
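
As a minimal sketch of the registration protocol described above, the
following search function exposes the built-in UTF-8 codec under a made-up
name. The encoding name \code{exampleutf8} and the function
\code{example_search} are assumptions chosen for illustration only. Note that
in current CPython the \class{CodecInfo} constructor takes the stateless
functions as the keyword arguments \code{encode} and \code{decode}.

```python
import codecs


def example_search(name):
    """Hypothetical search function that knows a single made-up codec name."""
    if name != "exampleutf8":
        # Unknown encodings must be reported by returning None.
        return None
    utf8 = codecs.lookup("utf-8")
    # Reuse the UTF-8 machinery for every interface the registry expects.
    return codecs.CodecInfo(
        name="exampleutf8",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )


codecs.register(example_search)

info = codecs.lookup("exampleutf8")
encoded, consumed = info.encode("abc")
```

Once registered, the name can be used wherever an encoding name is accepted,
for example as the argument of a string's \method{encode()} method.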

\begin{funcdesc}{lookup}{encoding}
Looks up the codec info in the Python codec registry and returns a
\class{CodecInfo} object as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no \class{CodecInfo}
object is found, a \exception{LookupError} is raised. Otherwise, the
\class{CodecInfo} object is stored in the cache and returned to the caller.
\end{funcdesc}

To simplify access to the various codecs, the module provides these
additional functions which use \function{lookup()} for the codec
lookup:

\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getincrementalencoder}{encoding}
Look up the codec for the given encoding and return its incremental encoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getincrementaldecoder}{encoding}
Look up the codec for the given encoding and return its incremental decoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental decoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its StreamReader
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its StreamWriter
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}
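
The lookup helpers above can be exercised as follows. This is an
illustrative sketch assuming a Python 3 interpreter, where the stateless
encoder maps text to bytes; the encoding name
\code{no_such_codec_example} is a deliberately invalid, made-up name.

```python
import codecs

# Stateless functions return a (result, length consumed) tuple.
encode = codecs.getencoder("utf-8")
data, chars_consumed = encode("h\u00e9")

decode = codecs.getdecoder("utf-8")
text, bytes_consumed = decode(data)

# The incremental and stream helpers return classes (or factories),
# which still have to be instantiated by the caller.
decoder = codecs.getincrementaldecoder("utf-8")()
reader_factory = codecs.getreader("utf-8")

# An unknown encoding name raises LookupError.
try:
    codecs.getencoder("no_such_codec_example")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```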

\begin{funcdesc}{register_error}{name, error_handler}
Register the error handling function \var{error_handler} under the
name \var{name}. \var{error_handler} will be called during encoding
and decoding in case of an error, when \var{name} is specified as the
errors parameter.

For encoding, \var{error_handler} will be called with a
\exception{UnicodeEncodeError} instance, which contains information about
the location of the error. The error handler must either raise this or
a different exception, or return a tuple with a replacement for the
unencodable part of the input and a position where encoding should
continue. The encoder will encode the replacement and continue encoding
the original input at the specified position. Negative position values
will be treated as being relative to the end of the input string. If the
resulting position is out of bounds, an \exception{IndexError} will be raised.

Decoding and translating work similarly, except that
\exception{UnicodeDecodeError} or \exception{UnicodeTranslateError} will be
passed to the handler and that the replacement from the error handler will
be put into the output directly.
\end{funcdesc}
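
The handler contract described above can be sketched as follows. The handler
name \code{example_underscore} and the function \code{underscore_errors} are
hypothetical; the handler only deals with encoding errors and re-raises
everything else.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.start and exc.end delimit the unencodable part of exc.object.
        replacement = "_" * (exc.end - exc.start)
        # Return the replacement and the position where encoding resumes.
        return replacement, exc.end
    # Decoding and translating errors are not handled in this sketch.
    raise exc


codecs.register_error("example_underscore", underscore_errors)

result = "a\u20acb".encode("ascii", "example_underscore")
```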

\begin{funcdesc}{lookup_error}{name}
Return the error handler previously registered under the name \var{name}.

Raises a \exception{LookupError} in case the handler cannot be found.
\end{funcdesc}

\begin{funcdesc}{strict_errors}{exception}
Implements the \code{strict} error handling.
\end{funcdesc}

\begin{funcdesc}{replace_errors}{exception}
Implements the \code{replace} error handling.
\end{funcdesc}

\begin{funcdesc}{ignore_errors}{exception}
Implements the \code{ignore} error handling.
\end{funcdesc}

\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
Implements the \code{xmlcharrefreplace} error handling.
\end{funcdesc}

\begin{funcdesc}{backslashreplace_errors}{exception}
Implements the \code{backslashreplace} error handling.
\end{funcdesc}
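
To make the effect of the predefined handlers concrete, the sketch below
encodes a single non-ASCII character (the euro sign) to ASCII under each
scheme and also calls \function{replace_errors()} directly. The exception
instance is constructed by hand purely for illustration.

```python
import codecs

text = "\u20ac"  # EURO SIGN, not representable in ASCII

# Encode the same text under each non-strict error handling scheme.
results = {
    name: text.encode("ascii", name)
    for name in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace")
}

# 'strict' raises instead of producing output.
try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# The handler functions can also be called directly with an exception.
exc = UnicodeEncodeError("ascii", "a\u20acb", 1, 2, "illustrative error")
replacement, resume_at = codecs.replace_errors(exc)
```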

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\note{The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.}

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'} which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of file which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
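
The translation performed by \function{EncodedFile()} can be sketched with
an in-memory stream. The example assumes a Python 3 interpreter, where the
strings passed to and returned by the wrapper are byte strings; the choice of
Latin-1 as \var{input} and UTF-8 as \var{output} is arbitrary.

```python
import codecs
import io

# Writing: Latin-1 bytes go in, UTF-8 bytes reach the underlying stream.
raw_out = io.BytesIO()
wrapped_out = codecs.EncodedFile(raw_out, "latin-1", "utf-8")
wrapped_out.write(b"caf\xe9")
stored = raw_out.getvalue()

# Reading: UTF-8 bytes in the underlying stream come back as Latin-1 bytes.
raw_in = io.BytesIO(b"caf\xc3\xa9")
wrapped_in = codecs.EncodedFile(raw_in, "latin-1", "utf-8")
recovered = wrapped_in.read()
```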

\begin{funcdesc}{iterencode}{iterable, encoding\optional{, errors}}
Uses an incremental encoder to iteratively encode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{iterdecode}{iterable, encoding\optional{, errors}}
Uses an incremental decoder to iteratively decode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental decoder.
\versionadded{2.5}
\end{funcdesc}
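
A short sketch of both generators, assuming Python 3 text and byte strings.
The second input list deliberately splits a two-byte UTF-8 sequence across
chunks to show that the incremental decoder carries state between items.

```python
import codecs

# Encode an iterable of text chunks; each yielded item is a bytes object.
encoded_chunks = list(codecs.iterencode(["h", "\u00e9", "llo"], "utf-8"))
encoded = b"".join(encoded_chunks)

# Decode an iterable of byte chunks; the sequence for U+00E9 is split
# between the first and the second chunk.
decoded = "".join(codecs.iterdecode([b"h\xc3", b"\xa9llo"], "utf-8"))
```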

The module also provides the following constants which are useful
for reading and writing to platform dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM_UTF8}
\dataline{BOM_UTF16}
\dataline{BOM_UTF16_BE}
\dataline{BOM_UTF16_LE}
\dataline{BOM_UTF32}
\dataline{BOM_UTF32_BE}
\dataline{BOM_UTF32_LE}
These constants define various encodings of the Unicode byte order mark
(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
used in the stream or file and in UTF-8 as a Unicode signature.
\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
\constant{BOM_UTF16_LE} depending on the platform's native byte order,
\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
The others represent the BOM in UTF-8 and UTF-32 encodings.
\end{datadesc}
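
One possible use of these constants is sniffing the encoding of a byte
string from its leading mark. The helper \code{sniff_bom} below is a
hypothetical function, not part of the module, and it only recognizes data
that actually starts with a BOM.

```python
import codecs

# Longer marks are tested first, because BOM_UTF32_LE begins with the
# same two bytes as BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return (encoding name or None, data with any leading BOM removed)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data
```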


\subsection{Codec Base Classes \label{codec-base-classes}}

The \module{codecs} module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use
in Python.

Each codec has to define four interfaces to make it usable as a codec in
Python: stateless encoder, stateless decoder, stream reader and stream
writer. The stream readers and writers typically reuse the stateless
encoder/decoder to implement the file protocols.

The \class{Codec} class defines the interface for stateless
encoders/decoders.

To simplify and standardize error handling, the \method{encode()} and
\method{decode()} methods may implement different error handling
schemes by providing the \var{errors} string argument. The following
string values are defined and implemented by all standard Python
codecs:

\begin{tableii}{l|l}{code}{Value}{Meaning}
  \lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
                    this is the default.}
  \lineii{'ignore'}{Ignore the character and continue with the next.}
  \lineii{'replace'}{Replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the built-in Unicode codecs on
                     decoding and '?' on encoding.}
  \lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
                     character reference (only for encoding).}
  \lineii{'backslashreplace'}{Replace with backslashed escape sequences
                     (only for encoding).}
\end{tableii}

The set of allowed values can be extended via \function{register_error()}.


\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}[Codec]{encode}{input\optional{, errors}}
  Encodes the object \var{input} and returns a tuple (output object,
  length consumed). While codecs are not restricted to use with Unicode, in
  a Unicode context, encoding converts a Unicode object to a plain string
  using a particular character set encoding (e.g., \code{cp1252} or
  \code{iso-8859-1}).

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamWriter} for codecs which have to keep state in order to
  make encoding efficient.

  The encoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}[Codec]{decode}{input\optional{, errors}}
  Decodes the object \var{input} and returns a tuple (output object,
  length consumed). In a Unicode context, decoding converts a plain string
  encoded using a particular character set encoding to a Unicode object.

  \var{input} must be an object which provides the \code{bf_getreadbuf}
  buffer slot. Python strings, buffer objects and memory mapped files
  are examples of objects providing this slot.

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamReader} for codecs which have to keep state in order to
  make decoding efficient.

  The decoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}
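
A minimal stateless codec following this interface might look like the
sketch below. The class name \code{ExampleCodec} is hypothetical, and
delegating to the module's Latin-1 helper functions is an assumption made to
keep the example short; a real codec would implement its own mapping.

```python
import codecs


class ExampleCodec(codecs.Codec):
    """Hypothetical stateless codec that delegates to the Latin-1 helpers."""

    def encode(self, input, errors="strict"):
        # Returns a (bytes, number of characters consumed) tuple.
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        # Returns a (text, number of bytes consumed) tuple.
        return codecs.latin_1_decode(input, errors)


codec = ExampleCodec()
encoded, chars_consumed = codec.encode("\u00e9")
decoded, bytes_consumed = codec.decode(encoded)

# Zero length input must yield an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```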

The \class{IncrementalEncoder} and \class{IncrementalDecoder} classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function,
but with multiple calls to the \method{encode}/\method{decode} method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the \method{encode}/\method{decode} method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
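
The equivalence stated above can be checked with a small sketch. The chunk
boundaries are chosen (as an assumption for illustration) so that two
multi-byte UTF-8 sequences are split across calls.

```python
import codecs

payload = "h\u00e9llo \u20ac".encode("utf-8")

# Split inside the two-byte and the three-byte sequence.
chunks = [payload[:2], payload[2:8], payload[8:]]

decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
# Signal the end of the input so that buffered data is flushed.
pieces.append(decoder.decode(b"", final=True))
joined = "".join(pieces)

# The stateless decoder applied to the joined input gives the same text.
stateless, consumed = codecs.getdecoder("utf-8")(payload)
```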


\subsubsection{IncrementalEncoder Objects \label{incremental-encoder-objects}}

\versionadded{2.5}

The \class{IncrementalEncoder} class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalEncoder}{\optional{errors}}
  Constructor for an \class{IncrementalEncoder} instance.

  All incremental encoders must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  The \class{IncrementalEncoder} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
    \item \code{'xmlcharrefreplace'} Replace with the appropriate XML
                     character reference.
    \item \code{'backslashreplace'} Replace with backslashed escape sequences.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{IncrementalEncoder} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{encode}{object\optional{, final}}
  Encodes \var{object} (taking the current state of the encoder into account)
  and returns the resulting encoded object. If this is the last call to
  \method{encode}, \var{final} must be true (the default is false).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Reset the encoder to the initial state.
\end{methoddesc}

\begin{methoddesc}{getstate}{}
  Return the current state of the encoder which must be an integer.
  The implementation should make sure that \code{0} is the most common state.
  (States that are more complicated than integers can be converted into an
  integer by marshaling/pickling the state and encoding the bytes of the
  resulting string into an integer).
  \versionadded{3.0}
\end{methoddesc}

\begin{methoddesc}{setstate}{state}
  Set the state of the encoder to \var{state}. \var{state} must be an
  encoder state returned by \method{getstate}.
  \versionadded{3.0}
\end{methoddesc}
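
As an illustrative sketch of encoder state, the built-in UTF-16 incremental
encoder emits a byte order mark only at the start of its output, so the first
call differs from later ones until \method{reset()} is called. The choice of
UTF-16 is an assumption made because its state is easy to observe.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# The first call emits the BOM followed by the encoded character.
first = encoder.encode("a")
# Later calls continue in the same byte order without a BOM.
second = encoder.encode("b", final=True)

# After a reset the encoder behaves like a fresh instance again.
encoder.reset()
again = encoder.encode("a")
```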


\subsubsection{IncrementalDecoder Objects \label{incremental-decoder-objects}}

The \class{IncrementalDecoder} class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalDecoder}{\optional{errors}}
  Constructor for an \class{IncrementalDecoder} instance.

  All incremental decoders must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  The \class{IncrementalDecoder} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{IncrementalDecoder} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{decode}{object\optional{, final}}
  Decodes \var{object} (taking the current state of the decoder into account)
  and returns the resulting decoded object. If this is the last call to
  \method{decode}, \var{final} must be true (the default is false).
  If \var{final} is true the decoder must decode the input completely and must
  flush all buffers. If this isn't possible (e.g. because of incomplete byte
  sequences at the end of the input) it must initiate error handling just like
  in the stateless case (which might raise an exception).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Reset the decoder to the initial state.
\end{methoddesc}

\begin{methoddesc}{getstate}{}
  Return the current state of the decoder. This must be a tuple with two
  items, the first must be the buffer containing the still undecoded input.
  The second must be an integer and can be additional state info.
  (The implementation should make sure that \code{0} is the most common
  additional state info.) If this additional state info is \code{0} it must
  be possible to set the decoder to the state which has no input buffered
  and \code{0} as the additional state info, so that feeding the previously
  buffered input to the decoder returns it to the previous state without
  producing any output. (Additional state info that is more complicated
  than integers can be converted into an integer by marshaling/pickling
  the info and encoding the bytes of the resulting string into an integer.)
  \versionadded{3.0}
\end{methoddesc}

\begin{methoddesc}{setstate}{state}
  Set the state of the decoder to \var{state}. \var{state} must be a
  decoder state returned by \method{getstate}.
  \versionadded{3.0}
\end{methoddesc}
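
The following sketch shows the decoder state round trip and the effect of
\var{final}, using the built-in UTF-8 incremental decoder as an example. The
input is the three-byte encoding of the euro sign, fed in two parts.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Two of the three bytes: nothing can be decoded yet, so they are buffered.
partial = decoder.decode(b"\xe2\x82")
saved = decoder.getstate()  # (undecoded input buffer, additional state info)

# Supplying the last byte completes the character.
completed = decoder.decode(b"\xac")

# Restoring the saved state makes the decoder expect the last byte again.
decoder.setstate(saved)
replayed = decoder.decode(b"\xac")

# With final=True an incomplete sequence triggers error handling.
strict_decoder = codecs.getincrementaldecoder("utf-8")()
try:
    strict_decoder.decode(b"\xe2\x82", final=True)
    truncated_input_raised = False
except UnicodeDecodeError:
    truncated_input_raised = True
```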


The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
  Constructor for a \class{StreamWriter} instance.

  All stream writers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for writing binary
  data.

  The \class{StreamWriter} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
    \item \code{'xmlcharrefreplace'} Replace with the appropriate XML
                     character reference.
    \item \code{'backslashreplace'} Replace with backslashed escape sequences.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{StreamWriter} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{write}{object}
  Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
  Writes the concatenated list of strings to the stream (possibly by
  reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Flushes and resets the codec buffers used for keeping state.

  Calling this method should ensure that the data on the output is put
  into a clean state that allows appending of new fresh data without
  having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
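
A brief sketch of a stream writer in use, assuming an in-memory binary
stream and the built-in ASCII codec. It also shows switching the error
handling through the \code{errors} attribute and calling a method of the
underlying stream through the writer.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, errors="strict")

writer.write("abc")
writer.writelines(["d", "e"])

# Switch the error handling strategy during the writer's lifetime.
writer.errors = "replace"
writer.write("\u00e9")  # not representable in ASCII, written as '?'

writer.reset()
contents = raw.getvalue()

# tell() is not defined by the writer itself; the call is passed on to
# the underlying stream.
position = writer.tell()
```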
Fred Drake602aa772000-10-12 20:50:55 +0000557
558
559\subsubsection{StreamReader Objects \label{stream-reader-objects}}
560
561The \class{StreamReader} class is a subclass of \class{Codec} and
562defines the following methods which every stream reader must define in
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000563order to be compatible with the Python codec registry.
Fred Drake602aa772000-10-12 20:50:55 +0000564
565\begin{classdesc}{StreamReader}{stream\optional{, errors}}
566 Constructor for a \class{StreamReader} instance.
567
568 All stream readers must provide this constructor interface. They are
569 free to add additional keyword arguments, but only the ones defined
570 here are used by the Python codec registry.
571
572 \var{stream} must be a file-like object open for reading (binary)
573 data.
574
575 The \class{StreamReader} may implement different error handling
576 schemes by providing the \var{errors} keyword argument. These
577 parameters are defined:
578
579 \begin{itemize}
580 \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
581 this is the default.
582 \item \code{'ignore'} Ignore the character and continue with the next.
583 \item \code{'replace'} Replace with a suitable replacement character.
584 \end{itemize}
Walter Dörwald430b1562002-11-07 22:33:17 +0000585
586 The \var{errors} argument will be assigned to an attribute of the
587 same name. Assigning to this attribute makes it possible to switch
588 between different error handling strategies during the lifetime
589 of the \class{StreamReader} object.
590
591 The set of allowed values for the \var{errors} argument can
592 be extended with \function{register_error()}.
Fred Drake602aa772000-10-12 20:50:55 +0000593\end{classdesc}
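
The effect of the \var{errors} argument and attribute can be sketched as
follows. This is only an illustration, written under the assumption of a
current Python 3 interpreter (where the decoded result is a \class{str}
rather than a \class{unicode} object); the concrete reader class is
obtained through \function{codecs.getreader()}, and the names
\code{AsciiReader}, \code{replaced}, \code{ignored} and
\code{strict_failed} are made up for the example.

```python
import codecs
import io

# getreader() returns the StreamReader class registered for an encoding.
AsciiReader = codecs.getreader("ascii")

# 'replace' substitutes U+FFFD for each undecodable byte.
reader = AsciiReader(io.BytesIO(b"abc\xffdef"), errors="replace")
replaced = reader.read()

# The errors attribute can be reassigned during the reader's lifetime.
reader = AsciiReader(io.BytesIO(b"abc\xffdef"))   # default is 'strict'
reader.errors = "ignore"
ignored = reader.read()

# With 'strict', a decoding error raises a ValueError subclass.
try:
    AsciiReader(io.BytesIO(b"abc\xffdef")).read()
    strict_failed = False
except ValueError:
    strict_failed = True
```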
594
\begin{methoddesc}{read}{\optional{size\optional{, chars\optional{, firstline}}}}
Fred Drake602aa772000-10-12 20:50:55 +0000596 Decodes data from the stream and returns the resulting object.
597
  \var{chars} indicates the number of characters to read from the
  stream. \function{read()} will never return more than \var{chars}
  characters, but it might return fewer if there are not enough
  characters available.

  \var{size} indicates the approximate maximum number of bytes to read
  from the stream for decoding purposes. The decoder can modify this
  setting as appropriate. The default value of -1 means to read and
  decode as much as possible. \var{size} is intended to prevent having
  to decode huge files in one step.

  \var{firstline} indicates that it is sufficient to return only
  the first line if there are decoding errors on later lines.

  The method should use a greedy read strategy, meaning that it should
  read as much data as is allowed within the definition of the encoding
  and the given size; e.g.\ if optional encoding endings or state
  markers are available on the stream, these should be read too.
Walter Dörwald69652032004-09-07 20:24:22 +0000616
617 \versionchanged[\var{chars} argument added]{2.4}
Martin v. Löwis56066d22005-08-24 07:38:12 +0000618 \versionchanged[\var{firstline} argument added]{2.4.2}
Fred Drake602aa772000-10-12 20:50:55 +0000619\end{methoddesc}
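
A short sketch of how \var{chars} limits the result of
\method{read()}, independently of how many bytes each character
occupies. It assumes a current Python 3 interpreter and a UTF-8 stream
reader obtained from \function{codecs.getreader()}; the variable names
are hypothetical.

```python
import codecs
import io

# 11 characters, two of which need two bytes each in UTF-8.
data = "h\u00e9llo w\u00f6rld".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(data))

first = reader.read(chars=2)    # at most two characters
second = reader.read(chars=3)   # continues where the previous call stopped
rest = reader.read()            # default: read and decode whatever is left
```

Counting in characters rather than bytes means that a multi-byte
sequence is never split between two calls.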
620
Walter Dörwald69652032004-09-07 20:24:22 +0000621\begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
Fred Drake602aa772000-10-12 20:50:55 +0000622 Read one line from the input stream and return the
623 decoded data.
624
Fred Drake602aa772000-10-12 20:50:55 +0000625 \var{size}, if given, is passed as size argument to the stream's
626 \method{readline()} method.
Walter Dörwald69652032004-09-07 20:24:22 +0000627
  If \var{keepends} is false, line endings will be stripped from the
  lines returned.
630
631 \versionchanged[\var{keepends} argument added]{2.4}
Fred Drake602aa772000-10-12 20:50:55 +0000632\end{methoddesc}
633
Walter Dörwald69652032004-09-07 20:24:22 +0000634\begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000635 Read all lines available on the input stream and return them as a list
Fred Drake602aa772000-10-12 20:50:55 +0000636 of lines.
637
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000638 Line-endings are implemented using the codec's decoder method and are
Walter Dörwald69652032004-09-07 20:24:22 +0000639 included in the list entries if \var{keepends} is true.
Fred Drake602aa772000-10-12 20:50:55 +0000640
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000641 \var{sizehint}, if given, is passed as the \var{size} argument to the
Fred Drake602aa772000-10-12 20:50:55 +0000642 stream's \method{read()} method.
643\end{methoddesc}
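
The \var{keepends} argument of \method{readline()} and
\method{readlines()} can be illustrated like this. The sketch assumes a
current Python 3 interpreter and a UTF-8 reader from
\function{codecs.getreader()}; it relies on \var{keepends} defaulting
to true in that implementation.

```python
import codecs
import io

data = b"one\ntwo\nthree\n"

reader = codecs.getreader("utf-8")(io.BytesIO(data))
with_ending = reader.readline()                    # line ending kept
without_ending = reader.readline(keepends=False)   # line ending stripped

# A fresh reader returns every line at once.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
all_lines = reader.readlines(keepends=False)
```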
644
645\begin{methoddesc}{reset}{}
646 Resets the codec buffers used for keeping state.
647
648 Note that no stream repositioning should take place. This method is
649 primarily intended to be able to recover from decoding errors.
650\end{methoddesc}
651
652In addition to the above methods, the \class{StreamReader} must also
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000653inherit all other methods and attributes from the underlying stream.
Fred Drake602aa772000-10-12 20:50:55 +0000654
The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.
657
658
659\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
660
661The \class{StreamReaderWriter} allows wrapping streams which work in
662both read and write modes.
663
664The design is such that one can use the factory functions returned by
665the \function{lookup()} function to construct the instance.
666
667\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
668 Creates a \class{StreamReaderWriter} instance.
669 \var{stream} must be a file-like object.
  \var{Reader} and \var{Writer} must be factory functions or classes
  providing the \class{StreamReader} and \class{StreamWriter} interface
  respectively.
673 Error handling is done in the same way as defined for the
674 stream readers and writers.
675\end{classdesc}
676
677\class{StreamReaderWriter} instances define the combined interfaces of
678\class{StreamReader} and \class{StreamWriter} classes. They inherit
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000679all other methods and attributes from the underlying stream.
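
As a minimal sketch of the design described above, the factories
returned by \function{lookup()} can be passed straight to the
constructor. The example assumes a current Python 3 interpreter and an
in-memory \class{io.BytesIO} object standing in for a real file; the
variable names are hypothetical.

```python
import codecs
import io

info = codecs.lookup("utf-8")
stream = io.BytesIO()

wrapped = codecs.StreamReaderWriter(
    stream, info.streamreader, info.streamwriter, "strict")

wrapped.write("gr\u00fc\u00dfe")   # encoded to UTF-8 on the way in
encoded = stream.getvalue()

stream.seek(0)                     # rewind the underlying stream
decoded = wrapped.read()           # decoded from UTF-8 on the way out
```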
Fred Drake602aa772000-10-12 20:50:55 +0000680
681
682\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
683
The \class{StreamRecoder} provides a frontend-backend view of
encoding data, which is sometimes useful when dealing with different
encoding environments.
687
688The design is such that one can use the factory functions returned by
689the \function{lookup()} function to construct the instance.
690
691\begin{classdesc}{StreamRecoder}{stream, encode, decode,
692 Reader, Writer, errors}
  Creates a \class{StreamRecoder} instance which implements a two-way
  conversion: \var{encode} and \var{decode} work on the frontend (the
  data visible to code calling \method{read()} and \method{write()})
  while \var{Reader} and \var{Writer} work on the backend (the data
  read from and written to the stream).
698
699 You can use these objects to do transparent direct recodings from
700 e.g.\ Latin-1 to UTF-8 and back.
701
702 \var{stream} must be a file-like object.
703
  \var{encode} and \var{decode} must adhere to the \class{Codec}
  interface. \var{Reader} and \var{Writer} must be factory functions or
  classes providing objects of the \class{StreamReader} and
  \class{StreamWriter} interface respectively.

  \var{encode} and \var{decode} are needed for the frontend
  translation, \var{Reader} and \var{Writer} for the backend
  translation. The intermediate format used is determined by the two
  sets of codecs, e.g.\ the Unicode codecs will use Unicode as the
  intermediate encoding.
714
715 Error handling is done in the same way as defined for the
716 stream readers and writers.
717\end{classdesc}
718
719\class{StreamRecoder} instances define the combined interfaces of
720\class{StreamReader} and \class{StreamWriter} classes. They inherit
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000721all other methods and attributes from the underlying stream.
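
The Latin-1 to UTF-8 recoding mentioned above might be set up as in the
following sketch. It assumes a current Python 3 interpreter, where the
frontend data are \class{bytes} objects, and uses an in-memory
\class{io.BytesIO} object as the backend stream; \code{frontend},
\code{backend} and the other variable names are hypothetical.

```python
import codecs
import io

frontend = codecs.lookup("latin-1")   # encoding seen by the caller
backend = codecs.lookup("utf-8")      # encoding stored in the stream

stream = io.BytesIO()
recoder = codecs.StreamRecoder(
    stream,
    frontend.encode, frontend.decode,
    backend.streamreader, backend.streamwriter,
    "strict")

recoder.write(b"caf\xe9")             # Latin-1 bytes from the caller
stored = stream.getvalue()            # UTF-8 bytes in the stream

stream.seek(0)
read_back = recoder.read()            # Latin-1 bytes again
```

Here Unicode is the intermediate format: \method{write()} decodes the
frontend data before the writer encodes it, and \method{read()} does
the reverse.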
Fred Drake602aa772000-10-12 20:50:55 +0000722
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000723\subsection{Encodings and Unicode\label{encodings-overview}}
724
Unicode strings are stored internally as sequences of codepoints (to
be precise, as \ctype{Py_UNICODE} arrays). Depending on the way Python is
compiled (either via \longprogramopt{enable-unicode=ucs2} or
\longprogramopt{enable-unicode=ucs4}, with the former being the default),
\ctype{Py_UNICODE} is either a 16-bit or
32-bit data type. Once a Unicode object is used outside of CPU and
memory, CPU endianness and how these arrays are stored as bytes become
an issue. Transforming a Unicode object into a sequence of bytes is
called encoding, and recreating the Unicode object from the sequence of
bytes is known as decoding. There are many different methods for how this
transformation can be done (these methods are also called encodings).
The simplest method is to map the codepoints 0-255 to the bytes
\code{0x0}-\code{0xff}. This means that a Unicode object that contains
codepoints above \code{U+00FF} can't be encoded with this method (which
is called \code{'latin-1'} or \code{'iso-8859-1'}). In that case
\function{unicode.encode()} will raise a \exception{UnicodeEncodeError}
that looks like this: \samp{UnicodeEncodeError: 'latin-1' codec can't
encode character u'\e u1234' in position 3: ordinal not in range(256)}.
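
The Latin-1 behaviour can be tried out directly. The sketch below
assumes a current Python 3 interpreter, where the method is
\method{str.encode()} instead of \function{unicode.encode()}; the
variable names are hypothetical.

```python
# U+00E9 lies within 0-255 and maps to the single byte 0xe9.
encoded = "caf\u00e9".encode("latin-1")

# U+1234 lies above U+00FF, so Latin-1 cannot represent it.
try:
    "abc\u1234".encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc   # exc.start holds the position of the offending character
```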
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000743
There's another group of encodings (the so-called charmap encodings)
that choose a different subset of all Unicode codepoints and how
these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
To see how this is done, simply open e.g.\ \file{encodings/cp1252.py}
(which is an encoding that is used primarily on Windows).
There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
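
The mapping can also be inspected from Python. The following sketch
assumes a current Python 3 interpreter, in which the 256-character
constant of \file{encodings/cp1252.py} is named \code{decoding_table};
that name is an implementation detail and may differ between versions.

```python
import encodings.cp1252

# 256-character string: index = byte value, item = Unicode character.
table = encodings.cp1252.decoding_table

euro_byte = "\u20ac".encode("cp1252")   # EURO SIGN as a cp1252 byte
euro_char = table[0x80]                 # character stored for byte 0x80
```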
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000751
All of these encodings can only encode 256 of the 65536 (or 1114112)
codepoints defined in Unicode. A simple and straightforward way that
can store each Unicode codepoint is to store each codepoint as two
consecutive bytes (codepoints above \code{U+FFFF} then need a pair of
such two-byte units). There are two possibilities: store the bytes in big
endian or in little endian order. These two encodings are called
UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if,
e.g., you use UTF-16-BE on a little endian machine you will always have
to swap bytes on encoding and decoding. UTF-16 avoids this problem:
bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, however, the bytes have to be
swapped. To be able to detect the endianness of a UTF-16 byte sequence,
there's the so-called BOM (the ``Byte Order Mark''). This is the Unicode
character \code{U+FEFF}. This character will be prepended to every UTF-16
byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
an illegal character that may not appear in a Unicode text. So when
the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE}
the bytes have to be swapped on decoding. Unfortunately, up to Unicode
4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
NO-BREAK SPACE}: a character that has no width and doesn't allow a
word to be split. It can e.g.\ be used to give hints to a ligature
algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
this role). Nevertheless Unicode software still must be able to handle
\code{U+FEFF} in both roles: as a BOM it's a device to determine the storage
layout of the encoded bytes, and vanishes once the byte sequence has
been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
it's a normal character that will be decoded like any other.
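
The three UTF-16 variants and the role of the BOM can be compared in a
few lines. This is a sketch for a current Python 3 interpreter; the
constants \code{codecs.BOM_UTF16_LE} and \code{codecs.BOM_UTF16_BE} are
part of the \module{codecs} module, the variable names are hypothetical.

```python
import codecs

text = "a"

big = text.encode("utf-16-be")      # no BOM, big endian
little = text.encode("utf-16-le")   # no BOM, little endian
native = text.encode("utf-16")      # BOM followed by native byte order

# The BOM tells the decoder which byte order was used.
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
```

Which of the two BOMs starts \code{native} depends on the machine the
code runs on.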
779
There's another encoding that is able to encode the full range of
Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
there are no issues with byte order in UTF-8. Each byte in a UTF-8
byte sequence consists of two parts: marker bits (the most significant
bits) and payload bits. The marker bits are a sequence of zero to six
1 bits followed by a 0 bit. Unicode characters are encoded like this
(with x being payload bits, which when concatenated give the Unicode
character):
788
Walter Dörwaldb075fce2006-02-21 18:51:32 +0000789\begin{tableii}{l|l}{textrm}{Range}{Encoding}
Georg Brandl131e4f72006-01-23 21:33:48 +0000790\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
791\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
792\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
793\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
794\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
795\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000796\end{tableii}
797
798The least significant bit of the Unicode character is the rightmost x
799bit.
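
The first four rows of the table can be turned into code almost
literally. The helper \code{utf8_bytes()} below is a hypothetical
illustration for a current Python 3 interpreter, not the codec's actual
implementation; it stops at \code{U+10FFFF}, the highest Unicode
codepoint, and does not treat surrogates specially.

```python
def utf8_bytes(cp):
    """Encode one codepoint by filling payload bits into the marker patterns."""
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("codepoint out of range")

# One sample codepoint per row of the table.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
matches = [utf8_bytes(cp) == chr(cp).encode("utf-8") for cp in samples]
```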
800
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000801As UTF-8 is an 8-bit encoding no BOM is required and any \code{U+FEFF}
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000802character in the decoded Unicode string (even if it's the first
Georg Brandl131e4f72006-01-23 21:33:48 +0000803character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000804
Without external information it's impossible to reliably determine
which encoding was used for encoding a Unicode string. Each charmap
encoding can decode any random byte sequence. However, that's not
possible with UTF-8, as UTF-8 byte sequences have a structure that
doesn't allow arbitrary byte sequences. To increase the reliability
with which a UTF-8 encoding can be detected, Microsoft invented a
variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
program: Before any of the Unicode characters is written to the file,
a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
charmap encoded file starts with these byte values (which would e.g.\ map to
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000816
Georg Brandl131e4f72006-01-23 21:33:48 +0000817 LATIN SMALL LETTER I WITH DIAERESIS \\
818 RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000819 INVERTED QUESTION MARK
820
821in iso-8859-1), this increases the probability that a utf-8-sig
822encoding can be correctly guessed from the byte sequence. So here the
823BOM is not used to be able to determine the byte order used for
824generating the byte sequence, but as a signature that helps in
825guessing the encoding. On encoding the utf-8-sig codec will write
Georg Brandl131e4f72006-01-23 21:33:48 +0000826\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
827On decoding utf-8-sig will skip those three bytes if they appear as the
828first three bytes in the file.
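
The difference between the two codecs shows up in a short experiment.
The sketch assumes a current Python 3 interpreter;
\code{codecs.BOM_UTF8} is the module's constant for the three signature
bytes, the variable names are hypothetical.

```python
import codecs

with_sig = "abc".encode("utf-8-sig")
plain = "abc".encode("utf-8")

skipped = with_sig.decode("utf-8-sig")   # signature removed
optional = plain.decode("utf-8-sig")     # signature is optional on decoding
kept = with_sig.decode("utf-8")          # plain UTF-8 keeps U+FEFF
```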
Martin v. Löwis412ed3b2006-01-08 10:45:39 +0000829
830
Skip Montanaroecf7a522004-07-01 19:26:04 +0000831\subsection{Standard Encodings\label{standard-encodings}}
Martin v. Löwis5c37a772002-12-31 12:39:07 +0000832
Thomas Woutersd4ec0c32006-04-21 16:44:05 +0000833Python comes with a number of codecs built-in, either implemented as C
834functions or with dictionaries as mapping tables. The following table
Martin v. Löwis5c37a772002-12-31 12:39:07 +0000835lists the codecs by name, together with a few common aliases, and the
836languages for which the encoding is likely used. Neither the list of
837aliases nor the list of languages is meant to be exhaustive. Notice
838that spelling alternatives that only differ in case or use a hyphen
839instead of an underscore are also valid aliases.
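
That spelling variants and aliases resolve to the same codec can be
checked with \function{lookup()}. The sketch assumes a current Python 3
interpreter, where the returned \class{CodecInfo} object has a
\code{name} attribute; \code{spellings} and \code{names} are
hypothetical names.

```python
import codecs

# Case, hyphen/underscore variants, and two of the listed aliases.
spellings = ["utf_8", "UTF-8", "utf8", "U8"]
names = {codecs.lookup(s).name for s in spellings}
```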
840
841Many of the character sets support the same languages. They vary in
842individual characters (e.g. whether the EURO SIGN is supported or
843not), and in the assignment of characters to code positions. For the
844European languages in particular, the following variants typically
845exist:
846
847\begin{itemize}
848\item an ISO 8859 codeset
\item a Microsoft Windows code page, which is typically derived from
      an 8859 codeset, but replaces control characters with additional
      graphic characters
852\item an IBM EBCDIC code page
Fred Draked4be7472003-04-30 15:02:07 +0000853\item an IBM PC code page, which is \ASCII{} compatible
Martin v. Löwis5c37a772002-12-31 12:39:07 +0000854\end{itemize}
855
856\begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}
857
858\lineiii{ascii}
859 {646, us-ascii}
860 {English}
861
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +0000862\lineiii{big5}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +0000863 {big5-tw, csbig5}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +0000864 {Traditional Chinese}
865
Hye-Shik Chang2bb146f2004-07-18 03:06:29 +0000866\lineiii{big5hkscs}
867 {big5-hkscs, hkscs}
868 {Traditional Chinese}
869
Martin v. Löwis5c37a772002-12-31 12:39:07 +0000870\lineiii{cp037}
871 {IBM037, IBM039}
872 {English}
873
874\lineiii{cp424}
875 {EBCDIC-CP-HE, IBM424}
876 {Hebrew}
877
878\lineiii{cp437}
879 {437, IBM437}
880 {English}
881
882\lineiii{cp500}
883 {EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
884 {Western Europe}
885
886\lineiii{cp737}
887 {}
888 {Greek}
889
890\lineiii{cp775}
891 {IBM775}
892 {Baltic languages}
893
894\lineiii{cp850}
895 {850, IBM850}
896 {Western Europe}
897
898\lineiii{cp852}
899 {852, IBM852}
900 {Central and Eastern Europe}
901
902\lineiii{cp855}
903 {855, IBM855}
904 {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
905
906\lineiii{cp856}
907 {}
908 {Hebrew}
909
910\lineiii{cp857}
911 {857, IBM857}
912 {Turkish}
913
914\lineiii{cp860}
915 {860, IBM860}
916 {Portuguese}
917
918\lineiii{cp861}
919 {861, CP-IS, IBM861}
920 {Icelandic}
921
922\lineiii{cp862}
923 {862, IBM862}
924 {Hebrew}
925
926\lineiii{cp863}
927 {863, IBM863}
928 {Canadian}
929
930\lineiii{cp864}
931 {IBM864}
932 {Arabic}
933
934\lineiii{cp865}
935 {865, IBM865}
936 {Danish, Norwegian}
937
Skip Montanaro78bace72004-07-02 02:14:34 +0000938\lineiii{cp866}
939 {866, IBM866}
940 {Russian}
941
Martin v. Löwis5c37a772002-12-31 12:39:07 +0000942\lineiii{cp869}
943 {869, CP-GR, IBM869}
944 {Greek}
945
946\lineiii{cp874}
947 {}
948 {Thai}
949
950\lineiii{cp875}
951 {}
952 {Greek}
953
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +0000954\lineiii{cp932}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +0000955 {932, ms932, mskanji, ms-kanji}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +0000956 {Japanese}
957
958\lineiii{cp949}
959 {949, ms949, uhc}
960 {Korean}
961
962\lineiii{cp950}
963 {950, ms950}
964 {Traditional Chinese}
965
Martin v. Löwis5c37a772002-12-31 12:39:07 +0000966\lineiii{cp1006}
967 {}
968 {Urdu}
969
970\lineiii{cp1026}
971 {ibm1026}
972 {Turkish}
973
974\lineiii{cp1140}
975 {ibm1140}
976 {Western Europe}
977
978\lineiii{cp1250}
979 {windows-1250}
980 {Central and Eastern Europe}
981
982\lineiii{cp1251}
983 {windows-1251}
984 {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
985
986\lineiii{cp1252}
987 {windows-1252}
988 {Western Europe}
989
990\lineiii{cp1253}
991 {windows-1253}
992 {Greek}
993
994\lineiii{cp1254}
995 {windows-1254}
996 {Turkish}
997
998\lineiii{cp1255}
999 {windows-1255}
1000 {Hebrew}
1001
1002\lineiii{cp1256}
        {windows-1256}
1004 {Arabic}
1005
1006\lineiii{cp1257}
1007 {windows-1257}
1008 {Baltic languages}
1009
1010\lineiii{cp1258}
1011 {windows-1258}
1012 {Vietnamese}
1013
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001014\lineiii{euc_jp}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001015 {eucjp, ujis, u-jis}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001016 {Japanese}
1017
Hye-Shik Chang2bb146f2004-07-18 03:06:29 +00001018\lineiii{euc_jis_2004}
1019 {jisx0213, eucjis2004}
1020 {Japanese}
1021
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001022\lineiii{euc_jisx0213}
Hye-Shik Chang2bb146f2004-07-18 03:06:29 +00001023 {eucjisx0213}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001024 {Japanese}
1025
1026\lineiii{euc_kr}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001027 {euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001028 {Korean}
1029
1030\lineiii{gb2312}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001031 {chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
1032 gb2312-80, iso-ir-58}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001033 {Simplified Chinese}
1034
1035\lineiii{gbk}
1036 {936, cp936, ms936}
1037 {Unified Chinese}
1038
1039\lineiii{gb18030}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001040 {gb18030-2000}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001041 {Unified Chinese}
1042
1043\lineiii{hz}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001044 {hzgb, hz-gb, hz-gb-2312}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001045 {Simplified Chinese}
1046
1047\lineiii{iso2022_jp}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001048 {csiso2022jp, iso2022jp, iso-2022-jp}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001049 {Japanese}
1050
1051\lineiii{iso2022_jp_1}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001052 {iso2022jp-1, iso-2022-jp-1}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001053 {Japanese}
1054
1055\lineiii{iso2022_jp_2}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001056 {iso2022jp-2, iso-2022-jp-2}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001057 {Japanese, Korean, Simplified Chinese, Western Europe, Greek}
1058
Hye-Shik Chang2bb146f2004-07-18 03:06:29 +00001059\lineiii{iso2022_jp_2004}
1060 {iso2022jp-2004, iso-2022-jp-2004}
1061 {Japanese}
1062
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001063\lineiii{iso2022_jp_3}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001064 {iso2022jp-3, iso-2022-jp-3}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001065 {Japanese}
1066
1067\lineiii{iso2022_jp_ext}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001068 {iso2022jp-ext, iso-2022-jp-ext}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001069 {Japanese}
1070
1071\lineiii{iso2022_kr}
Hye-Shik Chang910d8f12004-07-17 14:44:43 +00001072 {csiso2022kr, iso2022kr, iso-2022-kr}
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001073 {Korean}
1074
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001075\lineiii{latin_1}
1076 {iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
        {Western Europe}
1078
1079\lineiii{iso8859_2}
1080 {iso-8859-2, latin2, L2}
1081 {Central and Eastern Europe}
1082
1083\lineiii{iso8859_3}
1084 {iso-8859-3, latin3, L3}
1085 {Esperanto, Maltese}
1086
1087\lineiii{iso8859_4}
1088 {iso-8859-4, latin4, L4}
        {Baltic languages}
1090
1091\lineiii{iso8859_5}
1092 {iso-8859-5, cyrillic}
1093 {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
1094
1095\lineiii{iso8859_6}
1096 {iso-8859-6, arabic}
1097 {Arabic}
1098
1099\lineiii{iso8859_7}
1100 {iso-8859-7, greek, greek8}
1101 {Greek}
1102
1103\lineiii{iso8859_8}
1104 {iso-8859-8, hebrew}
1105 {Hebrew}
1106
1107\lineiii{iso8859_9}
1108 {iso-8859-9, latin5, L5}
1109 {Turkish}
1110
1111\lineiii{iso8859_10}
1112 {iso-8859-10, latin6, L6}
1113 {Nordic languages}
1114
1115\lineiii{iso8859_13}
1116 {iso-8859-13}
1117 {Baltic languages}
1118
1119\lineiii{iso8859_14}
1120 {iso-8859-14, latin8, L8}
1121 {Celtic languages}
1122
1123\lineiii{iso8859_15}
1124 {iso-8859-15}
1125 {Western Europe}
1126
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001127\lineiii{johab}
1128 {cp1361, ms1361}
1129 {Korean}
1130
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001131\lineiii{koi8_r}
1132 {}
1133 {Russian}
1134
1135\lineiii{koi8_u}
1136 {}
1137 {Ukrainian}
1138
1139\lineiii{mac_cyrillic}
1140 {maccyrillic}
1141 {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
1142
1143\lineiii{mac_greek}
1144 {macgreek}
1145 {Greek}
1146
1147\lineiii{mac_iceland}
1148 {maciceland}
1149 {Icelandic}
1150
1151\lineiii{mac_latin2}
1152 {maclatin2, maccentraleurope}
1153 {Central and Eastern Europe}
1154
1155\lineiii{mac_roman}
1156 {macroman}
1157 {Western Europe}
1158
1159\lineiii{mac_turkish}
1160 {macturkish}
1161 {Turkish}
1162
Hye-Shik Chang5c5316f2004-03-19 08:06:07 +00001163\lineiii{ptcp154}
1164 {csptcp154, pt154, cp154, cyrillic-asian}
1165 {Kazakh}
1166
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001167\lineiii{shift_jis}
1168 {csshiftjis, shiftjis, sjis, s_jis}
1169 {Japanese}
1170
Hye-Shik Chang2bb146f2004-07-18 03:06:29 +00001171\lineiii{shift_jis_2004}
1172 {shiftjis2004, sjis_2004, sjis2004}
1173 {Japanese}
1174
Hye-Shik Chang3e2a3062004-01-17 14:29:29 +00001175\lineiii{shift_jisx0213}
1176 {shiftjisx0213, sjisx0213, s_jisx0213}
1177 {Japanese}
1178
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001179\lineiii{utf_16}
1180 {U16, utf16}
1181 {all languages}
1182
1183\lineiii{utf_16_be}
1184 {UTF-16BE}
1185 {all languages (BMP only)}
1186
1187\lineiii{utf_16_le}
1188 {UTF-16LE}
1189 {all languages (BMP only)}
1190
1191\lineiii{utf_7}
Walter Dörwald007f8df2005-10-09 19:42:27 +00001192 {U7, unicode-1-1-utf-7}
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001193 {all languages}
1194
1195\lineiii{utf_8}
1196 {U8, UTF, utf8}
1197 {all languages}
1198
Martin v. Löwis412ed3b2006-01-08 10:45:39 +00001199\lineiii{utf_8_sig}
1200 {}
1201 {all languages}
1202
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001203\end{longtableiii}
1204
1205A number of codecs are specific to Python, so their codec names have
1206no meaning outside Python. Some of them don't convert from Unicode
1207strings to byte strings, but instead use the property of the Python
1208codecs machinery that any bijective function with one argument can be
1209considered as an encoding.
1210
1211For the codecs listed below, the result in the ``encoding'' direction
1212is always a byte string. The result of the ``decoding'' direction is
1213listed as operand type in the table.
1214
1215\begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}
1216
Martin v. Löwis2548c732003-04-18 10:39:54 +00001217\lineiv{idna}
1218 {}
1219 {Unicode string}
Guido van Rossumd8faa362007-04-27 19:54:29 +00001220 {Implements \rfc{3490},
1221 see also \refmodule{encodings.idna}}
Martin v. Löwis2548c732003-04-18 10:39:54 +00001222
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001223\lineiv{mbcs}
1224 {dbcs}
1225 {Unicode string}
1226 {Windows only: Encode operand according to the ANSI codepage (CP_ACP)}
1227
1228\lineiv{palmos}
1229 {}
1230 {Unicode string}
1231 {Encoding of PalmOS 3.5}
1232
Martin v. Löwis2548c732003-04-18 10:39:54 +00001233\lineiv{punycode}
1234 {}
1235 {Unicode string}
Guido van Rossumd8faa362007-04-27 19:54:29 +00001236 {Implements \rfc{3492}}
Martin v. Löwis2548c732003-04-18 10:39:54 +00001237
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001238\lineiv{raw_unicode_escape}
1239 {}
1240 {Unicode string}
Fred Draked4be7472003-04-30 15:02:07 +00001241 {Produce a string that is suitable as raw Unicode literal in
1242 Python source code}
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001243
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001244\lineiv{undefined}
1245 {}
1246 {any}
Georg Brandl8f4b4db2006-03-09 10:16:42 +00001247 {Raise an exception for all conversions. Can be used as the
Fred Draked4be7472003-04-30 15:02:07 +00001248 system encoding if no automatic coercion between byte and
Walter Dörwald42748a82007-06-12 16:40:17 +00001249 Unicode strings is desired.}
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001250
1251\lineiv{unicode_escape}
1252 {}
1253 {Unicode string}
Fred Draked4be7472003-04-30 15:02:07 +00001254 {Produce a string that is suitable as Unicode literal in
1255 Python source code}
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001256
1257\lineiv{unicode_internal}
1258 {}
1259 {Unicode string}
Raymond Hettinger68804312005-01-01 00:28:46 +00001260 {Return the internal representation of the operand}
Martin v. Löwis5c37a772002-12-31 12:39:07 +00001261\end{tableiv}
Martin v. Löwis2548c732003-04-18 10:39:54 +00001262
Guido van Rossumd8faa362007-04-27 19:54:29 +00001263\versionadded[The \code{idna} and \code{punycode} encodings]{2.3}
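
The two escape codecs from the table behave as sketched below. The
example assumes a current Python 3 interpreter, where the operand is a
\class{str} and the encoded result a \class{bytes} object; the variable
names are hypothetical.

```python
text = "\u00e9\u1234\n"

# Everything outside printable ASCII becomes a backslash escape.
escaped = text.encode("unicode_escape")

# Only codepoints above U+00FF are escaped; the rest is passed
# through as Latin-1 bytes, as in a raw Unicode literal.
raw_escaped = text.encode("raw_unicode_escape")
```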
1264
Martin v. Löwis2548c732003-04-18 10:39:54 +00001265\subsection{\module{encodings.idna} ---
1266 Internationalized Domain Names in Applications}
1267
1268\declaremodule{standard}{encodings.idna}
1269\modulesynopsis{Internationalized Domain Names implementation}
Fred Draked4be7472003-04-30 15:02:07 +00001270% XXX The next line triggers a formatting bug, so it's commented out
1271% until that can be fixed.
1272%\moduleauthor{Martin v. L\"owis}
1273
1274\versionadded{2.3}
Martin v. Löwis2548c732003-04-18 10:39:54 +00001275
This module implements \rfc{3490} (Internationalized Domain Names in
Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the
\code{punycode} encoding and \refmodule{stringprep}.
Martin v. Löwis2548c732003-04-18 10:39:54 +00001280
Fred Draked4be7472003-04-30 15:02:07 +00001281These RFCs together define a protocol to support non-\ASCII{} characters
1282in domain names. A domain name containing non-\ASCII{} characters (such
Fred Draked24c7672003-07-16 05:17:23 +00001283as ``www.Alliancefran\c caise.nu'') is converted into an
Fred Draked4be7472003-04-30 15:02:07 +00001284\ASCII-compatible encoding (ACE, such as
Martin v. Löwis2548c732003-04-18 10:39:54 +00001285``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
1286is then used in all places where arbitrary characters are not allowed
Fred Draked4be7472003-04-30 15:02:07 +00001287by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
on. This conversion is carried out in the application, if possible
invisibly to the user: the application should transparently convert
Unicode domain labels to IDNA on the wire, and convert ACE labels back
to Unicode before presenting them to the user.
1292
Python supports this conversion in several ways: The \code{idna} codec
converts between Unicode and the ACE. Furthermore, the
\refmodule{socket} module transparently converts Unicode host names to
ACE, so that applications need not be concerned about converting host
names themselves when they pass them to the socket module. On top of
that, modules that have host names as function parameters, such as
\refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
(\refmodule{httplib} then also transparently sends an IDNA hostname in
the \mailheader{Host} field if it sends that field at all).
Martin v. Löwis2548c732003-04-18 10:39:54 +00001302
1303When receiving host names from the wire (such as in reverse name
1304lookup), no automatic conversion to Unicode is performed: Applications
1305wishing to present such host names to the user should decode them to
1306Unicode.
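
Both directions of the conversion can be carried out with the
\code{idna} codec, as in this sketch for a current Python 3
interpreter. It reuses the domain name from the example above; the
variable names are hypothetical.

```python
name = "www.Alliancefran\u00e7aise.nu"

ace = name.encode("idna")     # ACE form for use on the wire
shown = ace.decode("idna")    # back to Unicode for display
```

Note that the round trip does not restore the original capitalization,
because the non-\ASCII{} label is nameprepped before it is encoded.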
1307
1308The module \module{encodings.idna} also implements the nameprep
1309procedure, which performs certain normalizations on host names, to
1310achieve case-insensitivity of international domain names, and to unify
1311similar characters. The nameprep functions can be used directly if
1312desired.
1313
1314\begin{funcdesc}{nameprep}{label}
1315Return the nameprepped version of \var{label}. The implementation
1316currently assumes query strings, so \code{AllowUnassigned} is
1317true.
1318\end{funcdesc}
1319
Raymond Hettingerb5155e32003-06-18 01:58:31 +00001320\begin{funcdesc}{ToASCII}{label}
Fred Draked4be7472003-04-30 15:02:07 +00001321Convert a label to \ASCII, as specified in \rfc{3490}.
Martin v. Löwis2548c732003-04-18 10:39:54 +00001322\code{UseSTD3ASCIIRules} is assumed to be false.
1323\end{funcdesc}
1324
1325\begin{funcdesc}{ToUnicode}{label}
1326Convert a label to Unicode, as specified in \rfc{3490}.
1327\end{funcdesc}
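
The three functions operate on a single label, that is, on one
dot-separated component of a host name. The sketch below assumes a
current Python 3 interpreter, where \function{ToASCII()} returns a
\class{bytes} object; the variable names are hypothetical.

```python
from encodings.idna import nameprep, ToASCII, ToUnicode

prepped = nameprep("Alliancefran\u00c7aise")    # case folded
ace_label = ToASCII("alliancefran\u00e7aise")   # one label, no dots
unicode_label = ToUnicode(ace_label)
```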
Martin v. Löwis412ed3b2006-01-08 10:45:39 +00001328
1329 \subsection{\module{encodings.utf_8_sig} ---
1330 UTF-8 codec with BOM signature}
1331\declaremodule{standard}{encodings.utf-8-sig} % XXX utf_8_sig gives TeX errors
1332\modulesynopsis{UTF-8 codec with BOM signature}
Thomas Woutersd4ec0c32006-04-21 16:44:05 +00001333\moduleauthor{Walter D\"orwald}{}
Martin v. Löwis412ed3b2006-01-08 10:45:39 +00001334
1335\versionadded{2.5}
1336
This module implements a variant of the UTF-8 codec: On encoding, a
UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
the stateful encoder this is only done once (on the first write to the
byte stream). On decoding, an optional UTF-8 encoded BOM at the start
of the data will be skipped.
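
The once-only behaviour of the stateful encoder can be observed with an
incremental encoder. This sketch assumes a current Python 3 interpreter
and uses \function{codecs.getincrementalencoder()} and
\function{codecs.getincrementaldecoder()} to obtain the stateful
classes; the variable names are hypothetical.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()
first_chunk = encoder.encode("ab")    # BOM is written once, up front
second_chunk = encoder.encode("cd")   # no BOM on later calls

decoder = codecs.getincrementaldecoder("utf-8-sig")()
decoded = (decoder.decode(first_chunk)
           + decoder.decode(second_chunk, final=True))
```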