\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec and error handling lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a \class{CodecInfo} object having the following attributes:

\begin{itemize}
  \item \code{name} The name of the encoding;
  \item \code{encode} The stateless encoding function;
  \item \code{decode} The stateless decoding function;
  \item \code{incrementalencoder} An incremental encoder class or factory function;
  \item \code{incrementaldecoder} An incremental decoder class or factory function;
  \item \code{streamwriter} A stream writer class or factory function;
  \item \code{streamreader} A stream reader class or factory function.
\end{itemize}

The various functions or classes take the following arguments:

  \var{encode} and \var{decode}: These must be functions or methods
  which have the same interface as the
  \method{encode()}/\method{decode()} methods of Codec instances (see
  Codec Interface). The functions/methods are expected to work in a
  stateless mode.

  \var{incrementalencoder} and \var{incrementaldecoder}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{IncrementalEncoder} and
  \class{IncrementalDecoder}, respectively. Incremental codecs can maintain
  state.

  \var{streamreader} and \var{streamwriter}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for errors are \code{'strict'} (raise an exception
  in case of an encoding error), \code{'replace'} (replace malformed
  data with a suitable replacement marker, such as \character{?}),
  \code{'ignore'} (ignore malformed data and continue without further
  notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
  character reference (for encoding only)) and \code{'backslashreplace'}
  (replace with backslashed escape sequences (for encoding only)) as
  well as any other error handling name defined via
  \function{register_error()}.

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
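
The registration protocol can be sketched as follows. This is an
illustration only, assuming a current Python 3 interpreter (where
Unicode objects are \code{str} and encoded data is \code{bytes}); the
encoding name \code{demo_latin1} and the function \code{demo_search}
are made up for the example, and the keyword names follow the
\class{CodecInfo} constructor of that interpreter.

```python
import codecs


def demo_search(name):
    """Hypothetical search function that resolves one made-up encoding name."""
    # The registry passes the encoding name in lower case.
    if name != "demo_latin1":
        return None  # unknown encoding: let other search functions try
    base = codecs.lookup("latin-1")
    # Reuse the Latin-1 implementation under the made-up name.
    return codecs.CodecInfo(
        name="demo_latin1",
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(demo_search)

encoded = "caf\u00e9".encode("demo_latin1")
decoded = encoded.decode("demo_latin1")
```

Returning \code{None} for every other name matters: it lets the
registry continue with the remaining search functions.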

\begin{funcdesc}{lookup}{encoding}
Looks up the codec info in the Python codec registry and returns a
\class{CodecInfo} object as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no \class{CodecInfo}
object is found, a \exception{LookupError} is raised. Otherwise, the
\class{CodecInfo} object is stored in the cache and returned to the caller.
\end{funcdesc}
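
A short sketch of a lookup, assuming a current Python 3 interpreter.
The name \code{no-such-encoding-demo} is deliberately made up so that
no search function can resolve it.

```python
import codecs

info = codecs.lookup("UTF-8")  # the name is matched case-insensitively
data, consumed = info.encode("caf\u00e9")
text, used = info.decode(data)

try:
    codecs.lookup("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```

Both stateless functions return a tuple, so the amount of input
consumed is available next to the converted data.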

To simplify access to the various codecs, the module provides these
additional functions which use \function{lookup()} for the codec
lookup:

\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getincrementalencoder}{encoding}
Look up the codec for the given encoding and return its incremental encoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getincrementaldecoder}{encoding}
Look up the codec for the given encoding and return its incremental decoder
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found or the
codec doesn't support an incremental decoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its \class{StreamReader}
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its \class{StreamWriter}
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}
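
The six convenience functions can be exercised side by side. The
sketch below assumes a current Python 3 interpreter and uses an
in-memory \code{io.BytesIO} object as the stream; all variable names
are illustrative.

```python
import codecs
import io

# Stateless functions: each call returns (output, length consumed).
encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")
data, n_chars = encode("\u00e9t\u00e9")
text, n_bytes = decode(data)

# Incremental classes: instantiate, then feed the input piece by piece.
IncEncoder = codecs.getincrementalencoder("utf-8")
IncDecoder = codecs.getincrementaldecoder("utf-8")
chunk = IncEncoder().encode("\u00e9")
char = IncDecoder().decode(chunk)

# Stream classes: wrap a binary file-like object.
Reader = codecs.getreader("utf-8")
Writer = codecs.getwriter("utf-8")
buffer = io.BytesIO()
Writer(buffer).write("\u00e9")
buffer.seek(0)
roundtrip = Reader(buffer).read()
```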

\begin{funcdesc}{register_error}{name, error_handler}
Register the error handling function \var{error_handler} under the
name \var{name}. \var{error_handler} will be called during encoding
and decoding in case of an error, when \var{name} is specified as the
errors parameter.

For encoding, \var{error_handler} will be called with a
\exception{UnicodeEncodeError} instance, which contains information about
the location of the error. The error handler must either raise this or
a different exception or return a tuple with a replacement for the
unencodable part of the input and a position where encoding should
continue. The encoder will encode the replacement and continue encoding
the original input at the specified position. Negative position values
will be treated as being relative to the end of the input string. If the
resulting position is out of bounds, an \exception{IndexError} will be raised.

Decoding and translating work similarly, except that
\exception{UnicodeDecodeError} or \exception{UnicodeTranslateError} will be
passed to the handler and the replacement from the error handler will be
put into the output directly.
\end{funcdesc}
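
A minimal sketch of a custom handler, assuming a current Python 3
interpreter. The handler name \code{demo_underscore} and the function
\code{underscore_errors} are hypothetical; the sketch only deals with
encoding errors and re-raises anything else.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    replacement = "_" * (exc.end - exc.start)
    # Return the replacement and the position where encoding resumes.
    return replacement, exc.end


codecs.register_error("demo_underscore", underscore_errors)

result = "na\u00efve caf\u00e9".encode("ascii", errors="demo_underscore")
```

The exception's \code{start} and \code{end} attributes delimit the
unencodable slice, so returning \code{exc.end} resumes right after it.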

\begin{funcdesc}{lookup_error}{name}
Return the error handler previously registered under the name \var{name}.

Raises a \exception{LookupError} in case the handler cannot be found.
\end{funcdesc}

\begin{funcdesc}{strict_errors}{exception}
Implements the \code{strict} error handling.
\end{funcdesc}

\begin{funcdesc}{replace_errors}{exception}
Implements the \code{replace} error handling.
\end{funcdesc}

\begin{funcdesc}{ignore_errors}{exception}
Implements the \code{ignore} error handling.
\end{funcdesc}

\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
Implements the \code{xmlcharrefreplace} error handling.
\end{funcdesc}

\begin{funcdesc}{backslashreplace_errors}{exception}
Implements the \code{backslashreplace} error handling.
\end{funcdesc}
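
The predefined handlers are ordinary callables and can be invoked
directly with an exception instance. The sketch below assumes a
current Python 3 interpreter and builds a \exception{UnicodeEncodeError}
by hand; the reason string \code{"demo error"} is arbitrary.

```python
import codecs

# Describe an unencodable character at index 1 of the input "a\u00e9b".
error = UnicodeEncodeError("ascii", "a\u00e9b", 1, 2, "demo error")

# Each handler returns (replacement, position at which to resume).
replaced = codecs.replace_errors(error)
ignored = codecs.ignore_errors(error)
as_xml = codecs.xmlcharrefreplace_errors(error)
as_escape = codecs.backslashreplace_errors(error)

# The same handlers are reachable by name.
by_name = codecs.lookup_error("ignore")(error)

try:
    codecs.strict_errors(error)
    raised = False
except UnicodeEncodeError:
    raised = True
```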

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\note{The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.}

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'} which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}
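
A small round trip through a temporary file, as a sketch. It assumes a
Python 3 interpreter in which \function{codecs.open()} is still
available (recent releases steer users toward the built-in
\function{open()} and may emit a deprecation warning); the file name
\code{demo.txt} is arbitrary.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")

    # Unicode text goes in; UTF-8 bytes end up in the file.
    with codecs.open(path, "w", encoding="utf-8") as handle:
        handle.write("caf\u00e9\n")

    # Reading through the wrapper decodes the bytes again.
    with codecs.open(path, "r", encoding="utf-8") as handle:
        text = handle.read()

    # Inspect what was actually stored on disk.
    with open(path, "rb") as raw:
        raw_bytes = raw.read()
```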

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                             output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
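
The translation can be sketched with an in-memory stream, assuming a
current Python 3 interpreter. The two encodings are passed
positionally, in the order given in the signature above; the variable
names are illustrative.

```python
import codecs
import io

raw = io.BytesIO()
# Data passed to write() is Latin-1; the underlying file stores UTF-8.
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write("caf\u00e9".encode("latin-1"))
stored = raw.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 to the caller.
raw.seek(0)
read_back = wrapped.read()
```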

\begin{funcdesc}{iterencode}{iterable, encoding\optional{, errors}}
Uses an incremental encoder to iteratively encode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental encoder.
\versionadded{2.5}
\end{funcdesc}

\begin{funcdesc}{iterdecode}{iterable, encoding\optional{, errors}}
Uses an incremental decoder to iteratively decode the input provided by
\var{iterable}. This function is a generator. \var{errors} (as well as
any other keyword argument) is passed through to the incremental decoder.
\versionadded{2.5}
\end{funcdesc}
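
Both generators can be sketched on small lists, assuming a current
Python 3 interpreter. The second list deliberately splits a multi-byte
UTF-8 sequence across two chunks to show that state is carried over.

```python
import codecs

pieces = ["caf", "\u00e9", " au lait"]
encoded_chunks = list(codecs.iterencode(pieces, "utf-8"))

# Split the two-byte sequence for U+00E9 across two input chunks.
byte_chunks = [b"caf\xc3", b"\xa9"]
decoded = "".join(codecs.iterdecode(byte_chunks, "utf-8"))
```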

The module also provides the following constants which are useful
for reading and writing to platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM_UTF8}
\dataline{BOM_UTF16}
\dataline{BOM_UTF16_BE}
\dataline{BOM_UTF16_LE}
\dataline{BOM_UTF32}
\dataline{BOM_UTF32_BE}
\dataline{BOM_UTF32_LE}
These constants define various encodings of the Unicode byte order mark
(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
used in the stream or file and in UTF-8 as a Unicode signature.
\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
\constant{BOM_UTF16_LE} depending on the platform's native byte order,
\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
The others represent the BOM in UTF-8 and UTF-32 encodings.
\end{datadesc}
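
A brief sketch of how the constants relate to each other and how one
of them might be used, assuming a current Python 3 interpreter. The
helper \code{strip_utf8_signature} is hypothetical and not part of the
module.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native = codecs.BOM_UTF16_LE
else:
    native = codecs.BOM_UTF16_BE


def strip_utf8_signature(data):
    """Hypothetical helper: drop a leading UTF-8 signature if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_signature(codecs.BOM_UTF8 + b"abc")
untouched = strip_utf8_signature(b"abc")
```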


\subsection{Codec Base Classes \label{codec-base-classes}}

The \module{codecs} module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use
in Python.

Each codec has to define four interfaces to make it usable as a codec in
Python: stateless encoder, stateless decoder, stream reader and stream
writer. The stream reader and writers typically reuse the stateless
encoder/decoder to implement the file protocols.

The \class{Codec} class defines the interface for stateless
encoders/decoders.

To simplify and standardize error handling, the \method{encode()} and
\method{decode()} methods may implement different error handling
schemes by providing the \var{errors} string argument. The following
string values are defined and implemented by all standard Python
codecs:

\begin{tableii}{l|l}{code}{Value}{Meaning}
  \lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
                    this is the default.}
  \lineii{'ignore'}{Ignore the character and continue with the next.}
  \lineii{'replace'}{Replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the built-in Unicode codecs on
                     decoding and '?' on encoding.}
  \lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
                     character reference (only for encoding).}
  \lineii{'backslashreplace'}{Replace with backslashed escape sequences
                     (only for encoding).}
\end{tableii}

The set of allowed values can be extended via \function{register_error()}.
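
The effect of each value in the table can be observed with the ASCII
codec. This is a sketch assuming a current Python 3 interpreter, where
the \var{errors} argument is passed to the \method{encode()} and
\method{decode()} methods of \code{str} and \code{bytes} objects.

```python
text = "caf\u00e9"  # the last character cannot be encoded as ASCII

replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
as_xml = text.encode("ascii", "xmlcharrefreplace")
as_escape = text.encode("ascii", "backslashreplace")

# On decoding, 'replace' inserts U+FFFD REPLACEMENT CHARACTER.
decoded = b"caf\xe9".decode("ascii", "replace")

try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeError:
    strict_raised = True
```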


\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}{encode}{input\optional{, errors}}
  Encodes the object \var{input} and returns a tuple (output object,
  length consumed). While codecs are not restricted to use with Unicode, in
  a Unicode context, encoding converts a Unicode object to a plain string
  using a particular character set encoding (e.g., \code{cp1252} or
  \code{iso-8859-1}).

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamWriter} for codecs which have to keep state in order to
  make encoding efficient.

  The encoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}{decode}{input\optional{, errors}}
  Decodes the object \var{input} and returns a tuple (output object,
  length consumed). In a Unicode context, decoding converts a plain string
  encoded using a particular character set encoding to a Unicode object.

  \var{input} must be an object which provides the \code{bf_getreadbuf}
  buffer slot. Python strings, buffer objects and memory mapped files
  are examples of objects providing this slot.

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamReader} for codecs which have to keep state in order to
  make decoding efficient.

  The decoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}
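
The stateless interface can be illustrated with a toy codec. The class
\code{HexTextCodec} below is purely hypothetical; it assumes a current
Python 3 interpreter and maps text to the ASCII hex digits of its
UTF-8 bytes, which is enough to show the tuple return value and the
handling of empty input.

```python
import binascii
import codecs


class HexTextCodec(codecs.Codec):
    """Hypothetical stateless codec: text <-> hex digits of its UTF-8 bytes."""

    def encode(self, input, errors="strict"):
        raw = input.encode("utf-8", errors)
        # Return (output object, length of input consumed).
        return binascii.hexlify(raw), len(input)

    def decode(self, input, errors="strict"):
        raw = binascii.unhexlify(bytes(input))
        return raw.decode("utf-8", errors), len(input)


codec = HexTextCodec()
encoded, chars_consumed = codec.encode("h\u00e9")
decoded, bytes_consumed = codec.decode(encoded)

# Zero length input yields an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```

Neither method keeps anything on \code{self}, so the same instance can
serve any number of unrelated inputs.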

The \class{IncrementalEncoder} and \class{IncrementalDecoder} classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function,
but with multiple calls to the \method{encode()}/\method{decode()} method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the \method{encode()}/\method{decode()} method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
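
This equivalence can be checked directly. The sketch below assumes a
current Python 3 interpreter and cuts a UTF-8 byte string at arbitrary
offsets, including in the middle of multi-byte characters.

```python
import codecs

data = "d\u00e9j\u00e0 vu".encode("utf-8")
# Cut the input at arbitrary byte offsets, including inside a character.
chunks = [data[:2], data[2:5], data[5:]]

decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks[:-1]]
parts.append(decoder.decode(chunks[-1], final=True))

incremental = "".join(parts)
stateless = codecs.getdecoder("utf-8")(data)[0]
```

The incremental decoder holds back the incomplete trailing bytes of
each chunk and prepends them to the next one.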


\subsubsection{IncrementalEncoder Objects \label{incremental-encoder-objects}}

\versionadded{2.5}

The \class{IncrementalEncoder} class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalEncoder}{\optional{errors}}
  Constructor for an \class{IncrementalEncoder} instance.

  All incremental encoders must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  The \class{IncrementalEncoder} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
    \item \code{'xmlcharrefreplace'} Replace with the appropriate XML
                     character reference.
    \item \code{'backslashreplace'} Replace with backslashed escape sequences.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{IncrementalEncoder} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{encode}{object\optional{, final}}
  Encodes \var{object} (taking the current state of the encoder into account)
  and returns the resulting encoded object. If this is the last call to
  \method{encode()}, \var{final} must be true (the default is false).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Reset the encoder to the initial state.
\end{methoddesc}
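
The state kept by an incremental encoder is easy to observe with
UTF-16, which emits a byte order mark only at the start of the output.
This is a sketch assuming a current Python 3 interpreter.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("a")               # starts with a byte order mark
second = encoder.encode("b", final=True)  # continues the same output
encoder.reset()
after_reset = encoder.encode("a", final=True)  # initial state, BOM again
```

After \method{reset()} the encoder behaves like a freshly constructed
one, which is why the third call repeats the output of the first.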


\subsubsection{IncrementalDecoder Objects \label{incremental-decoder-objects}}

The \class{IncrementalDecoder} class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.

\begin{classdesc}{IncrementalDecoder}{\optional{errors}}
  Constructor for an \class{IncrementalDecoder} instance.

  All incremental decoders must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  The \class{IncrementalDecoder} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{IncrementalDecoder} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{decode}{object\optional{, final}}
  Decodes \var{object} (taking the current state of the decoder into account)
  and returns the resulting decoded object. If this is the last call to
  \method{decode()}, \var{final} must be true (the default is false).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Reset the decoder to the initial state.
\end{methoddesc}
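
The role of \var{final} and of the \var{errors} argument can be
sketched with truncated UTF-8 input, assuming a current Python 3
interpreter. The variable names are illustrative.

```python
import codecs

Decoder = codecs.getincrementaldecoder("utf-8")

strict = Decoder()                # errors defaults to 'strict'
pending = strict.decode(b"\xc3")  # incomplete sequence is held back
try:
    strict.decode(b"", final=True)  # no more input: the sequence is truncated
    truncated_raised = False
except UnicodeDecodeError:
    truncated_raised = True

lenient = Decoder(errors="replace")
lenient.decode(b"\xc3")
flushed = lenient.decode(b"", final=True)  # truncated bytes are replaced
lenient.reset()
clean = lenient.decode(b"ok", final=True)
```

Without \code{final=True} the decoder cannot tell a truncated sequence
from one whose remaining bytes have simply not arrived yet.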


The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
  Constructor for a \class{StreamWriter} instance.

  All stream writers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for writing (binary)
  data.

  The \class{StreamWriter} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are predefined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
    \item \code{'xmlcharrefreplace'} Replace with the appropriate XML
                     character reference.
    \item \code{'backslashreplace'} Replace with backslashed escape sequences.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{StreamWriter} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{write}{object}
  Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
  Writes the concatenated list of strings to the stream (possibly by
  reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Flushes and resets the codec buffers used for keeping state.

  Calling this method should ensure that the data on the output is put
  into a clean state that allows appending of new fresh data without
  having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
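
A stream writer can be tried out on an in-memory stream. The sketch
below assumes a current Python 3 interpreter and uses
\code{io.BytesIO} as the underlying binary stream.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, errors="strict")

writer.write("caf\u00e9\n")
writer.writelines(["th\u00e9\n", "eau\n"])
writer.reset()

written = raw.getvalue()
# Attributes not defined by the writer are looked up on the wrapped stream.
position = writer.tell()
```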


\subsubsection{StreamReader Objects \label{stream-reader-objects}}

The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamReader}{stream\optional{, errors}}
  Constructor for a \class{StreamReader} instance.

  All stream readers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for reading (binary)
  data.

  The \class{StreamReader} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}

  The \var{errors} argument will be assigned to an attribute of the
  same name. Assigning to this attribute makes it possible to switch
  between different error handling strategies during the lifetime
  of the \class{StreamReader} object.

  The set of allowed values for the \var{errors} argument can
  be extended with \function{register_error()}.
\end{classdesc}

\begin{methoddesc}{read}{\optional{size\optional{, chars\optional{, firstline}}}}
  Decodes data from the stream and returns the resulting object.

  \var{chars} indicates the number of characters to read from the
  stream. \function{read()} will never return more than \var{chars}
  characters, but it might return fewer, if there are not enough
  characters available.

  \var{size} indicates the approximate maximum number of bytes to read
  from the stream for decoding purposes. The decoder can modify this
  setting as appropriate. The default value -1 indicates to read and
  decode as much as possible. \var{size} is intended to prevent having
  to decode huge files in one step.

  \var{firstline} indicates that it would be sufficient to only return
  the first line, if there are decoding errors on later lines.

  The method should use a greedy read strategy meaning that it should
  read as much data as is allowed within the definition of the encoding
  and the given size, e.g.\ if optional encoding endings or state
  markers are available on the stream, these should be read too.

  \versionchanged[\var{chars} argument added]{2.4}
  \versionchanged[\var{firstline} argument added]{2.4.2}
\end{methoddesc}

\begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
  Read one line from the input stream and return the
  decoded data.

  \var{size}, if given, is passed as size argument to the stream's
  \method{readline()} method.

  If \var{keepends} is false, line endings will be stripped from the
  lines returned.

  \versionchanged[\var{keepends} argument added]{2.4}
\end{methoddesc}

\begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
  Read all lines available on the input stream and return them as a list
  of lines.

  Line breaks are implemented using the codec's decoder method and are
  included in the list entries if \var{keepends} is true.

  \var{sizehint}, if given, is passed as \var{size} argument to the
  stream's \method{read()} method.
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Resets the codec buffers used for keeping state.

  Note that no stream repositioning should take place. This method is
  primarily intended to be able to recover from decoding errors.
\end{methoddesc}

In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.
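
The reading methods can be sketched on an in-memory stream, assuming a
current Python 3 interpreter. The sample text and variable names are
illustrative.

```python
import codecs
import io

raw = io.BytesIO("un\ndeux\ntrois\nquatre".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

first_line = reader.readline()                 # keeps the line ending
second_line = reader.readline(keepends=False)  # strips it
two_chars = reader.read(chars=2)               # at most two characters
remaining = reader.readlines()                 # whatever is left, as lines
```

The reader buffers decoded data internally, so the calls can be mixed
freely as long as all reading goes through the same reader object.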
614
615The next two base classes are included for convenience. They are not
616needed by the codec registry, but may provide useful in practice.
617
618
\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}

The \class{StreamReaderWriter} allows wrapping streams which work in
both read and write modes.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
  Creates a \class{StreamReaderWriter} instance.
  \var{stream} must be a file-like object.
  \var{Reader} and \var{Writer} must be factory functions or classes
  providing the \class{StreamReader} and \class{StreamWriter} interface
  respectively.
  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
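
A possible way to build such an instance from the result of
\function{lookup()} is sketched below. It assumes a current Python 3
interpreter; the in-memory stream merely stands in for a real file opened
for reading and writing.

```python
import codecs
import io

# Look up the UTF-8 codec; the returned CodecInfo object carries the
# stream reader and stream writer factories.
info = codecs.lookup("utf-8")

# An in-memory byte stream stands in for a real file (illustrative only).
stream = io.BytesIO()
wrapped = codecs.StreamReaderWriter(stream, info.streamreader,
                                    info.streamwriter, "strict")

wrapped.write("caf\u00e9\n")   # encoded to UTF-8 on the way in
wrapped.seek(0)                # rewind; this also resets the codec state
text = wrapped.read()          # decoded from UTF-8 on the way out
encoded = stream.getvalue()    # raw bytes held by the underlying stream
```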


\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}

The \class{StreamRecoder} provides a frontend-backend view of
encoding data which is sometimes useful when dealing with different
encoding environments.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamRecoder}{stream, encode, decode,
                                 Reader, Writer, errors}
  Creates a \class{StreamRecoder} instance which implements a two-way
  conversion: \var{encode} and \var{decode} work on the frontend (the
  data returned by \method{read()} and passed to \method{write()}) while
  \var{Reader} and \var{Writer} work on the backend (reading and
  writing to the stream).

  You can use these objects to do transparent direct recodings from
  e.g.\ Latin-1 to UTF-8 and back.

  \var{stream} must be a file-like object.

  \var{encode} and \var{decode} must adhere to the \class{Codec}
  interface; \var{Reader} and \var{Writer} must be factory functions or
  classes providing objects of the \class{StreamReader} and
  \class{StreamWriter} interface respectively.

  \var{encode} and \var{decode} are needed for the frontend
  translation, \var{Reader} and \var{Writer} for the backend
  translation.  The intermediate format used is determined by the two
  sets of codecs, e.g.\ the Unicode codecs will use Unicode as the
  intermediate encoding.

  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
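
The Latin-1 to UTF-8 recoding mentioned above might look like the
following sketch. It assumes a current Python 3 interpreter; the two
in-memory streams and the sample words are illustrative only. The backend
(the stream) holds Latin-1 data, while the frontend (the caller) sees
UTF-8.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")        # frontend codec
latin1 = codecs.lookup("latin-1")    # backend codec

# Reading: the stream holds Latin-1 bytes, the caller receives UTF-8 bytes.
source = io.BytesIO("caf\u00e9".encode("latin-1"))
recoder = codecs.StreamRecoder(source, utf8.encode, utf8.decode,
                               latin1.streamreader, latin1.streamwriter,
                               "strict")
frontend_bytes = recoder.read()

# Writing: the caller passes UTF-8 bytes, the stream receives Latin-1 bytes.
target = io.BytesIO()
recoder = codecs.StreamRecoder(target, utf8.encode, utf8.decode,
                               latin1.streamreader, latin1.streamwriter,
                               "strict")
recoder.write("na\u00efve".encode("utf-8"))
backend_bytes = target.getvalue()
```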

\subsection{Encodings and Unicode\label{encodings-overview}}

Unicode strings are stored internally as sequences of code points (to
be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
compiled (either via \longprogramopt{enable-unicode=ucs2} or
\longprogramopt{enable-unicode=ucs4}, with the former being the default)
\ctype{Py_UNICODE} is either a 16-bit or
32-bit data type. Once a Unicode object is used outside of CPU and
memory, CPU endianness and how these arrays are stored as bytes become
an issue. Transforming a Unicode object into a sequence of bytes is
called encoding and recreating the Unicode object from the sequence of
bytes is known as decoding. There are many different ways in which this
transformation can be done (these methods are also called encodings).
The simplest method is to map the code points 0-255 to the bytes
\code{0x0}-\code{0xff}. This means that a Unicode object that contains
code points above \code{U+00FF} can't be encoded with this method (which
is called \code{'latin-1'} or \code{'iso-8859-1'}). In that case
\method{unicode.encode()} will raise a \exception{UnicodeEncodeError}
that looks like this: \samp{UnicodeEncodeError:
'latin-1' codec can't encode character u'\e u1234' in position 3: ordinal
not in range(256)}.
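
The behaviour described above can be reproduced with a few lines. The
sketch assumes a current Python 3 interpreter, where the Unicode type is
called \class{str} and the error message quotes the character as
\code{'\e u1234'} rather than \code{u'\e u1234'}.

```python
# Code points 0-255 map to exactly one byte each.
encoded = "caf\u00e9".encode("latin-1")
decoded = encoded.decode("latin-1")    # decoding recreates the string

# U+1234 lies outside that range, so encoding fails at position 3.
try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    failure = exc                      # records codec, position and reason
```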

There's another group of encodings (the so-called charmap encodings)
that choose a different subset of all Unicode code points and how
these code points are mapped to the bytes \code{0x0}-\code{0xff}.
To see how this is done simply open e.g.\ \file{encodings/cp1252.py}
(which is an encoding that is used primarily on Windows).
There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
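
Instead of opening the file, the table can also be inspected from Python.
The sketch below assumes a current Python 3 interpreter and assumes that
the string constant is exposed as \code{decoding_table}, as it is in
current CPython; that name is an implementation detail and may differ
elsewhere.

```python
import encodings.cp1252 as cp1252

# Assumption: the 256-character constant is named ``decoding_table``.
table = cp1252.decoding_table

euro = table[0x80]                        # cp1252 puts EURO SIGN at byte 0x80
agrees = b"\x80".decode("cp1252") == euro # the codec follows the table
latin1 = b"\x80".decode("latin-1")        # latin-1 maps 0x80 to U+0080 instead
```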

All of these encodings can only encode 256 of the 65536 (or 1114112)
code points defined in Unicode. A simple and straightforward way that
can store each Unicode code point is to store each code point as two
consecutive bytes (code points above \code{U+FFFF} need two such byte
pairs, a so-called surrogate pair). There are two possibilities: Store
the bytes in big
endian or in little endian order. These two encodings are called
UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
e.g.\ you use UTF-16-BE on a little endian machine you will always have
to swap bytes on encoding and decoding. UTF-16 avoids this problem:
Bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped
though. To be able to detect the endianness of a UTF-16 byte sequence,
there's the so-called BOM (the ``Byte Order Mark''). This is the Unicode
character \code{U+FEFF}. This character will be prepended to every UTF-16
byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
an illegal character that may not appear in a Unicode text. So when
the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE}
the bytes have to be swapped on decoding. Unfortunately up to Unicode
4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
NO-BREAK SPACE}: A character that has no width and doesn't allow a
word to be split. It can e.g.\ be used to give hints to a ligature
algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
this role). Nevertheless Unicode software still must be able to handle
\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
layout of the encoded bytes, and vanishes once the byte sequence has
been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
it's a normal character that will be decoded like any other.
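
The difference between the three UTF-16 codecs and the role of the BOM
can be illustrated as follows. The sketch assumes a current Python 3
interpreter; the sample string is arbitrary.

```python
import codecs
import sys

text = "A\u20ac"                      # LATIN CAPITAL LETTER A, EURO SIGN

big = text.encode("utf-16-be")        # bytes 00 41 20 ac, no BOM
little = text.encode("utf-16-le")     # bytes 41 00 ac 20, no BOM
native = text.encode("utf-16")        # BOM first, then native byte order

# The BOM that matches the byte order of the running machine.
native_bom = (codecs.BOM_UTF16_LE if sys.byteorder == "little"
              else codecs.BOM_UTF16_BE)

# With a BOM in front, the plain utf-16 decoder copes with either order
# and drops the BOM from the decoded string.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
```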

There's another encoding that is able to encode the full range of
Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
there are no issues with byte order in UTF-8. Each byte in a UTF-8
byte sequence consists of two parts: Marker bits (the most significant
bits) and payload bits. The marker bits are a sequence of zero to six
1 bits followed by a 0 bit. Unicode characters are encoded like this
(with x being payload bits, which when concatenated give the Unicode
character):

\begin{tableii}{l|l}{textrm}{Range}{Encoding}
\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
\end{tableii}

The least significant bit of the Unicode character is the rightmost x
bit.
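
The table can be turned into code almost literally. The function below is
a hand-written sketch for illustration, not the codec Python itself uses;
its name is made up. It is written for a current Python 3 interpreter,
whose built-in codec only accepts code points up to \code{U+10FFFF}, so
the comparison with the built-in codec is restricted to that range.

```python
def utf8_encode_code_point(cp):
    """Encode one code point using the marker/payload layout of the table."""
    if cp < 0:
        raise ValueError("code point must not be negative")
    if cp < 0x80:
        return bytes([cp])              # 0xxxxxxx
    # (exclusive upper limit, continuation bytes, marker bits of first byte)
    layouts = (
        (0x800, 1, 0xC0),               # 110xxxxx
        (0x10000, 2, 0xE0),             # 1110xxxx
        (0x200000, 3, 0xF0),            # 11110xxx
        (0x4000000, 4, 0xF8),           # 111110xx
        (0x80000000, 5, 0xFC),          # 1111110x
    )
    for limit, continuation, marker in layouts:
        if cp < limit:
            # First byte: marker bits plus the most significant payload bits.
            out = [marker | (cp >> (6 * continuation))]
            # Each following byte carries six payload bits behind "10".
            for shift in range(6 * (continuation - 1), -1, -6):
                out.append(0x80 | ((cp >> shift) & 0x3F))
            return bytes(out)
    raise ValueError("code point does not fit into six bytes")


samples = [0x41, 0xE9, 0x20AC, 0x1F600]
encoded = [utf8_encode_code_point(cp) for cp in samples]
```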

As UTF-8 is an 8-bit encoding no BOM is required and any \code{U+FEFF}
character in the decoded Unicode string (even if it's the first
character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.

Without external information it's impossible to reliably determine
which encoding was used for encoding a Unicode string. Each charmap
encoding can decode any random byte sequence. However that's not
possible with UTF-8, as UTF-8 byte sequences have a structure that
doesn't allow arbitrary byte sequences. To increase the reliability
with which a UTF-8 encoding can be detected, Microsoft invented a
variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
program: Before any of the Unicode characters is written to the file,
a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
charmap encoded file starts with these byte values (which would e.g.\ map to

   LATIN SMALL LETTER I WITH DIAERESIS \\
   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
   INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig
encoding can be correctly guessed from the byte sequence. So here the
BOM is not used to be able to determine the byte order used for
generating the byte sequence, but as a signature that helps in
guessing the encoding. On encoding the utf-8-sig codec will write
\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
On decoding utf-8-sig will skip those three bytes if they appear as the
first three bytes in the file.
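
The difference between the plain UTF-8 codec and the signature variant
can be seen in a short session. The sketch assumes a current Python 3
interpreter; the sample text is arbitrary.

```python
import codecs

signed = "spam".encode("utf-8-sig")      # signature, then the UTF-8 bytes
plain = "spam".encode("utf-8")           # no signature

# utf-8-sig skips the signature; it is optional on input.
from_signed = signed.decode("utf-8-sig")
from_plain = plain.decode("utf-8-sig")

# The plain codec keeps U+FEFF as an ordinary character.
kept = signed.decode("utf-8")
```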


\subsection{Standard Encodings\label{standard-encodings}}

Python comes with a number of built-in codecs, either implemented as C
functions, or with dictionaries as mapping tables. The following table
lists the codecs by name, together with a few common aliases, and the
languages for which the encoding is likely used. Neither the list of
aliases nor the list of languages is meant to be exhaustive. Notice
that spelling alternatives that only differ in case or use a hyphen
instead of an underscore are also valid aliases.
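
That different spellings reach the same codec can be checked with
\function{lookup()}. The sketch assumes a current Python 3 interpreter,
where the object returned by \function{lookup()} has a \code{name}
attribute; the list of spellings is just a sample.

```python
import codecs

# Spellings that differ only in case, hyphen versus underscore, or alias.
spellings = ["latin_1", "latin-1", "Latin-1", "LATIN_1",
             "ISO-8859-1", "iso8859-1", "L1"]

# Every spelling should resolve to one and the same codec name.
names = {codecs.lookup(spelling).name for spelling in spellings}
```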

Many of the character sets support the same languages. They vary in
individual characters (e.g.\ whether the EURO SIGN is supported or
not), and in the assignment of characters to code positions. For the
European languages in particular, the following variants typically
exist:

\begin{itemize}
\item an ISO 8859 codeset
\item a Microsoft Windows code page, which is typically derived from
      an 8859 codeset, but replaces control characters with additional
      graphic characters
\item an IBM EBCDIC code page
\item an IBM PC code page, which is \ASCII{} compatible
\end{itemize}
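
The EURO SIGN makes a convenient test case for these differences. The
sketch below assumes a current Python 3 interpreter and picks three
Western European codecs from the table that follows.

```python
euro = "\u20ac"                               # EURO SIGN

in_8859_15 = euro.encode("iso8859_15")        # one byte in ISO 8859-15
in_cp1252 = euro.encode("cp1252")             # a different byte on Windows

# ISO 8859-1 has no EURO SIGN at all.
try:
    euro.encode("latin_1")
    supported_in_latin_1 = True
except UnicodeEncodeError:
    supported_in_latin_1 = False

# The same byte value means different characters in related codesets.
as_8859_1 = in_8859_15.decode("latin_1")      # CURRENCY SIGN, not EURO SIGN
```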

\begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}

\lineiii{ascii}
        {646, us-ascii}
        {English}

\lineiii{big5}
        {big5-tw, csbig5}
        {Traditional Chinese}

\lineiii{big5hkscs}
        {big5-hkscs, hkscs}
        {Traditional Chinese}

\lineiii{cp037}
        {IBM037, IBM039}
        {English}

\lineiii{cp424}
        {EBCDIC-CP-HE, IBM424}
        {Hebrew}

\lineiii{cp437}
        {437, IBM437}
        {English}

\lineiii{cp500}
        {EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
        {Western Europe}

\lineiii{cp737}
        {}
        {Greek}

\lineiii{cp775}
        {IBM775}
        {Baltic languages}

\lineiii{cp850}
        {850, IBM850}
        {Western Europe}

\lineiii{cp852}
        {852, IBM852}
        {Central and Eastern Europe}

\lineiii{cp855}
        {855, IBM855}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{cp856}
        {}
        {Hebrew}

\lineiii{cp857}
        {857, IBM857}
        {Turkish}

\lineiii{cp860}
        {860, IBM860}
        {Portuguese}

\lineiii{cp861}
        {861, CP-IS, IBM861}
        {Icelandic}

\lineiii{cp862}
        {862, IBM862}
        {Hebrew}

\lineiii{cp863}
        {863, IBM863}
        {Canadian}

\lineiii{cp864}
        {IBM864}
        {Arabic}

\lineiii{cp865}
        {865, IBM865}
        {Danish, Norwegian}

\lineiii{cp866}
        {866, IBM866}
        {Russian}

\lineiii{cp869}
        {869, CP-GR, IBM869}
        {Greek}

\lineiii{cp874}
        {}
        {Thai}

\lineiii{cp875}
        {}
        {Greek}

\lineiii{cp932}
        {932, ms932, mskanji, ms-kanji}
        {Japanese}

\lineiii{cp949}
        {949, ms949, uhc}
        {Korean}

\lineiii{cp950}
        {950, ms950}
        {Traditional Chinese}

\lineiii{cp1006}
        {}
        {Urdu}

\lineiii{cp1026}
        {ibm1026}
        {Turkish}

\lineiii{cp1140}
        {ibm1140}
        {Western Europe}

\lineiii{cp1250}
        {windows-1250}
        {Central and Eastern Europe}

\lineiii{cp1251}
        {windows-1251}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{cp1252}
        {windows-1252}
        {Western Europe}

\lineiii{cp1253}
        {windows-1253}
        {Greek}

\lineiii{cp1254}
        {windows-1254}
        {Turkish}

\lineiii{cp1255}
        {windows-1255}
        {Hebrew}

\lineiii{cp1256}
        {windows-1256}
        {Arabic}

\lineiii{cp1257}
        {windows-1257}
        {Baltic languages}

\lineiii{cp1258}
        {windows-1258}
        {Vietnamese}

\lineiii{euc_jp}
        {eucjp, ujis, u-jis}
        {Japanese}

\lineiii{euc_jis_2004}
        {jisx0213, eucjis2004}
        {Japanese}

\lineiii{euc_jisx0213}
        {eucjisx0213}
        {Japanese}

\lineiii{euc_kr}
        {euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
        {Korean}

\lineiii{gb2312}
        {chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
         gb2312-80, iso-ir-58}
        {Simplified Chinese}

\lineiii{gbk}
        {936, cp936, ms936}
        {Unified Chinese}

\lineiii{gb18030}
        {gb18030-2000}
        {Unified Chinese}

\lineiii{hz}
        {hzgb, hz-gb, hz-gb-2312}
        {Simplified Chinese}

\lineiii{iso2022_jp}
        {csiso2022jp, iso2022jp, iso-2022-jp}
        {Japanese}

\lineiii{iso2022_jp_1}
        {iso2022jp-1, iso-2022-jp-1}
        {Japanese}

\lineiii{iso2022_jp_2}
        {iso2022jp-2, iso-2022-jp-2}
        {Japanese, Korean, Simplified Chinese, Western Europe, Greek}

\lineiii{iso2022_jp_2004}
        {iso2022jp-2004, iso-2022-jp-2004}
        {Japanese}

\lineiii{iso2022_jp_3}
        {iso2022jp-3, iso-2022-jp-3}
        {Japanese}

\lineiii{iso2022_jp_ext}
        {iso2022jp-ext, iso-2022-jp-ext}
        {Japanese}

\lineiii{iso2022_kr}
        {csiso2022kr, iso2022kr, iso-2022-kr}
        {Korean}

\lineiii{latin_1}
        {iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
        {Western Europe}

\lineiii{iso8859_2}
        {iso-8859-2, latin2, L2}
        {Central and Eastern Europe}

\lineiii{iso8859_3}
        {iso-8859-3, latin3, L3}
        {Esperanto, Maltese}

\lineiii{iso8859_4}
        {iso-8859-4, latin4, L4}
        {Baltic languages}

\lineiii{iso8859_5}
        {iso-8859-5, cyrillic}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{iso8859_6}
        {iso-8859-6, arabic}
        {Arabic}

\lineiii{iso8859_7}
        {iso-8859-7, greek, greek8}
        {Greek}

\lineiii{iso8859_8}
        {iso-8859-8, hebrew}
        {Hebrew}

\lineiii{iso8859_9}
        {iso-8859-9, latin5, L5}
        {Turkish}

\lineiii{iso8859_10}
        {iso-8859-10, latin6, L6}
        {Nordic languages}

\lineiii{iso8859_13}
        {iso-8859-13}
        {Baltic languages}

\lineiii{iso8859_14}
        {iso-8859-14, latin8, L8}
        {Celtic languages}

\lineiii{iso8859_15}
        {iso-8859-15}
        {Western Europe}

\lineiii{johab}
        {cp1361, ms1361}
        {Korean}

\lineiii{koi8_r}
        {}
        {Russian}

\lineiii{koi8_u}
        {}
        {Ukrainian}

\lineiii{mac_cyrillic}
        {maccyrillic}
        {Bulgarian, Byelorussian, Macedonian, Russian, Serbian}

\lineiii{mac_greek}
        {macgreek}
        {Greek}

\lineiii{mac_iceland}
        {maciceland}
        {Icelandic}

\lineiii{mac_latin2}
        {maclatin2, maccentraleurope}
        {Central and Eastern Europe}

\lineiii{mac_roman}
        {macroman}
        {Western Europe}

\lineiii{mac_turkish}
        {macturkish}
        {Turkish}

\lineiii{ptcp154}
        {csptcp154, pt154, cp154, cyrillic-asian}
        {Kazakh}

\lineiii{shift_jis}
        {csshiftjis, shiftjis, sjis, s_jis}
        {Japanese}

\lineiii{shift_jis_2004}
        {shiftjis2004, sjis_2004, sjis2004}
        {Japanese}

\lineiii{shift_jisx0213}
        {shiftjisx0213, sjisx0213, s_jisx0213}
        {Japanese}

\lineiii{utf_16}
        {U16, utf16}
        {all languages}

\lineiii{utf_16_be}
        {UTF-16BE}
        {all languages (BMP only)}

\lineiii{utf_16_le}
        {UTF-16LE}
        {all languages (BMP only)}

\lineiii{utf_7}
        {U7, unicode-1-1-utf-7}
        {all languages}

\lineiii{utf_8}
        {U8, UTF, utf8}
        {all languages}

\lineiii{utf_8_sig}
        {}
        {all languages}

\end{longtableiii}

A number of codecs are specific to Python, so their codec names have
no meaning outside Python. Some of them don't convert from Unicode
strings to byte strings, but instead use the property of the Python
codecs machinery that any bijective function with one argument can be
considered as an encoding.

For the codecs listed below, the result in the ``encoding'' direction
is always a byte string. The result of the ``decoding'' direction is
listed as operand type in the table.

\begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}

\lineiv{base64_codec}
         {base64, base-64}
         {byte string}
         {Convert operand to MIME base64}

\lineiv{bz2_codec}
         {bz2}
         {byte string}
         {Compress the operand using bz2}

\lineiv{hex_codec}
         {hex}
         {byte string}
         {Convert operand to hexadecimal representation, with two
          digits per byte}

\lineiv{idna}
         {}
         {Unicode string}
         {Implements \rfc{3490}.
          \versionadded{2.3}
          See also \refmodule{encodings.idna}}

\lineiv{mbcs}
         {dbcs}
         {Unicode string}
         {Windows only: Encode operand according to the ANSI codepage (CP_ACP)}

\lineiv{palmos}
         {}
         {Unicode string}
         {Encoding of PalmOS 3.5}

\lineiv{punycode}
         {}
         {Unicode string}
         {Implements \rfc{3492}.
          \versionadded{2.3}}

\lineiv{quopri_codec}
         {quopri, quoted-printable, quotedprintable}
         {byte string}
         {Convert operand to MIME quoted printable}

\lineiv{raw_unicode_escape}
         {}
         {Unicode string}
         {Produce a string that is suitable as raw Unicode literal in
          Python source code}

\lineiv{rot_13}
         {rot13}
         {byte string}
         {Returns the Caesar-cypher encryption of the operand}

\lineiv{string_escape}
         {}
         {byte string}
         {Produce a string that is suitable as string literal in
          Python source code}

\lineiv{undefined}
         {}
         {any}
         {Raise an exception for all conversions. Can be used as the
          system encoding if no automatic coercion between byte and
          Unicode strings is desired.}

\lineiv{unicode_escape}
         {}
         {Unicode string}
         {Produce a string that is suitable as Unicode literal in
          Python source code}

\lineiv{unicode_internal}
         {}
         {Unicode string}
         {Return the internal representation of the operand}

\lineiv{uu_codec}
         {uu}
         {byte string}
         {Convert the operand using uuencode}

\lineiv{zlib_codec}
         {zip, zlib}
         {byte string}
         {Compress the operand using zlib}

\end{tableiv}
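
A few of these codecs are shown in action below. The sketch assumes a
current Python 3 interpreter, where such byte-to-byte codecs are reached
through \function{codecs.encode()} and \function{codecs.decode()} rather
than through string methods, and where several codecs from the table
(for example \code{string_escape}) no longer exist. Only the full codec
names from the first column are used.

```python
import codecs

data = b"hello world"

as_hex = codecs.encode(data, "hex_codec")         # two hex digits per byte
as_base64 = codecs.encode(data, "base64_codec")   # MIME base64 plus newline
compressed = codecs.encode(data, "zlib_codec")    # zlib-compressed bytes

# The "decoding" direction reverses each transformation.
from_hex = codecs.decode(as_hex, "hex_codec")
from_base64 = codecs.decode(as_base64, "base64_codec")
from_zlib = codecs.decode(compressed, "zlib_codec")
```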

\subsection{\module{encodings.idna} ---
            Internationalized Domain Names in Applications}

\declaremodule{standard}{encodings.idna}
\modulesynopsis{Internationalized Domain Names implementation}
% XXX The next line triggers a formatting bug, so it's commented out
% until that can be fixed.
%\moduleauthor{Martin v. L\"owis}

\versionadded{2.3}

This module implements \rfc{3490} (Internationalized Domain Names in
Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the
\code{punycode} encoding and \refmodule{stringprep}.

These RFCs together define a protocol to support non-\ASCII{} characters
in domain names. A domain name containing non-\ASCII{} characters (such
as ``www.Alliancefran\c caise.nu'') is converted into an
\ASCII-compatible encoding (ACE, such as
``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
is then used in all places where arbitrary characters are not allowed
by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
on. This conversion is carried out in the application, if possible
invisibly to the user: The application should transparently convert
Unicode domain labels to IDNA on the wire, and convert ACE labels back
to Unicode before presenting them to the user.

Python supports this conversion in several ways: The \code{idna} codec
converts between Unicode and the ACE. Furthermore, the
\refmodule{socket} module transparently converts Unicode host names to
ACE, so that applications need not be concerned about converting host
names themselves when they pass them to the socket module. On top of
that, modules that have host names as function parameters, such as
\refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
(\refmodule{httplib} then also transparently sends an IDNA hostname in
the \mailheader{Host} field if it sends that field at all).

When receiving host names from the wire (such as in reverse name
lookup), no automatic conversion to Unicode is performed: Applications
wishing to present such host names to the user should decode them to
Unicode.
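
Converting the example host name with the \code{idna} codec could look
like this. The sketch assumes a current Python 3 interpreter, where the
encoded form is a \class{bytes} object.

```python
name = "www.Alliancefran\u00e7aise.nu"

# Unicode host name to ACE form, as it would be sent on the wire.
ace = name.encode("idna")

# ACE form back to Unicode, e.g. before showing it to the user.
# Nameprep folds case, so the capital letter does not survive.
shown = ace.decode("idna")
```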

The module \module{encodings.idna} also implements the nameprep
procedure, which performs certain normalizations on host names, to
achieve case-insensitivity of international domain names, and to unify
similar characters. The nameprep functions can be used directly if
desired.

\begin{funcdesc}{nameprep}{label}
Return the nameprepped version of \var{label}. The implementation
currently assumes query strings, so \code{AllowUnassigned} is
true.
\end{funcdesc}

\begin{funcdesc}{ToASCII}{label}
Convert a label to \ASCII, as specified in \rfc{3490}.
\code{UseSTD3ASCIIRules} is assumed to be false.
\end{funcdesc}

\begin{funcdesc}{ToUnicode}{label}
Convert a label to Unicode, as specified in \rfc{3490}.
\end{funcdesc}
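
The three functions operate on single labels, not on complete host names.
A minimal sketch, assuming a current Python 3 interpreter (where
\function{ToASCII()} returns \class{bytes}) and using one label of the
example domain from above:

```python
from encodings import idna

label = "Alliancefran\u00e7aise"

prepped = idna.nameprep(label)         # case-folded and normalized
ace_label = idna.ToASCII(label)        # nameprep, punycode, ACE prefix
restored = idna.ToUnicode(ace_label)   # back to the nameprepped label
```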

\subsection{\module{encodings.utf_8_sig} ---
            UTF-8 codec with BOM signature}
\declaremodule{standard}{encodings.utf-8-sig}   % XXX utf_8_sig gives TeX errors
\modulesynopsis{UTF-8 codec with BOM signature}
\moduleauthor{Walter D\"orwald}

\versionadded{2.5}

This module implements a variant of the UTF-8 codec: On encoding, a
UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
the stateful encoder this is only done once (on the first write to the
byte stream). On decoding, an optional UTF-8 encoded BOM at the start
of the data will be skipped.
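
The ``only once'' behaviour of the stateful encoder can be observed with
the incremental interface. The sketch assumes a current Python 3
interpreter; the two chunks of text are arbitrary.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()
first = encoder.encode("spam")     # BOM followed by the encoded text
second = encoder.encode("eggs")    # no BOM on later calls

decoder = codecs.getincrementaldecoder("utf-8-sig")()
text = decoder.decode(first) + decoder.decode(second, final=True)
```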