\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})} with the following meanings:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the
  \method{encode()}/\method{decode()} methods of \class{Codec}
  instances (see section~\ref{codec-objects}, ``Codec Objects''). The
  functions/methods are expected to work in a stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an exception
  in case of an encoding error), \code{'replace'} (replace malformed
  data with a suitable replacement marker, such as \character{?}) and
  \code{'ignore'} (ignore malformed data and continue without further
  notice).

If a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
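
The registration protocol can be sketched as follows. This is only an
illustration and assumes a recent Python~3 interpreter, where the result
of \function{lookup()} still unpacks as the four-item tuple described
above. The search function \code{demo_search} and the alias
\code{demo_latin} are hypothetical names, not part of the module.

```python
import codecs


def demo_search(name):
    """Hypothetical search function that serves the alias 'demo_latin'."""
    # The registry passes the encoding name in lower case.
    if name == "demo_latin":
        # Reuse the function tuple of an existing codec.
        return codecs.lookup("latin-1")
    # Unknown encoding: return None so other search functions can try.
    return None


codecs.register(demo_search)

# The registry now resolves the alias through demo_search().
encoder, decoder, stream_reader, stream_writer = codecs.lookup("demo_latin")
encoded, consumed = encoder("caf\u00e9")
```

Returning \code{None} for every other name keeps the search function
from shadowing codecs that other search functions provide.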

\begin{funcdesc}{lookup}{encoding}
Looks up a codec tuple in the Python codec registry and returns the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
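
A short sketch of a lookup, assuming a recent Python~3 interpreter in
which text strings take the role of Unicode objects and \code{bytes}
objects take the role of plain strings. The encoding name
\code{no-such-encoding-demo} is made up to trigger the error case.

```python
import codecs

# Unpack the function tuple returned by the registry.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

# Stateless encoder and decoder both return (output, length consumed).
data, consumed = encoder("\u20ac")   # EURO SIGN
text, used = decoder(data)

# An unknown encoding raises LookupError.
try:
    codecs.lookup("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```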
65
Skip Montanarob02ea652002-04-17 19:33:06 +000066To simplify access to the various codecs, the module provides these
Marc-André Lemburg494f2ae2001-09-19 11:33:31 +000067additional functions which use \function{lookup()} for the codec
68lookup:
69
70\begin{funcdesc}{getencoder}{encoding}
71Lookup up the codec for the given encoding and return its encoder
72function.
73
74Raises a \exception{LookupError} in case the encoding cannot be found.
75\end{funcdesc}
76
77\begin{funcdesc}{getdecoder}{encoding}
78Lookup up the codec for the given encoding and return its decoder
79function.
80
81Raises a \exception{LookupError} in case the encoding cannot be found.
82\end{funcdesc}
83
84\begin{funcdesc}{getreader}{encoding}
85Lookup up the codec for the given encoding and return its StreamReader
86class or factory function.
87
88Raises a \exception{LookupError} in case the encoding cannot be found.
89\end{funcdesc}
90
91\begin{funcdesc}{getwriter}{encoding}
92Lookup up the codec for the given encoding and return its StreamWriter
93class or factory function.
94
95Raises a \exception{LookupError} in case the encoding cannot be found.
96\end{funcdesc}
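
The four convenience functions might be used as sketched here. The
example assumes a recent Python~3 interpreter and uses in-memory
\code{io.BytesIO} objects as stand-ins for real files.

```python
import codecs
import io

# Stateless functions: each call returns (output, length consumed).
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")
data, _ = encode("na\u00efve")
text, _ = decode(data)

# Stream factories: wrap a binary stream for writing, then for reading.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("na\u00efve")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```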
97
Fred Drakeb7979c72000-04-06 14:21:58 +000098To simplify working with encoded files or stream, the module
99also defines these utility functions:
100
Fred Drakee1b304d2000-07-24 19:35:52 +0000101\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
102 errors\optional{, buffering}}}}
Fred Drakeb7979c72000-04-06 14:21:58 +0000103Open an encoded file using the given \var{mode} and return
104a wrapped version providing transparent encoding/decoding.
105
Fred Drake0aa811c2001-10-20 04:24:09 +0000106\note{The wrapped version will only accept the object format
Fred Drakee1b304d2000-07-24 19:35:52 +0000107defined by the codecs, i.e.\ Unicode objects for most built-in
108codecs. Output is also codec-dependent and will usually be Unicode as
Fred Drake0aa811c2001-10-20 04:24:09 +0000109well.}
Fred Drakeb7979c72000-04-06 14:21:58 +0000110
111\var{encoding} specifies the encoding which is to be used for the
112the file.
113
114\var{errors} may be given to define the error handling. It defaults
Fred Drakee1b304d2000-07-24 19:35:52 +0000115to \code{'strict'} which causes a \exception{ValueError} to be raised
116in case an encoding error occurs.
Fred Drakeb7979c72000-04-06 14:21:58 +0000117
Fred Drake69ca9502000-04-06 16:09:59 +0000118\var{buffering} has the same meaning as for the built-in
119\function{open()} function. It defaults to line buffered.
Fred Drakeb7979c72000-04-06 14:21:58 +0000120\end{funcdesc}
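
A minimal sketch of writing and reading an encoded file, assuming a
recent Python~3 interpreter. The file name \code{demo.txt} and the
temporary directory are chosen for illustration only. The warning filter
is a precaution based on the assumption that some newer interpreters
report this function as deprecated.

```python
import codecs
import os
import tempfile
import warnings

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with warnings.catch_warnings():
    # Assumption: newer interpreters may flag codecs.open() as deprecated.
    warnings.simplefilter("ignore", DeprecationWarning)

    # Text written to the wrapped file is encoded transparently.
    f = codecs.open(path, "w", "utf-8")
    f.write("Gr\u00fc\u00dfe\n")
    f.close()

    # Data read from the wrapped file is decoded transparently.
    f = codecs.open(path, "r", "utf-8")
    text = f.read()
    f.close()

# Inspect the encoded data actually stored on disk.
with open(path, "rb") as raw_file:
    raw = raw_file.read()
```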

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                             output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
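
The translation performed by \function{EncodedFile()} can be
illustrated as follows. The sketch assumes a recent Python~3
interpreter, where the ``strings'' on both sides are \code{bytes}
objects, and passes the two encodings positionally because the
parameter names may differ between versions.

```python
import codecs
import io

# Latin-1 data goes in; UTF-8 data is stored in the underlying file.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading reverses the translation: UTF-8 on file, Latin-1 returned.
reading = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8")
back = reading.read()
```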

The module also provides the following constants which are useful
for reading from and writing to platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM_UTF8}
\dataline{BOM_UTF16}
\dataline{BOM_UTF16_BE}
\dataline{BOM_UTF16_LE}
\dataline{BOM_UTF32}
\dataline{BOM_UTF32_BE}
\dataline{BOM_UTF32_LE}
These constants define various encodings of the Unicode byte order mark
(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
used in the stream or file and in UTF-8 as a Unicode signature.
\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
\constant{BOM_UTF16_LE} depending on the platform's native byte order.
\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
The others represent the BOM in UTF-8 and UTF-32 encodings.
\end{datadesc}
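
The relationships between the constants can be checked with a short
sketch. It assumes a recent Python~3 interpreter, where the constants
are \code{bytes} objects. The helper \code{strip_utf8_signature} is a
hypothetical function written for this example.

```python
import codecs
import sys

# Pick the constant that matches the platform's native byte order.
if sys.byteorder == "little":
    native = codecs.BOM_UTF16_LE
else:
    native = codecs.BOM_UTF16_BE


def strip_utf8_signature(data):
    """Hypothetical helper: drop a leading UTF-8 signature, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_signature(codecs.BOM_UTF8 + b"hello")
```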


\begin{seealso}
  \seeurl{http://sourceforge.net/projects/python-codecs/}{A
          SourceForge project working on additional support for Asian
          codecs for use with Python. They are in the early stages of
          development at the time of this writing --- look in their
          FTP area for downloadable files.}
\end{seealso}


\subsection{Codec Base Classes}

The \module{codecs} module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use
in Python.

Each codec has to define four interfaces to make it usable as a codec in
Python: stateless encoder, stateless decoder, stream reader and stream
writer. The stream readers and writers typically reuse the stateless
encoder/decoder to implement the file protocols.

The \class{Codec} class defines the interface for stateless
encoders/decoders.

To simplify and standardize error handling, the \method{encode()} and
\method{decode()} methods may implement different error handling
schemes by providing the \var{errors} string argument. The following
string values are defined and implemented by all standard Python
codecs:

\begin{tableii}{l|l}{code}{Value}{Meaning}
  \lineii{'strict'}{Raise \exception{ValueError} (or a subclass);
                    this is the default.}
  \lineii{'ignore'}{Ignore the character and continue with the next.}
  \lineii{'replace'}{Replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the built-in Unicode codecs.}
\end{tableii}
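
The three error handling schemes can be compared with the ASCII codec.
This sketch assumes a recent Python~3 interpreter; the sample text is
arbitrary.

```python
import codecs

encode = codecs.getencoder("ascii")
decode = codecs.getdecoder("ascii")

# 'ignore' drops the offending character, 'replace' substitutes a marker.
ignored = encode("caf\u00e9", "ignore")
replaced = encode("caf\u00e9", "replace")

# When decoding, 'replace' inserts U+FFFD REPLACEMENT CHARACTER.
decoded = decode(b"caf\xe9", "replace")

# 'strict' raises ValueError (or a subclass of it).
try:
    encode("caf\u00e9", "strict")
    raised = False
except ValueError:
    raised = True
```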


\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}{encode}{input\optional{, errors}}
  Encodes the object \var{input} and returns a tuple (output object,
  length consumed). While codecs are not restricted to use with Unicode, in
  a Unicode context, encoding converts a Unicode object to a plain string
  using a particular character set encoding (e.g., \code{cp1252} or
  \code{iso-8859-1}).

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamWriter} for codecs which have to keep state in order to
  make encoding efficient.

  The encoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}{decode}{input\optional{, errors}}
  Decodes the object \var{input} and returns a tuple (output object,
  length consumed). In a Unicode context, decoding converts a plain string
  encoded using a particular character set encoding to a Unicode object.

  \var{input} must be an object which provides the \code{bf_getreadbuf}
  buffer slot. Python strings, buffer objects and memory mapped files
  are examples of objects providing this slot.

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamReader} for codecs which have to keep state in order to
  make decoding efficient.

  The decoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}
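
A stateless codec following this interface might look like the sketch
below. It assumes a recent Python~3 interpreter. The class
\code{UpperAsciiCodec} is a hypothetical toy codec invented for this
example; it stores upper-case ASCII on the encoded side.

```python
import codecs


class UpperAsciiCodec(codecs.Codec):
    """Hypothetical toy codec: upper-case ASCII on the encoded side."""

    def encode(self, input, errors="strict"):
        # Return (output object, length consumed); no state is stored.
        output = input.encode("ascii", errors).upper()
        return output, len(input)

    def decode(self, input, errors="strict"):
        # Accept any object that supports the buffer interface.
        raw = bytes(input)
        return raw.decode("ascii", errors).lower(), len(raw)


codec = UpperAsciiCodec()
encoded = codec.encode("abc")
decoded = codec.decode(b"ABC")
```

Both methods accept zero length input and return an empty object of
the output type, as required above.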

The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
  Constructor for a \class{StreamWriter} instance.

  All stream writers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for writing (binary)
  data.

  The \class{StreamWriter} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}
\end{classdesc}

\begin{methoddesc}{write}{object}
  Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
  Writes the concatenated list of strings to the stream (possibly by
  reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Flushes and resets the codec buffers used for keeping state.

  Calling this method should ensure that the data on the output is put
  into a clean state that allows appending of new fresh data without
  having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
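
The writer interface can be exercised with one of the standard codecs.
This sketch assumes a recent Python~3 interpreter and uses an
\code{io.BytesIO} object as the underlying binary stream.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, "strict")

writer.write("caf\u00e9\n")
writer.writelines(["second ", "line\n"])
writer.reset()

# Other attributes fall through to the underlying stream, so the
# BytesIO method getvalue() is reachable through the writer.
stored = writer.getvalue()
```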


\subsubsection{StreamReader Objects \label{stream-reader-objects}}

The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamReader}{stream\optional{, errors}}
  Constructor for a \class{StreamReader} instance.

  All stream readers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for reading (binary)
  data.

  The \class{StreamReader} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}
\end{classdesc}

\begin{methoddesc}{read}{\optional{size}}
  Decodes data from the stream and returns the resulting object.

  \var{size} indicates the approximate maximum number of bytes to read
  from the stream for decoding purposes. The decoder can modify this
  setting as appropriate. The default value \code{-1} indicates to read
  and decode as much as possible. \var{size} is intended to prevent
  having to decode huge files in one step.

  The method should use a greedy read strategy, meaning that it should
  read as much data as is allowed within the definition of the encoding
  and the given size, e.g.\ if optional encoding endings or state
  markers are available on the stream, these should be read too.
\end{methoddesc}

\begin{methoddesc}{readline}{\optional{size}}
  Read one line from the input stream and return the
  decoded data.

  Unlike the \method{readlines()} method, this method inherits
  the line breaking knowledge from the underlying stream's
  \method{readline()} method; there is currently no support for line
  breaking using the codec decoder due to lack of line buffering.
  Subclasses should, however, try to implement this method using
  their own knowledge of line breaking where possible.

  \var{size}, if given, is passed as size argument to the stream's
  \method{readline()} method.
\end{methoddesc}

\begin{methoddesc}{readlines}{\optional{sizehint}}
  Read all lines available on the input stream and return them as a
  list of lines.

  Line breaks are implemented using the codec's decoder method and are
  included in the list entries.

  \var{sizehint}, if given, is passed as \var{size} argument to the
  stream's \method{read()} method.
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Resets the codec buffers used for keeping state.

  Note that no stream repositioning should take place. This method is
  primarily intended to be able to recover from decoding errors.
\end{methoddesc}

In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.
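
The reader methods can be tried out on in-memory data. This sketch
assumes a recent Python~3 interpreter, whose line handling may differ
in detail from the description above; the sample text is arbitrary.

```python
import codecs
import io

data = "gr\u00f6\u00dfe\nma\u00df\n".encode("utf-8")
make_reader = codecs.getreader("utf-8")

# read() without a size decodes as much as possible.
everything = make_reader(io.BytesIO(data)).read()

# readline() returns one decoded line; readlines() returns the rest.
reader = make_reader(io.BytesIO(data))
first_line = reader.readline()
remaining = reader.readlines()

# reset() clears codec state without repositioning the stream.
reader.reset()
```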

The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.


\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}

The \class{StreamReaderWriter} class allows wrapping streams which work
in both read and write modes.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
  Creates a \class{StreamReaderWriter} instance.
  \var{stream} must be a file-like object.
  \var{Reader} and \var{Writer} must be factory functions or classes
  providing the \class{StreamReader} and \class{StreamWriter} interface,
  respectively.
  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
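
Constructing an instance from the factories returned by
\function{lookup()} might look like this. The sketch assumes a recent
Python~3 interpreter and an in-memory \code{io.BytesIO} stream that
supports both reading and writing.

```python
import codecs
import io

# The last two items of the lookup result are the stream factories.
encoder, decoder, Reader, Writer = codecs.lookup("utf-8")

raw = io.BytesIO()
stream = codecs.StreamReaderWriter(raw, Reader, Writer, "strict")

stream.write("h\u00e9llo\n")   # encoded through the writer
stream.seek(0)                 # reposition the underlying stream
text = stream.read()           # decoded through the reader
```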


\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}

The \class{StreamRecoder} class provides a frontend/backend view of
encoding data which is sometimes useful when dealing with different
encoding environments.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamRecoder}{stream, encode, decode,
                                 Reader, Writer, errors}
  Creates a \class{StreamRecoder} instance which implements a two-way
  conversion: \var{encode} and \var{decode} work on the frontend (the
  input to \method{read()} and output of \method{write()}) while
  \var{Reader} and \var{Writer} work on the backend (reading and
  writing to the stream).

  You can use these objects to do transparent direct recodings from
  e.g.\ Latin-1 to UTF-8 and back.

  \var{stream} must be a file-like object.

  \var{encode} and \var{decode} must adhere to the \class{Codec}
  interface. \var{Reader} and \var{Writer} must be factory functions or
  classes providing objects of the \class{StreamReader} and
  \class{StreamWriter} interface, respectively.

  \var{encode} and \var{decode} are needed for the frontend
  translation, \var{Reader} and \var{Writer} for the backend
  translation. The intermediate format used is determined by the two
  sets of codecs, e.g.\ the Unicode codecs will use Unicode as
  intermediate encoding.

  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
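
The Latin-1 to UTF-8 recoding mentioned above can be sketched as
follows, assuming a recent Python~3 interpreter in which both the
frontend and the backend data are \code{bytes} objects. The frontend
uses the Latin-1 codec functions and the backend uses the UTF-8 stream
factories.

```python
import codecs
import io

# Frontend: stateless Latin-1 functions. Backend: UTF-8 stream factories.
latin_encode, latin_decode, _, _ = codecs.lookup("latin-1")
_, _, Utf8Reader, Utf8Writer = codecs.lookup("utf-8")

raw = io.BytesIO()
recoder = codecs.StreamRecoder(raw, latin_encode, latin_decode,
                               Utf8Reader, Utf8Writer, "strict")

# Latin-1 data written to the recoder is stored as UTF-8.
recoder.write(b"caf\xe9")
stored = raw.getvalue()

# Reading converts the stored UTF-8 data back to Latin-1.
raw.seek(0)
front = recoder.read()
```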
446