\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})} that meet the following
requirements:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the
  \method{encode()}/\method{decode()} methods of \class{Codec}
  instances (see section~\ref{codec-objects}). The functions/methods
  are expected to work in a stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an
  exception in case of an encoding error), \code{'replace'} (replace
  malformed data with a suitable replacement marker, such as
  \character{?}) and \code{'ignore'} (ignore malformed data and
  continue without further notice).

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
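As a minimal sketch of the search function protocol described above,
the following registers a function that answers for a single made-up
alias and returns \code{None} for everything else. The alias
\code{myutf8alias} is purely hypothetical, and the example assumes a
current Python 3 interpreter, where \function{lookup()} returns an
object that still behaves like the four-element tuple.

```python
import codecs

# Hypothetical alias name, chosen only for this illustration.
ALIAS = "myutf8alias"


def search_function(encoding):
    """Return the UTF-8 codec tuple for ALIAS, or None for anything else."""
    if encoding == ALIAS:
        # Reuse the functions of an existing codec instead of writing new ones.
        return codecs.lookup("utf-8")
    return None


codecs.register(search_function)

# The registry now resolves the alias through the search function.
encoder, decoder, stream_reader, stream_writer = codecs.lookup(ALIAS)
encoded, consumed = encoder("abc")
```

Returning \code{None} for unknown names lets the registry move on to
the next registered search function.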

\begin{funcdesc}{lookup}{encoding}
Looks up a codec tuple in the Python codec registry and returns the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
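The lookup behaviour can be illustrated with a short sketch, assuming
a current Python 3 interpreter (byte strings are \code{bytes} objects
there). The failing encoding name is invented for the example.

```python
import codecs

codec_tuple = codecs.lookup("utf-8")
encoder, decoder = codec_tuple[0], codec_tuple[1]

# Both functions return a tuple (output object, length consumed).
encoded, chars_consumed = encoder("abc")
decoded, bytes_consumed = decoder(encoded)

# Hypothetical name that no search function is expected to recognise.
try:
    codecs.lookup("no-such-encoding-for-demo")
    found = True
except LookupError:
    found = False
```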

To simplify access to the various codecs, the module provides these
additional functions which use \function{lookup()} for the codec
lookup:

\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its StreamReader
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its StreamWriter
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}
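The four convenience functions above might be used as in the following
sketch. It assumes a current Python 3 interpreter and uses an
in-memory \code{io.BytesIO} object as a stand-in for a real file.

```python
import codecs
import io

# Stateless encoder and decoder functions.
encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")
encoded, chars_consumed = encode("caf\u00e9")
text, bytes_consumed = decode(encoded)

# Stream factories: each is called with the underlying binary stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("caf\u00e9")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```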
97
Fred Drakeb7979c72000-04-06 14:21:58 +000098To simplify working with encoded files or stream, the module
99also defines these utility functions:
100
Fred Drakee1b304d2000-07-24 19:35:52 +0000101\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
102 errors\optional{, buffering}}}}
Fred Drakeb7979c72000-04-06 14:21:58 +0000103Open an encoded file using the given \var{mode} and return
104a wrapped version providing transparent encoding/decoding.
105
Fred Drake0aa811c2001-10-20 04:24:09 +0000106\note{The wrapped version will only accept the object format
Fred Drakee1b304d2000-07-24 19:35:52 +0000107defined by the codecs, i.e.\ Unicode objects for most built-in
108codecs. Output is also codec-dependent and will usually be Unicode as
Fred Drake0aa811c2001-10-20 04:24:09 +0000109well.}
Fred Drakeb7979c72000-04-06 14:21:58 +0000110
111\var{encoding} specifies the encoding which is to be used for the
112the file.
113
114\var{errors} may be given to define the error handling. It defaults
Fred Drakee1b304d2000-07-24 19:35:52 +0000115to \code{'strict'} which causes a \exception{ValueError} to be raised
116in case an encoding error occurs.
Fred Drakeb7979c72000-04-06 14:21:58 +0000117
Fred Drake69ca9502000-04-06 16:09:59 +0000118\var{buffering} has the same meaning as for the built-in
119\function{open()} function. It defaults to line buffered.
Fred Drakeb7979c72000-04-06 14:21:58 +0000120\end{funcdesc}
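A small round trip through \function{open()} could look like the
sketch below. It assumes a current Python 3 interpreter, where the
wrapped file also works as a context manager. The file name is
hypothetical and the file lives in a temporary directory.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "sample.txt")  # hypothetical file name

    # Text written to the wrapper is encoded before it reaches the file.
    with codecs.open(path, "w", "utf-8") as f:
        f.write("gr\u00fc\u00df")

    # Data read through the wrapper is decoded transparently.
    with codecs.open(path, "r", "utf-8") as f:
        text = f.read()

    # The built-in open() shows the encoded bytes actually stored on disk.
    with open(path, "rb") as raw_file:
        raw_bytes = raw_file.read()
```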
121
Fred Drakee1b304d2000-07-24 19:35:52 +0000122\begin{funcdesc}{EncodedFile}{file, input\optional{,
123 output\optional{, errors}}}
Fred Drakeb7979c72000-04-06 14:21:58 +0000124Return a wrapped version of file which provides transparent
125encoding translation.
126
127Strings written to the wrapped file are interpreted according to the
128given \var{input} encoding and then written to the original file as
Fred Drakee1b304d2000-07-24 19:35:52 +0000129strings using the \var{output} encoding. The intermediate encoding will
Fred Drakeb7979c72000-04-06 14:21:58 +0000130usually be Unicode but depends on the specified codecs.
131
Fred Drakee1b304d2000-07-24 19:35:52 +0000132If \var{output} is not given, it defaults to \var{input}.
Fred Drakeb7979c72000-04-06 14:21:58 +0000133
134\var{errors} may be given to define the error handling. It defaults to
Fred Drakee1b304d2000-07-24 19:35:52 +0000135\code{'strict'}, which causes \exception{ValueError} to be raised in case
Fred Drakeb7979c72000-04-06 14:21:58 +0000136an encoding error occurs.
137\end{funcdesc}
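The translation performed by \function{EncodedFile()} can be sketched
as follows, assuming a current Python 3 interpreter in which the
strings passed to and returned by the wrapper are \code{bytes}
objects. Here Latin-1 plays the role of \var{input} and UTF-8 the role
of \var{output}.

```python
import codecs
import io

raw = io.BytesIO()

# Data written in Latin-1 is stored in the underlying file as UTF-8.
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading through a fresh wrapper converts the UTF-8 data back to Latin-1.
rewrapped = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8")
read_back = rewrapped.read()
```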

The module also provides the following constants which are useful
for reading and writing platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
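The relationship between \constant{BOM}, \constant{BOM_BE} and
\constant{BOM_LE} can be checked with a short sketch. The helper
\code{detect_byte_order()} is hypothetical and only inspects the
leading bytes for these two marks; the example assumes a current
Python 3 interpreter, where the constants are \code{bytes} objects.

```python
import codecs
import sys

# BOM should match the mark for the platform's native byte order.
native_bom = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE


def detect_byte_order(data):
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no recognised mark at the start of the data


big_endian_sample = codecs.BOM_BE + b"\x00a"
order = detect_byte_order(big_endian_sample)
```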
156
Fred Drakedc40ac02001-01-22 20:17:54 +0000157
158\begin{seealso}
159 \seeurl{http://sourceforge.net/projects/python-codecs/}{A
160 SourceForge project working on additional support for Asian
161 codecs for use with Python. They are in the early stages of
162 development at the time of this writing --- look in their
163 FTP area for downloadable files.}
164\end{seealso}
165
166
Fred Drake602aa772000-10-12 20:50:55 +0000167\subsection{Codec Base Classes}
168
169The \module{codecs} defines a set of base classes which define the
170interface and can also be used to easily write you own codecs for use
171in Python.
172
173Each codec has to define four interfaces to make it usable as codec in
174Python: stateless encoder, stateless decoder, stream reader and stream
175writer. The stream reader and writers typically reuse the stateless
176encoder/decoder to implement the file protocols.
177
178The \class{Codec} class defines the interface for stateless
179encoders/decoders.
180
181To simplify and standardize error handling, the \method{encode()} and
182\method{decode()} methods may implement different error handling
183schemes by providing the \var{errors} string argument. The following
184string values are defined and implemented by all standard Python
185codecs:
186
Fred Drakedc40ac02001-01-22 20:17:54 +0000187\begin{tableii}{l|l}{code}{Value}{Meaning}
188 \lineii{'strict'}{Raise \exception{ValueError} (or a subclass);
189 this is the default.}
190 \lineii{'ignore'}{Ignore the character and continue with the next.}
191 \lineii{'replace'}{Replace with a suitable replacement character;
192 Python will use the official U+FFFD REPLACEMENT
193 CHARACTER for the built-in Unicode codecs.}
194\end{tableii}
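The three error handling schemes in the table can be compared with the
ASCII codec, as in the following sketch. It assumes a current Python 3
interpreter, where the exception raised by \code{'strict'} handling is
\exception{UnicodeEncodeError}, a subclass of \exception{ValueError}.

```python
import codecs

encode_ascii = codecs.getencoder("ascii")
decode_ascii = codecs.getdecoder("ascii")
text = "caf\u00e9"  # the last character cannot be encoded as ASCII

# 'strict' raises ValueError (or a subclass).
try:
    encode_ascii(text, "strict")
    strict_error = None
except ValueError as exc:
    strict_error = exc

# 'ignore' drops the offending character.
ignored, _ = encode_ascii(text, "ignore")

# 'replace' substitutes a replacement marker when encoding ...
replaced, _ = encode_ascii(text, "replace")

# ... and U+FFFD REPLACEMENT CHARACTER when decoding.
decoded, _ = decode_ascii(b"caf\xe9", "replace")
```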


\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}{encode}{input\optional{, errors}}
  Encodes the object \var{input} and returns a tuple (output object,
  length consumed). While codecs are not restricted to use with Unicode, in
  a Unicode context, encoding converts a Unicode object to a plain string
  using a particular character set encoding (e.g., \code{cp1252} or
  \code{iso-8859-1}).

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamCodec} for codecs which have to keep state in order to
  make encoding/decoding efficient.

  The encoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}{decode}{input\optional{, errors}}
  Decodes the object \var{input} and returns a tuple (output object,
  length consumed). In a Unicode context, decoding converts a plain string
  encoded using a particular character set encoding to a Unicode object.

  \var{input} must be an object which provides the \code{bf_getreadbuf}
  buffer slot. Python strings, buffer objects and memory mapped files
  are examples of objects providing this slot.

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamCodec} for codecs which have to keep state in order to
  make encoding/decoding efficient.

  The decoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}
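A stateless codec following this interface might be sketched as below.
The class name \code{Rot13Codec} and its translation table are
hypothetical, and the codec maps text to text rather than text to
bytes, which is permitted because codecs are not restricted to use
with Unicode. The example assumes a current Python 3 interpreter.

```python
import codecs
import string

_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase
# Rotate each ASCII letter by 13 places; other characters are unchanged.
_TABLE = str.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


class Rot13Codec(codecs.Codec):
    """Hypothetical stateless codec; no state is stored on the instance."""

    def encode(self, input, errors="strict"):
        # Return (output object, length consumed); empty input gives ("", 0).
        return input.translate(_TABLE), len(input)

    def decode(self, input, errors="strict"):
        # ROT13 is its own inverse, so decoding reuses the same table.
        return input.translate(_TABLE), len(input)


codec = Rot13Codec()
encoded, consumed = codec.encode("Hello")
decoded, _ = codec.decode(encoded)
empty_result = codec.encode("")
```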

The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
  Constructor for a \class{StreamWriter} instance.

  All stream writers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for writing (binary)
  data.

  The \class{StreamWriter} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}
\end{classdesc}

\begin{methoddesc}{write}{object}
  Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
  Writes the concatenated list of strings to the stream (possibly by
  reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Flushes and resets the codec buffers used for keeping state.

  Calling this method should ensure that the data on the output is put
  into a clean state that allows appending of new fresh data without
  having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
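The writer interface can be exercised with the UTF-8 stream writer, as
in this sketch. It assumes a current Python 3 interpreter and an
in-memory \code{io.BytesIO} stream; the call to \method{tell()} relies
on the writer passing unknown attributes through to that stream.

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer, "strict")

writer.write("caf\u00e9\n")                # encoded, then written to buffer
writer.writelines(["line1\n", "line2\n"])  # concatenated, then written
writer.reset()                             # leave the codec in a clean state

# tell() is not defined by the writer itself; it comes from the stream.
position = writer.tell()
```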


\subsubsection{StreamReader Objects \label{stream-reader-objects}}

The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamReader}{stream\optional{, errors}}
  Constructor for a \class{StreamReader} instance.

  All stream readers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for reading (binary)
  data.

  The \class{StreamReader} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  parameters are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}
\end{classdesc}

\begin{methoddesc}{read}{\optional{size}}
  Decodes data from the stream and returns the resulting object.

  \var{size} indicates the approximate maximum number of bytes to read
  from the stream for decoding purposes. The decoder can modify this
  setting as appropriate. The default value -1 indicates to read and
  decode as much as possible. \var{size} is intended to prevent having
  to decode huge files in one step.

  The method should use a greedy read strategy, meaning that it should
  read as much data as is allowed within the definition of the encoding
  and the given size, e.g.\ if optional encoding endings or state
  markers are available on the stream, these should be read too.
\end{methoddesc}

\begin{methoddesc}{readline}{\optional{size}}
  Read one line from the input stream and return the
  decoded data.

  Unlike the \method{readlines()} method, this method inherits
  the line breaking knowledge from the underlying stream's
  \method{readline()} method; there is currently no support for line
  breaking using the codec decoder due to lack of line buffering.
  Subclasses should however, if possible, try to implement this method
  using their own knowledge of line breaking.

  \var{size}, if given, is passed as size argument to the stream's
  \method{readline()} method.
\end{methoddesc}

\begin{methoddesc}{readlines}{\optional{sizehint}}
  Read all lines available on the input stream and return them as a
  list of lines.

  Line breaks are implemented using the codec's decoder method and are
  included in the list entries.

  \var{sizehint}, if given, is passed as \var{size} argument to the
  stream's \method{read()} method.
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Resets the codec buffers used for keeping state.

  Note that no stream repositioning should take place. This method is
  primarily intended to be able to recover from decoding errors.
\end{methoddesc}

In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.
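The reader methods can be tried out on an in-memory stream, as in the
sketch below. It reflects the behaviour of a current Python 3
interpreter, whose \method{readline()} implementation may differ in
detail from the description above; the sample data is made up.

```python
import codecs
import io

data = "first\nsecond\nthird\n".encode("utf-8")
make_reader = codecs.getreader("utf-8")

# readline() returns one decoded line, including its line break.
reader = make_reader(io.BytesIO(data))
first = reader.readline()

# readlines() returns the remaining lines as a list.
rest = reader.readlines()

# read() without a size decodes as much as possible in one step.
everything = make_reader(io.BytesIO(data)).read()
```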

The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.


\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}

The \class{StreamReaderWriter} allows wrapping streams which work in
both read and write modes.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
  Creates a \class{StreamReaderWriter} instance.
  \var{stream} must be a file-like object.
  \var{Reader} and \var{Writer} must be factory functions or classes
  providing the \class{StreamReader} and \class{StreamWriter} interface
  respectively.
  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
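Constructing a \class{StreamReaderWriter} from the factory functions
returned by \function{lookup()} might look like the following sketch,
which assumes a current Python 3 interpreter and an in-memory stream
opened for both reading and writing.

```python
import codecs
import io

# Items 2 and 3 of the codec tuple are the reader and writer factories.
codec_tuple = codecs.lookup("utf-8")
reader_factory, writer_factory = codec_tuple[2], codec_tuple[3]

stream = io.BytesIO()
wrapped = codecs.StreamReaderWriter(
    stream, reader_factory, writer_factory, "strict"
)

wrapped.write("caf\u00e9")  # handled by the writer
wrapped.seek(0)             # rewind the underlying stream
text = wrapped.read()       # handled by the reader
```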


\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}

The \class{StreamRecoder} provides a frontend/backend view of
encoding data which is sometimes useful when dealing with different
encoding environments.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamRecoder}{stream, encode, decode,
                                 Reader, Writer, errors}
  Creates a \class{StreamRecoder} instance which implements a two-way
  conversion: \var{encode} and \var{decode} work on the frontend (the
  input to \method{read()} and output of \method{write()}) while
  \var{Reader} and \var{Writer} work on the backend (reading and
  writing to the stream).

  You can use these objects to do transparent direct recodings from
  e.g.\ Latin-1 to UTF-8 and back.

  \var{stream} must be a file-like object.

  \var{encode} and \var{decode} must adhere to the \class{Codec}
  interface; \var{Reader} and \var{Writer} must be factory functions or
  classes providing objects of the \class{StreamReader} and
  \class{StreamWriter} interface respectively.

  \var{encode} and \var{decode} are needed for the frontend
  translation, \var{Reader} and \var{Writer} for the backend
  translation. The intermediate format used is determined by the two
  sets of codecs, e.g.\ the Unicode codecs will use Unicode as
  intermediate encoding.

  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
441