\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})} whose entries are defined
as follows:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the
  \method{encode()}/\method{decode()} methods of \class{Codec}
  instances (see section~\ref{codec-objects}). The functions/methods
  are expected to work in a stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an
  exception in case of an encoding error), \code{'replace'} (replace
  malformed data with a suitable replacement marker, such as
  \character{?}) and \code{'ignore'} (ignore malformed data and
  continue without further notice).

If a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
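
As a minimal sketch of the search function interface described above,
the following registers a function that serves a single made-up
encoding name. The names \code{demo_search} and \code{demolatin} are
hypothetical, and the sketch assumes a Python 3 interpreter, where
encoders accept \code{str} and return \code{bytes}. It reuses the
functions of the built-in Latin-1 codec rather than implementing a new
encoding.

```python
import codecs


def demo_search(name):
    """Hypothetical search function serving one made-up encoding name."""
    if name != "demolatin":
        # Unknown encoding: return None so other search functions can try.
        return None
    latin1 = codecs.lookup("latin-1")
    # Return the four entries in the documented order:
    # (encoder, decoder, stream_reader, stream_writer).
    return (latin1[0], latin1[1], latin1[2], latin1[3])


codecs.register(demo_search)

encoder, decoder, stream_reader, stream_writer = codecs.lookup("demolatin")
data, consumed = encoder("caf\xe9")
text, _ = decoder(data)
```

Returning \code{None} for every other name keeps the function from
shadowing encodings that other registered search functions provide.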

\begin{funcdesc}{lookup}{encoding}
Looks up a codec tuple in the Python codec registry and returns the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
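
The lookup behaviour can be illustrated with a short sketch. It
assumes a Python 3 interpreter and uses a deliberately nonexistent
encoding name (\code{no_such_encoding_demo}, made up for this example)
to trigger the error path.

```python
import codecs

# A successful lookup yields the four-entry codec tuple.
info = codecs.lookup("utf-8")
encoder, decoder, stream_reader, stream_writer = info
encoded, consumed = encoder("abc")

# A second lookup of the same name is answered from the registry's cache.
cached = codecs.lookup("utf-8") is info

# An unknown name raises LookupError once all search functions have failed.
try:
    codecs.lookup("no_such_encoding_demo")
except LookupError:
    found = False
else:
    found = True
```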

To simplify access to the various codecs, the module provides these
additional functions which use \function{lookup()} for the codec
lookup:

\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}
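
A brief sketch of the two stateless accessors, assuming a Python 3
interpreter (text is \code{str}, encoded data is \code{bytes}). Both
returned functions yield a tuple of the output object and the length
of input consumed.

```python
import codecs

encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# Five characters go in; the accented letter needs two bytes in UTF-8.
data, chars_consumed = encode("h\xe9llo")

# Six bytes go in; five characters come back out.
text, bytes_consumed = decode(data)
```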

\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its StreamReader
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}

\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its StreamWriter
class or factory function.

Raises a \exception{LookupError} in case the encoding cannot be found.
\end{funcdesc}
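
The factories returned by \function{getreader()} and
\function{getwriter()} wrap an existing stream. The sketch below
assumes a Python 3 interpreter and uses in-memory \code{io.BytesIO}
objects in place of real files.

```python
import codecs
import io

# Wrap a binary stream so that text written to it is stored as UTF-8.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write("gr\xfc\xdf")

# Wrap a second stream holding the same bytes and decode them again.
reader = codecs.getreader("utf-8")(io.BytesIO(raw.getvalue()))
text = reader.read()
```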

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\note{The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.}

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'} which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}
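
A small round trip through \function{codecs.open()} might look like
the following sketch. It assumes a Python 3 interpreter; the file name
\code{demo.txt} and the temporary directory are made up for the
example.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Text written through the wrapper is stored as Latin-1 bytes.
    with codecs.open(path, "w", encoding="latin-1") as f:
        f.write("caf\xe9\n")

    # Reading through the wrapper decodes the bytes back into text.
    with codecs.open(path, "r", encoding="latin-1") as f:
        text = f.read()

    # The file on disk holds the encoded form.
    with open(path, "rb") as f:
        raw = f.read()
```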

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of file which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
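
The translation performed by \function{EncodedFile()} can be sketched
as follows, assuming a Python 3 interpreter, where the strings on both
sides of the wrapper are \code{bytes} objects. The encodings are
passed positionally, in the order (\var{input}, \var{output}).

```python
import codecs
import io

backend = io.BytesIO()

# Data written to the wrapper is UTF-8; the underlying file gets Latin-1.
wrapped = codecs.EncodedFile(backend, "utf-8", "latin-1")
wrapped.write("caf\xe9".encode("utf-8"))
stored = backend.getvalue()

# Reading through the wrapper translates in the opposite direction.
backend.seek(0)
read_back = wrapped.read()
```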

The module also provides the following constants which are useful
for reading and writing to platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
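
One way to use these constants is to inspect the start of a byte
stream. The helper \code{byte_order()} below is hypothetical and only
checks the two-byte marks; the sketch assumes a Python 3 interpreter,
where the constants are \code{bytes} objects.

```python
import codecs
import sys


def byte_order(data):
    """Hypothetical helper: report the byte order signalled by a BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no byte order mark present


# BOM matches whichever mark corresponds to the native byte order.
native = codecs.BOM_BE if sys.byteorder == "big" else codecs.BOM_LE
is_native = codecs.BOM == native

order_be = byte_order(codecs.BOM_BE + b"\x00a")
order_le = byte_order(codecs.BOM_LE + b"a\x00")
order_none = byte_order(b"plain")
```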


\begin{seealso}
  \seeurl{http://sourceforge.net/projects/python-codecs/}{A
          SourceForge project working on additional support for Asian
          codecs for use with Python. They are in the early stages of
          development at the time of this writing; look in their
          FTP area for downloadable files.}
\end{seealso}


\subsection{Codec Base Classes}

The \module{codecs} module defines a set of base classes which define
the interface and can also be used to easily write your own codecs for
use in Python.

Each codec has to define four interfaces to make it usable as a codec
in Python: stateless encoder, stateless decoder, stream reader and
stream writer. The stream readers and writers typically reuse the
stateless encoder/decoder to implement the file protocols.

The \class{Codec} class defines the interface for stateless
encoders/decoders.

To simplify and standardize error handling, the \method{encode()} and
\method{decode()} methods may implement different error handling
schemes by providing the \var{errors} string argument. The following
string values are defined and implemented by all standard Python
codecs:

\begin{tableii}{l|l}{code}{Value}{Meaning}
  \lineii{'strict'}{Raise \exception{ValueError} (or a subclass);
                    this is the default.}
  \lineii{'ignore'}{Ignore the character and continue with the next.}
  \lineii{'replace'}{Replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the built-in Unicode codecs.}
\end{tableii}
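
The three error handling schemes in the table can be compared side by
side. This sketch assumes a Python 3 interpreter and uses the built-in
ASCII codec, which cannot represent the accented character in the
sample text.

```python
import codecs

encode_ascii = codecs.getencoder("ascii")
decode_ascii = codecs.getdecoder("ascii")
text = "na\xefve"

# 'strict' raises ValueError or a subclass of it.
try:
    encode_ascii(text, "strict")
except ValueError:
    strict_failed = True
else:
    strict_failed = False

# 'ignore' drops the offending character.
ignored = encode_ascii(text, "ignore")[0]

# 'replace' substitutes a marker: '?' when encoding, U+FFFD when decoding.
replaced = encode_ascii(text, "replace")[0]
decoded = decode_ascii(b"na\xefve", "replace")[0]
```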


\subsubsection{Codec Objects \label{codec-objects}}

The \class{Codec} class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

\begin{methoddesc}{encode}{input\optional{, errors}}
  Encodes the object \var{input} and returns a tuple (output object,
  length consumed).

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamWriter} for codecs which have to keep state in order to
  make encoding efficient.

  The encoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}

\begin{methoddesc}{decode}{input\optional{, errors}}
  Decodes the object \var{input} and returns a tuple (output object,
  length consumed).

  \var{input} must be an object which provides the \code{bf_getreadbuf}
  buffer slot. Python strings, buffer objects and memory mapped files
  are examples of objects providing this slot.

  \var{errors} defines the error handling to apply. It defaults to
  \code{'strict'} handling.

  The method may not store state in the \class{Codec} instance. Use
  \class{StreamReader} for codecs which have to keep state in order to
  make decoding efficient.

  The decoder must be able to handle zero length input and return an
  empty object of the output object type in this situation.
\end{methoddesc}
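
A stateless codec following this interface could be sketched as shown
below. The class name \code{DemoLatin1Codec} is hypothetical; the
sketch assumes a Python 3 interpreter and leans on the built-in
Latin-1 conversion instead of implementing a mapping of its own.

```python
import codecs


class DemoLatin1Codec(codecs.Codec):
    """Hypothetical stateless codec built on the Latin-1 conversion."""

    def encode(self, input, errors="strict"):
        # Return (output object, length of input consumed).
        return input.encode("latin-1", errors), len(input)

    def decode(self, input, errors="strict"):
        # Accept any object exposing a readable buffer, not only bytes.
        data = bytes(input)
        return data.decode("latin-1", errors), len(data)


codec = DemoLatin1Codec()
encoded = codec.encode("\xe9t\xe9")
decoded = codec.decode(memoryview(b"\xe9t\xe9"))

# Zero length input yields an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```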

The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.
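
The general shape of such a submodule can be sketched as follows. This
is an illustration under assumptions, not a copy of any particular
module: it assumes a Python 3 interpreter and delegates to the
Latin-1 functions \code{codecs.latin_1_encode} and
\code{codecs.latin_1_decode}, so that the stream classes need no code
of their own.

```python
import codecs
import io


class Codec(codecs.Codec):
    # Stateless encoder and decoder; the real work is delegated.
    def encode(self, input, errors="strict"):
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        return codecs.latin_1_decode(input, errors)


class StreamWriter(Codec, codecs.StreamWriter):
    # write() and writelines() come from the base class and call encode().
    pass


class StreamReader(Codec, codecs.StreamReader):
    # read() and readline() come from the base class and call decode().
    pass


raw = io.BytesIO()
StreamWriter(raw).write("\xe9t\xe9")
text = StreamReader(io.BytesIO(raw.getvalue())).read()
```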


\subsubsection{StreamWriter Objects \label{stream-writer-objects}}

The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
  Constructor for a \class{StreamWriter} instance.

  All stream writers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for writing (binary)
  data.

  The \class{StreamWriter} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}
\end{classdesc}

\begin{methoddesc}{write}{object}
  Writes the object's contents encoded to the stream.
\end{methoddesc}

\begin{methoddesc}{writelines}{list}
  Writes the concatenated list of strings to the stream (possibly by
  reusing the \method{write()} method).
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Flushes and resets the codec buffers used for keeping state.

  Calling this method should ensure that the data on the output is put
  into a clean state that allows appending of new fresh data without
  having to rescan the whole stream to recover state.
\end{methoddesc}

In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
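
The writer methods can be exercised on an in-memory stream. The sketch
assumes a Python 3 interpreter and uses the built-in ASCII stream
writer with \code{'replace'} error handling; \code{getvalue()} is a
method of the underlying \code{io.BytesIO} object, reached through the
writer.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, "replace")

# The accented character cannot be encoded and is replaced by '?'.
writer.write("caf\xe9\n")

# writelines() writes the concatenation of the given strings.
writer.writelines(["one\n", "two\n"])

# reset() leaves the output in a clean state for further appends.
writer.reset()

# Methods the writer does not define are taken from the stream.
contents = writer.getvalue()
```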


\subsubsection{StreamReader Objects \label{stream-reader-objects}}

The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.

\begin{classdesc}{StreamReader}{stream\optional{, errors}}
  Constructor for a \class{StreamReader} instance.

  All stream readers must provide this constructor interface. They are
  free to add additional keyword arguments, but only the ones defined
  here are used by the Python codec registry.

  \var{stream} must be a file-like object open for reading (binary)
  data.

  The \class{StreamReader} may implement different error handling
  schemes by providing the \var{errors} keyword argument. These
  values are defined:

  \begin{itemize}
    \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
                          this is the default.
    \item \code{'ignore'} Ignore the character and continue with the next.
    \item \code{'replace'} Replace with a suitable replacement character.
  \end{itemize}
\end{classdesc}

\begin{methoddesc}{read}{\optional{size}}
  Decodes data from the stream and returns the resulting object.

  \var{size} indicates the approximate maximum number of bytes to read
  from the stream for decoding purposes. The decoder can modify this
  setting as appropriate. The default value -1 indicates to read and
  decode as much as possible. \var{size} is intended to prevent having
  to decode huge files in one step.

  The method should use a greedy read strategy, meaning that it should
  read as much data as is allowed within the definition of the encoding
  and the given size, e.g.\ if optional encoding endings or state
  markers are available on the stream, these should be read too.
\end{methoddesc}

\begin{methoddesc}{readline}{\optional{size}}
  Read one line from the input stream and return the
  decoded data.

  Unlike the \method{readlines()} method, this method inherits
  the line breaking knowledge from the underlying stream's
  \method{readline()} method; there is currently no support for line
  breaking using the codec decoder due to lack of line buffering.
  Subclasses should however, if possible, try to implement this method
  using their own knowledge of line breaking.

  \var{size}, if given, is passed as size argument to the stream's
  \method{readline()} method.
\end{methoddesc}

\begin{methoddesc}{readlines}{\optional{sizehint}}
  Read all lines available on the input stream and return them as a
  list of lines.

  Line breaks are implemented using the codec's decoder method and are
  included in the list entries.

  \var{sizehint}, if given, is passed as \var{size} argument to the
  stream's \method{read()} method.
\end{methoddesc}

\begin{methoddesc}{reset}{}
  Resets the codec buffers used for keeping state.

  Note that no stream repositioning should take place. This method is
  primarily intended to be able to recover from decoding errors.
\end{methoddesc}

In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.
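
The reader methods can be tried on an in-memory stream. The sketch
assumes a Python 3 interpreter and the built-in UTF-8 stream reader;
the sample text is made up for the example.

```python
import codecs
import io

payload = "first\nsecond\nthird\n".encode("utf-8")

reader = codecs.getreader("utf-8")(io.BytesIO(payload))

# readline() returns one decoded line, line break included.
first = reader.readline()

# readlines() returns the remaining lines as a list.
rest = reader.readlines()

# reset() clears codec state without repositioning the stream.
reader.reset()

# read() without a size decodes everything that is available.
whole = codecs.getreader("utf-8")(io.BytesIO(payload)).read()
```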

The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.


\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}

The \class{StreamReaderWriter} allows wrapping streams which work in
both read and write modes.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
  Creates a \class{StreamReaderWriter} instance.
  \var{stream} must be a file-like object.
  \var{Reader} and \var{Writer} must be factory functions or classes
  providing the \class{StreamReader} and \class{StreamWriter} interface
  respectively.
  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
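
Constructing a \class{StreamReaderWriter} from the result of
\function{lookup()} might look like this sketch. It assumes a Python 3
interpreter and an in-memory stream; the stream reader and stream
writer factories are taken from positions 2 and 3 of the codec tuple.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# info[2] is the stream reader factory, info[3] the stream writer factory.
rw = codecs.StreamReaderWriter(raw, info[2], info[3], "strict")

rw.write("sch\xf6n\n")

# Rewind the underlying stream, then read the text back through the codec.
rw.seek(0)
line = rw.readline()
```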


\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}

The \class{StreamRecoder} provides a frontend/backend view of
encoding data which is sometimes useful when dealing with different
encoding environments.

The design is such that one can use the factory functions returned by
the \function{lookup()} function to construct the instance.

\begin{classdesc}{StreamRecoder}{stream, encode, decode,
                                 Reader, Writer, errors}
  Creates a \class{StreamRecoder} instance which implements a two-way
  conversion: \var{encode} and \var{decode} work on the frontend (the
  input to \method{read()} and output of \method{write()}) while
  \var{Reader} and \var{Writer} work on the backend (reading and
  writing to the stream).

  You can use these objects to do transparent direct recodings from
  e.g.\ Latin-1 to UTF-8 and back.

  \var{stream} must be a file-like object.

  \var{encode} and \var{decode} must adhere to the \class{Codec}
  interface; \var{Reader} and \var{Writer} must be factory functions or
  classes providing objects of the \class{StreamReader} and
  \class{StreamWriter} interface respectively.

  \var{encode} and \var{decode} are needed for the frontend
  translation, \var{Reader} and \var{Writer} for the backend
  translation. The intermediate format used is determined by the two
  sets of codecs, e.g.\ the Unicode codecs will use Unicode as
  intermediate encoding.

  Error handling is done in the same way as defined for the
  stream readers and writers.
\end{classdesc}

\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
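
A recoding between a UTF-8 frontend and a Latin-1 backend can be
sketched as follows. The sketch assumes a Python 3 interpreter, where
both sides exchange \code{bytes}, and takes all four callables from
\function{lookup()} results.

```python
import codecs
import io

frontend = codecs.lookup("utf-8")    # encoding seen by the caller
backend = codecs.lookup("latin-1")   # encoding stored in the stream

raw = io.BytesIO()
recoder = codecs.StreamRecoder(
    raw,
    frontend[0],  # encode: frontend translation for read()
    frontend[1],  # decode: frontend translation for write()
    backend[2],   # Reader: backend stream reader factory
    backend[3],   # Writer: backend stream writer factory
    "strict",
)

# UTF-8 bytes written by the caller end up as Latin-1 in the stream.
recoder.write("d\xe9j\xe0".encode("utf-8"))
stored = raw.getvalue()

# Reading returns the stored Latin-1 data recoded to UTF-8.
raw.seek(0)
round_trip = recoder.read()
```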