\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry, which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function.  Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})}, whose entries are defined
as follows:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the \method{encode()} and
  \method{decode()} methods of \class{Codec} instances (see Codec
  Interface).  The functions/methods are expected to work in a
  stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

        \code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively.  Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an
  exception in case of an encoding error), \code{'replace'} (replace
  malformed data with a suitable replacement marker, such as
  \character{?}) and \code{'ignore'} (ignore malformed data and
  continue without further notice).
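
  The three \var{errors} values can be compared with a small sketch.
  It is illustrative only: it assumes a Python version in which
  \function{lookup()} returns a 4-tuple whose first item is the
  stateless encoder, and it uses the \code{'ascii'} codec purely as
  an example.

```python
import codecs

# Index 0 of the codec tuple is the stateless encoder.
encoder = codecs.lookup("ascii")[0]
text = "caf\u00e9"  # the last character cannot be encoded as ASCII

replaced, _ = encoder(text, "replace")  # unencodable character becomes "?"
ignored, _ = encoder(text, "ignore")    # unencodable character is dropped

try:
    encoder(text, "strict")
    strict_raised = False
except ValueError:
    # Encoding errors are reported as (a subclass of) ValueError.
    strict_raised = True
```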

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
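
A search function along the lines described for \function{register()}
might be sketched as follows.  The encoding name \code{demo_latin1} and
the function \function{demo_search()} are made up for this example, and
the sketch simply reuses the functions of the standard \code{latin-1}
codec instead of implementing a new one.  It assumes a Python version
that accepts a plain 4-tuple from search functions.

```python
import codecs

def demo_search(name):
    # Hypothetical search function: it only knows one made-up name.
    if name != "demo_latin1":
        return None  # unknown encoding, let other search functions try
    # Borrow (encoder, decoder, stream_reader, stream_writer) from latin-1.
    return tuple(codecs.lookup("latin-1"))

codecs.register(demo_search)

# The registered name can now be looked up like any other encoding.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("demo_latin1")
data, consumed = encoder("caf\u00e9")
text, length = decoder(data)
```

Returning \code{None} for every other name keeps the function from
shadowing encodings that it does not handle.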

\begin{funcdesc}{lookup}{encoding}
Look up a codec tuple in the Python codec registry and return the
function tuple as defined above.

Encodings are first looked up in the registry's cache.  If not found,
the list of registered search functions is scanned.  If no codec tuple
is found, a \exception{LookupError} is raised.  Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
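
The behaviour of \function{lookup()} can be tried out with a short
sketch.  It assumes that the \code{utf-8} codec is available and that
the name \code{no-such-encoding-xyz} is not registered anywhere; both
are assumptions made for the example.

```python
import codecs

# Unpack the function tuple returned by the registry.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

encoded, consumed = encoder("h\u00e9llo")  # stateless encode
decoded, length = decoder(encoded)         # stateless decode

# An encoding that no search function knows raises LookupError.
try:
    codecs.lookup("no-such-encoding-xyz")
    found = True
except LookupError:
    found = False
```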

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\strong{Note:} The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs.  Output is also codec-dependent and will usually be Unicode as
well.

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling.  It defaults
to \code{'strict'}, which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function.  It defaults to line buffering.
\end{funcdesc}
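
A round trip through \function{open()} might look like the sketch
below.  The file name \code{demo.txt} and the temporary directory are
assumptions made for the example, and the code presumes a Python
version in which \function{codecs.open()} is available.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # Write Unicode text; the wrapper encodes it as UTF-8 on the way out.
    f = codecs.open(path, "w", "utf-8")
    f.write("gr\u00fc\u00df\n")
    f.close()

    # Read it back; the wrapper decodes the bytes to Unicode again.
    f = codecs.open(path, "r", "utf-8")
    text = f.read()
    f.close()

    # Inspect the encoded bytes that were actually stored in the file.
    with open(path, "rb") as raw:
        raw_bytes = raw.read()
```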

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding.  The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling.  It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
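
The translation performed by \function{EncodedFile()} can be sketched
with an in-memory file.  The use of \class{io.BytesIO} and the choice
of \code{latin-1} as \var{input} and \code{utf-8} as \var{output} are
assumptions made for the example; in current Python versions the
strings involved are byte strings.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for the original file

# input encoding is latin-1, output encoding is utf-8
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")

# Latin-1 data written to the wrapper arrives in the file as UTF-8.
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading through the wrapper translates back to Latin-1.
raw.seek(0)
read_back = wrapped.read()
```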



...XXX document codec base classes...



The module also provides the following constants, which are useful
for reading from and writing to platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE},
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
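
One possible use of these constants is to detect and strip a byte
order mark at the start of some data.  The helper
\function{strip_bom()} below is hypothetical and not part of the
module, and the sketch only considers \constant{BOM_BE} and
\constant{BOM_LE}.

```python
import codecs
import sys

# BOM matches the mark for the platform's native byte order.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
is_native = codecs.BOM == native

def strip_bom(data):
    """Return (payload, bom); bom is None if no known mark was found."""
    for bom in (codecs.BOM_BE, codecs.BOM_LE):
        if data.startswith(bom):
            return data[len(bom):], bom
    return data, None

payload, mark = strip_bom(codecs.BOM_BE + b"\x00A")
```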