\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry, which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})} whose elements must meet
the following requirements:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the \method{encode()} and
  \method{decode()} methods of \class{Codec} instances (see Codec
  Interface). The functions/methods are expected to work in a
  stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

	\code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an
  exception in case of an encoding error), \code{'replace'} (replace
  malformed data with a suitable replacement marker, such as
  \character{?}) and \code{'ignore'} (ignore malformed data and
  continue without further notice).

If a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
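
The search function protocol described above can be sketched as
follows. This is only an illustration, assuming a current Python 3
interpreter: the encoding name \code{demoascii} and the function
\code{demo_search} are hypothetical, and the four tuple entries are
simply borrowed from the built-in ASCII codec rather than written from
scratch.

```python
import codecs


def demo_search(name):
    """Hypothetical search function for the made-up name 'demoascii'."""
    if name != "demoascii":
        # Unknown encoding: tell the registry to try the next function.
        return None
    # Reuse the four entries of the built-in ASCII codec (an assumption
    # made for brevity; a real codec would supply its own functions).
    ascii_entry = codecs.lookup("ascii")
    encoder = ascii_entry[0]
    decoder = ascii_entry[1]
    stream_reader = ascii_entry[2]
    stream_writer = ascii_entry[3]
    return (encoder, decoder, stream_reader, stream_writer)


codecs.register(demo_search)

# The stateless encoder returns the encoded data and the length consumed.
encoded, consumed = codecs.lookup("demoascii")[0]("abc")
```

Returning \code{None} for every other name lets the registry continue
with the remaining search functions.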

\begin{funcdesc}{lookup}{encoding}
Look up a codec tuple in the Python codec registry and return the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
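
A short sketch of a lookup, assuming a current Python 3 interpreter
and the built-in \code{utf-8} codec. The name \code{nosuchcodecdemo}
is a hypothetical encoding chosen because no codec is expected to be
registered under it.

```python
import codecs

# The returned entry unpacks into the four functions described above.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

# Stateless use: each call returns (output, length consumed).
data, consumed = encoder(u"abc")
text, decoded_length = decoder(data)

# An unknown encoding raises LookupError.
try:
    codecs.lookup("nosuchcodecdemo")  # hypothetical, unregistered name
    lookup_failed = False
except LookupError:
    lookup_failed = True
```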

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding=None\optional{, errors='strict'\optional{, buffering=1}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\strong{Note:} The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec dependent and will usually be Unicode as
well.

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'}, which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}
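
The following sketch writes a Unicode string through the wrapped file
and reads it back. It assumes a current Python 3 interpreter; the
file name \code{demo.txt} and the temporary directory are hypothetical
choices made only for the example.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Unicode written to the wrapper is encoded to UTF-8 on disk.
writer = codecs.open(path, "w", encoding="utf-8")
try:
    writer.write(u"caf\u00e9\n")
finally:
    writer.close()

# Reading through the wrapper decodes the bytes back to Unicode.
reader = codecs.open(path, "r", encoding="utf-8")
try:
    text = reader.read()
finally:
    reader.close()

# Inspect the raw bytes to see what was actually stored.
with open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()
```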

\begin{funcdesc}{EncodedFile}{file, input\optional{, output=None\optional{, errors='strict'}}}

Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in
case an encoding error occurs.
\end{funcdesc}
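
As an illustration of the translation step, the sketch below wraps an
in-memory byte stream so that Latin-1 data written to the wrapper
arrives in the underlying stream as UTF-8. It assumes a current
Python 3 interpreter, where the strings involved are byte strings;
\code{io.BytesIO} merely stands in for a real file, and the encodings
are passed positionally as \var{input} and \var{output}.

```python
import codecs
import io

# In-memory stand-in for the original file.
raw = io.BytesIO()

# input encoding: latin-1, output encoding: utf-8
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")

# b"\xe9" is e-acute in Latin-1; it is decoded to Unicode and then
# re-encoded as UTF-8 before reaching the underlying stream.
wrapped.write(b"caf\xe9")

stored = raw.getvalue()
```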



...XXX document codec base classes...



The module also provides the following constants which are useful
for reading and writing to platform dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
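
One way these constants might be used is to guess the byte order of
data from its leading bytes. The helper \code{detect_byte_order} below
is hypothetical and checks only \constant{BOM_BE} and
\constant{BOM_LE}; it assumes a current Python 3 interpreter, where
the constants are byte strings.

```python
import codecs
import sys


def detect_byte_order(data):
    """Hypothetical helper: report the byte order signalled by a BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no byte order mark present


# BOM follows the platform's native byte order.
native_bom = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
bom_matches_platform = (codecs.BOM == native_bom)

order = detect_byte_order(codecs.BOM_BE + b"\x00a")
```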