\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry, which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})}, which are defined as follows:

\var{encoder} and \var{decoder}: These must be functions or methods
which have the same interface as the \method{encode()} and
\method{decode()} methods of \class{Codec} instances (see Codec
Interface). The functions/methods are expected to work in a stateless
mode.

\var{stream_reader} and \var{stream_writer}: These have to be
factory functions providing the following interface:

\code{factory(\var{stream}, \var{errors}='strict')}

The factory functions must return objects providing the interfaces
defined by the base classes \class{StreamReader} and
\class{StreamWriter}, respectively. Stream codecs can maintain
state.

Possible values for \var{errors} are \code{'strict'} (raise an exception
in case of an encoding error), \code{'replace'} (replace malformed
data with a suitable replacement marker, such as \character{?}) and
\code{'ignore'} (ignore malformed data and continue without further
notice).

If a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
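
As a rough illustration of the search function protocol described for
\function{register()}, the following sketch registers a made-up alias,
\code{demoalias}, that simply reuses the Latin-1 codec. The alias name and
the helper \code{demo_search} are hypothetical. The sketch also assumes a
current interpreter, where the four-element tuple is normally returned as a
\class{codecs.CodecInfo} object (a tuple subclass).

```python
import codecs


def demo_search(encoding):
    """Hypothetical search function that resolves one made-up alias."""
    # The registry passes the encoding name in lower case.
    if encoding != "demoalias":
        return None  # unknown encoding: let other search functions try

    base = codecs.lookup("latin-1")
    # Assumption: current interpreters expect the 4-tuple as a CodecInfo.
    return codecs.CodecInfo(
        name="demoalias",
        encode=base.encode,
        decode=base.decode,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(demo_search)

# The result unpacks into the four entries described above.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("demoalias")
encoded, consumed = encoder("caf\u00e9")
decoded, _ = decoder(encoded)
```

The stateless \var{encoder} and \var{decoder} each return a pair: the
converted object and the amount of input consumed.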

\begin{funcdesc}{lookup}{encoding}
Look up a codec tuple in the Python codec registry and return the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
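
The sketch below shows one way \function{lookup()} might be used, together
with the three \var{errors} values listed under \function{register()}. It
assumes the built-in \code{ascii} codec is available; the encoding name
\code{no-such-encoding-demo} is invented and is expected to be unknown.

```python
import codecs

encoder, decoder, stream_reader, stream_writer = codecs.lookup("ascii")

# 'strict' raises an exception for data the codec cannot represent.
try:
    encoder("na\u00efve", "strict")
    strict_failed = False
except ValueError:
    strict_failed = True

# 'replace' substitutes a marker, 'ignore' silently drops the character.
replaced, _ = encoder("na\u00efve", "replace")
ignored, _ = encoder("na\u00efve", "ignore")

# An encoding that no search function recognizes raises LookupError.
try:
    codecs.lookup("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```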

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\strong{Note:} The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'}, which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. The default is line buffering.
\end{funcdesc}
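
A minimal sketch of \function{open()}, assuming a writable temporary
directory and the built-in UTF-8 codec. The file name \code{demo.txt} is
made up for the example. Newer interpreters may steer users toward the
built-in \function{open()} with an \var{encoding} argument instead.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Unicode written to the wrapper is encoded transparently.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("gr\u00fc\u00df dich\n")

# Reading through the wrapper decodes the stored bytes again.
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# The file itself holds the encoded form.
with open(path, "rb") as f:
    raw = f.read()
```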

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
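
The translation performed by \function{EncodedFile()} can be sketched as
follows. The example assumes an in-memory \class{io.BytesIO} object as the
underlying file and passes the \var{input} and \var{output} encodings
positionally; the choice of UTF-8 and Latin-1 is arbitrary.

```python
import codecs
import io

raw = io.BytesIO()

# Data written to the wrapper is taken to be UTF-8 (input encoding)
# and stored in the underlying file as Latin-1 (output encoding).
wrapped = codecs.EncodedFile(raw, "utf-8", "latin-1")
wrapped.write("caf\u00e9".encode("utf-8"))
stored = raw.getvalue()

# Reading goes the other way: Latin-1 in the file, UTF-8 to the caller.
reader = codecs.EncodedFile(io.BytesIO(stored), "utf-8", "latin-1")
round_trip = reader.read()
```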



...XXX document codec base classes...



The module also provides the following constants, which are useful
for reading and writing platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE},
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
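
As a small sketch of how these constants might be used, the helper below
inspects the start of a data block for a byte order mark. The function
\code{split_bom} is hypothetical, and the example assumes the constants are
byte strings, as they are on current interpreters.

```python
import codecs
import sys


def split_bom(data):
    """Hypothetical helper: return (byte_order, payload) for a data block."""
    if data.startswith(codecs.BOM_BE):
        return "big", data[len(codecs.BOM_BE):]
    if data.startswith(codecs.BOM_LE):
        return "little", data[len(codecs.BOM_LE):]
    return None, data  # no byte order mark present


# BOM is the mark for the platform's native byte order.
native_order, payload = split_bom(codecs.BOM + b"payload")
no_order, untouched = split_bom(b"payload")
```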