Blame - Doc/lib/libcodecs.tex - platform/external/python/cpython3

blob: 1806ef0addd3559fd989eeb7f1471f453dbdc634

Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	1	\section{\module{codecs} ---
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	2	Codec registry and base classes}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	3
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	4	\declaremodule{standard}{codecs}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	5	\modulesynopsis{Encode and decode data and streams.}
				6	\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
				7	\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	8	\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	9
				10	\index{Unicode}
				11	\index{Codecs}
				12	\indexii{Codecs}{encode}
				13	\indexii{Codecs}{decode}
				14	\index{streams}
				15	\indexii{stackable}{streams}
				16
				17
				18	This module defines base classes for standard Python codecs (encoders
				19	and decoders) and provides access to the internal Python codec
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	20	registry which manages the codec and error handling lookup process.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	21
				22	It defines the following functions:
				23
				24	\begin{funcdesc}{register}{search_function}
				25	Register a codec search function. Search functions are expected to
				26	take one argument, the encoding name in all lower case letters, and
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	27	return a \class{CodecInfo} object having the following attributes:
				28
				29	\begin{itemize}
				30	\item \code{name} The name of the encoding;
				31	\item \code{encoder} The stateless encoding function;
				32	\item \code{decoder} The stateless decoding function;
				33	\item \code{incrementalencoder} An incremental encoder class or factory function;
				34	\item \code{incrementaldecoder} An incremental decoder class or factory function;
				35	\item \code{streamwriter} A stream writer class or factory function;
				36	\item \code{streamreader} A stream reader class or factory function.
				37	\end{itemize}
				38
				39	The various functions or classes take the following arguments:
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	40
				41	\var{encoder} and \var{decoder}: These must be functions or methods
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	42	which have the same interface as the
				43	\method{encode()}/\method{decode()} methods of Codec instances (see
				44	Codec Interface). The functions/methods are expected to work in a
				45	stateless mode.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	46
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	47	\var{incrementalencoder} and \var{incrementaldecoder}: These have to be
				48	factory functions providing the following interface:
				49
				50	\code{factory(\var{errors}='strict')}
				51
				52	The factory functions must return objects providing the interfaces
				53	defined by the base classes \class{IncrementalEncoder} and
				54	\class{IncrementalDecoder}, respectively. Incremental codecs can maintain
				55	state.
				56
				57	\var{streamreader} and \var{streamwriter}: These have to be
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	58	factory functions providing the following interface:
				59
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	60	\code{factory(\var{stream}, \var{errors}='strict')}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	61
				62	The factory functions must return objects providing the interfaces
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	63	defined by the base classes \class{StreamReader} and
				64	\class{StreamWriter}, respectively. Stream codecs can maintain
				65	state.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	66
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	67	Possible values for errors are \code{'strict'} (raise an exception
				68	in case of an encoding error), \code{'replace'} (replace malformed
Walter Dörwald	72f8616	2002-11-19 21:51:35 +0000	[diff] [blame]	69	data with a suitable replacement marker, such as \character{?}),
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	70	\code{'ignore'} (ignore malformed data and continue without further
Walter Dörwald	72f8616	2002-11-19 21:51:35 +0000	[diff] [blame]	71	notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
				72	character reference (for encoding only)) and \code{'backslashreplace'}
				73	(replace with backslashed escape sequences (for encoding only)) as
				74	well as any other error handling name defined via
				75	\function{register_error()}.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	76
				77	In case a search function cannot find a given encoding, it should
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	78	return \code{None}.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	79	\end{funcdesc}
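The registration protocol described above can be sketched as follows. This is only an illustration, assuming a Python 3 interpreter: the codec name \code{upperdemo} and the helpers \code{demo_encode}, \code{demo_decode} and \code{demo_search} are hypothetical, and the \class{CodecInfo} constructor is called with the keyword names \code{encode} and \code{decode} used by current CPython.

```python
import codecs


def demo_encode(input, errors="strict"):
    # Hypothetical stateless encoder: upper-case the text, then encode as ASCII.
    # Returns the tuple (output object, length consumed).
    return input.upper().encode("ascii", errors), len(input)


def demo_decode(input, errors="strict"):
    # Hypothetical stateless decoder: accepts any bytes-like object.
    data = bytes(input)
    return data.decode("ascii", errors), len(data)


def demo_search(name):
    # The registry passes the encoding name in lower case.
    if name != "upperdemo":
        return None  # not our codec; other search functions get a chance
    return codecs.CodecInfo(name="upperdemo",
                            encode=demo_encode,
                            decode=demo_decode)


codecs.register(demo_search)

info = codecs.lookup("upperdemo")
encoded, consumed = info.encode("abc")
decoded, used = info.decode(b"ABC")
```

Returning \code{None} for every name the function does not recognise lets the registry continue with the remaining search functions.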
				80
				81	\begin{funcdesc}{lookup}{encoding}
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	82	Looks up the codec info in the Python codec registry and returns a
				83	\class{CodecInfo} object as defined above.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	84
				85	Encodings are first looked up in the registry's cache. If not found,
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	86	the list of registered search functions is scanned. If no \class{CodecInfo}
				87	object is found, a \exception{LookupError} is raised. Otherwise, the
				88	\class{CodecInfo} object is stored in the cache and returned to the caller.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	89	\end{funcdesc}
				90
Skip Montanaro	b02ea65	2002-04-17 19:33:06 +0000	[diff] [blame]	91	To simplify access to the various codecs, the module provides these
Marc-André Lemburg	494f2ae	2001-09-19 11:33:31 +0000	[diff] [blame]	92	additional functions which use \function{lookup()} for the codec
				93	lookup:
				94
				95	\begin{funcdesc}{getencoder}{encoding}
				96	Look up the codec for the given encoding and return its encoder
				97	function.
				98
				99	Raises a \exception{LookupError} in case the encoding cannot be found.
				100	\end{funcdesc}
				101
				102	\begin{funcdesc}{getdecoder}{encoding}
				103	Look up the codec for the given encoding and return its decoder
				104	function.
				105
				106	Raises a \exception{LookupError} in case the encoding cannot be found.
				107	\end{funcdesc}
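As a small illustration of \function{getencoder()} and \function{getdecoder()}, assuming Python 3 and the standard \code{utf-8} codec, the returned functions are the stateless encoder and decoder and each returns a tuple of the output and the length consumed. The encoding name \code{no_such_codec_demo} is a made-up name used only to trigger the \exception{LookupError}.

```python
import codecs

encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# One character in, two bytes out.
data, chars_consumed = encode("\u00e9")
text, bytes_consumed = decode(data)

# An unknown encoding name raises LookupError.
try:
    codecs.getencoder("no_such_codec_demo")
except LookupError:
    lookup_failed = True
else:
    lookup_failed = False
```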
				108
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	109	\begin{funcdesc}{getincrementalencoder}{encoding}
				110	Look up the codec for the given encoding and return its incremental encoder
				111	class or factory function.
				112
				113	Raises a \exception{LookupError} in case the encoding cannot be found or the
				114	codec doesn't support an incremental encoder.
				115	\end{funcdesc}
				116
				117	\begin{funcdesc}{getincrementaldecoder}{encoding}
				118	Look up the codec for the given encoding and return its incremental decoder
				119	class or factory function.
				120
				121	Raises a \exception{LookupError} in case the encoding cannot be found or the
				122	codec doesn't support an incremental decoder.
				123	\end{funcdesc}
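The following sketch, assuming Python 3 and the standard \code{utf-8} codec, shows why an incremental decoder is useful: a multi-byte sequence split across two chunks is buffered until it is complete.

```python
import codecs

# getincrementaldecoder() returns a class or factory, not an instance.
decoder_factory = codecs.getincrementaldecoder("utf-8")
decoder = decoder_factory(errors="strict")

# The first byte of a two-byte sequence produces no output yet.
first = decoder.decode(b"\xc3")
# The second byte completes the character; final=True marks the last call.
second = decoder.decode(b"\xa9", final=True)
```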
				124
Marc-André Lemburg	494f2ae	2001-09-19 11:33:31 +0000	[diff] [blame]	125	\begin{funcdesc}{getreader}{encoding}
				126	Look up the codec for the given encoding and return its StreamReader
				127	class or factory function.
				128
				129	Raises a \exception{LookupError} in case the encoding cannot be found.
				130	\end{funcdesc}
				131
				132	\begin{funcdesc}{getwriter}{encoding}
				133	Look up the codec for the given encoding and return its StreamWriter
				134	class or factory function.
				135
				136	Raises a \exception{LookupError} in case the encoding cannot be found.
				137	\end{funcdesc}
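A brief sketch of \function{getwriter()} and \function{getreader()}, assuming Python 3 and using in-memory \class{io.BytesIO} objects in place of real files:

```python
import codecs
import io

raw = io.BytesIO()

# The returned factory wraps a binary stream.
writer = codecs.getwriter("utf-8")(raw)
writer.write("caf\u00e9\n")
stored = raw.getvalue()

# Read the same bytes back through a stream reader.
reader = codecs.getreader("utf-8")(io.BytesIO(stored))
text = reader.read()
```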
				138
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	139	\begin{funcdesc}{register_error}{name, error_handler}
				140	Register the error handling function \var{error_handler} under the
Raymond Hettinger	8a64d40	2002-09-08 22:26:13 +0000	[diff] [blame]	141	name \var{name}. \var{error_handler} will be called during encoding
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	142	and decoding in case of an error, when \var{name} is specified as the
Walter Dörwald	2e0b18a	2003-01-31 17:19:08 +0000	[diff] [blame]	143	errors parameter.
				144
				145	For encoding \var{error_handler} will be called with a
				146	\exception{UnicodeEncodeError} instance, which contains information about
				147	the location of the error. The error handler must either raise this or
				148	a different exception or return a tuple with a replacement for the
				149	unencodable part of the input and a position where encoding should
				150	continue. The encoder will encode the replacement and continue encoding
				151	the original input at the specified position. Negative position values
				152	will be treated as being relative to the end of the input string. If the
				153	resulting position is out of bounds, an \exception{IndexError} will be raised.
				154
				155	Decoding and translating work similarly, except that \exception{UnicodeDecodeError}
				156	or \exception{UnicodeTranslateError} will be passed to the handler and
				157	that the replacement from the error handler will be put into the output
				158	directly.
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	159	\end{funcdesc}
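A minimal sketch of a custom error handler for encoding, assuming Python 3. The handler name \code{underscore_demo} is hypothetical; it replaces each unencodable character with an underscore and resumes right after the failing span.

```python
import codecs


def underscore_demo(exc):
    # Hypothetical handler: only encoding errors are handled here.
    if isinstance(exc, UnicodeEncodeError):
        replacement = "_" * (exc.end - exc.start)
        # Return (replacement, position where encoding should continue).
        return replacement, exc.end
    raise exc


codecs.register_error("underscore_demo", underscore_demo)

result = "a\u00f1b".encode("ascii", "underscore_demo")
```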
				160
				161	\begin{funcdesc}{lookup_error}{name}
				162	Return the error handler previously registered under the name \var{name}.
				163
				164	Raises a \exception{LookupError} in case the handler cannot be found.
				165	\end{funcdesc}
				166
				167	\begin{funcdesc}{strict_errors}{exception}
				168	Implements the \code{strict} error handling.
				169	\end{funcdesc}
				170
				171	\begin{funcdesc}{replace_errors}{exception}
				172	Implements the \code{replace} error handling.
				173	\end{funcdesc}
				174
				175	\begin{funcdesc}{ignore_errors}{exception}
				176	Implements the \code{ignore} error handling.
				177	\end{funcdesc}
				178
				179	\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
				180	Implements the \code{xmlcharrefreplace} error handling.
				181	\end{funcdesc}
				182
				183	\begin{funcdesc}{backslashreplace_errors}{exception}
				184	Implements the \code{backslashreplace} error handling.
				185	\end{funcdesc}
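The predefined handlers can also be called directly with an exception instance. The sketch below assumes Python 3; the exception is constructed by hand purely for illustration, and its reason string is made up.

```python
import codecs

# Hand-built error: the character at index 1 of the input cannot be encoded.
error = UnicodeEncodeError("ascii", "a\u00f1b", 1, 2, "demo reason")

# Each handler returns (replacement, position to resume at).
replacement, resume_at = codecs.replace_errors(error)
xml_replacement, xml_resume_at = codecs.xmlcharrefreplace_errors(error)

# lookup_error() gives access to the same handlers by name.
by_name = codecs.lookup_error("replace")(error)
```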
				186
Walter Dörwald	1a7a894	2002-11-02 13:32:07 +0000	[diff] [blame]	187	To simplify working with encoded files or streams, the module
				188	also defines these utility functions:
				189
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	190	\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
				191	errors\optional{, buffering}}}}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	192	Open an encoded file using the given \var{mode} and return
				193	a wrapped version providing transparent encoding/decoding.
				194
Fred Drake	0aa811c	2001-10-20 04:24:09 +0000	[diff] [blame]	195	\note{The wrapped version will only accept the object format
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	196	defined by the codecs, i.e.\ Unicode objects for most built-in
				197	codecs. Output is also codec-dependent and will usually be Unicode as
Fred Drake	0aa811c	2001-10-20 04:24:09 +0000	[diff] [blame]	198	well.}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	199
				200	\var{encoding} specifies the encoding which is to be used for the
Raymond Hettinger	7e43110	2003-09-22 15:00:55 +0000	[diff] [blame]	201	file.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	202
				203	\var{errors} may be given to define the error handling. It defaults
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	204	to \code{'strict'} which causes a \exception{ValueError} to be raised
				205	in case an encoding error occurs.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	206
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	207	\var{buffering} has the same meaning as for the built-in
				208	\function{open()} function. It defaults to line buffered.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	209	\end{funcdesc}
				210
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	211	\begin{funcdesc}{EncodedFile}{file, input\optional{,
				212	output\optional{, errors}}}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	213	Return a wrapped version of file which provides transparent
				214	encoding translation.
				215
				216	Strings written to the wrapped file are interpreted according to the
				217	given \var{input} encoding and then written to the original file as
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	218	strings using the \var{output} encoding. The intermediate encoding will
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	219	usually be Unicode but depends on the specified codecs.
				220
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	221	If \var{output} is not given, it defaults to \var{input}.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	222
				223	\var{errors} may be given to define the error handling. It defaults to
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	224	\code{'strict'}, which causes \exception{ValueError} to be raised in case
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	225	an encoding error occurs.
				226	\end{funcdesc}
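A short sketch of \function{EncodedFile()}, assuming Python 3, where the wrapper accepts and returns byte strings. Here Latin-1 data written to the wrapper is stored as UTF-8 in the underlying stream; an in-memory \class{io.BytesIO} stands in for a real file.

```python
import codecs
import io

raw = io.BytesIO()

# Data handed to the wrapper is Latin-1; the underlying file holds UTF-8.
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading through the wrapper converts back to Latin-1.
raw.seek(0)
round_trip = wrapped.read()
```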
				227
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	228	\begin{funcdesc}{iterencode}{iterable, encoding\optional{, errors}}
				229	Uses an incremental encoder to iteratively encode the input provided by
				230	\var{iterable}. This function is a generator. \var{errors} (as well as
				231	any other keyword argument) is passed through to the incremental encoder.
				232	\end{funcdesc}
				233
				234	\begin{funcdesc}{iterdecode}{iterable, encoding\optional{, errors}}
				235	Uses an incremental decoder to iteratively decode the input provided by
				236	\var{iterable}. This function is a generator. \var{errors} (as well as
				237	any other keyword argument) is passed through to the incremental decoder.
				238	\end{funcdesc}
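As an illustration of the two generators, assuming Python 3 and the standard \code{utf-8} codec: the decoder input deliberately splits a two-byte sequence across chunks.

```python
import codecs

# Encode an iterable of text chunks lazily.
encoded_chunks = list(codecs.iterencode(["caf", "\u00e9"], "utf-8"))

# Decode an iterable of byte chunks; the split sequence is reassembled.
decoded_text = "".join(codecs.iterdecode([b"caf\xc3", b"\xa9"], "utf-8"))
```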
				239
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	240	The module also provides the following constants which are useful
				241	for reading and writing to platform-dependent files:
				242
				243	\begin{datadesc}{BOM}
				244	\dataline{BOM_BE}
				245	\dataline{BOM_LE}
Walter Dörwald	474458d	2002-06-04 15:16:29 +0000	[diff] [blame]	246	\dataline{BOM_UTF8}
				247	\dataline{BOM_UTF16}
				248	\dataline{BOM_UTF16_BE}
				249	\dataline{BOM_UTF16_LE}
				250	\dataline{BOM_UTF32}
				251	\dataline{BOM_UTF32_BE}
				252	\dataline{BOM_UTF32_LE}
				253	These constants define various encodings of the Unicode byte order mark
				254	(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
				255	used in the stream or file and in UTF-8 as a Unicode signature.
				256	\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
				257	\constant{BOM_UTF16_LE} depending on the platform's native byte order,
				258	\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
				259	for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
				260	The others represent the BOM in UTF-8 and UTF-32 encodings.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	261	\end{datadesc}
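A small sketch of how the constants might be used, assuming Python 3, where they are byte strings. The helper \code{strip_utf8_bom} is hypothetical and only illustrates removing a UTF-8 signature from raw data.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE


def strip_utf8_bom(data):
    # Hypothetical helper: drop a leading UTF-8 signature if present.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_bom(codecs.BOM_UTF8 + b"abc")
```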
				262
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	263
Walter Dörwald	d4bfe2c	2005-11-25 17:17:12 +0000	[diff] [blame]	264	\subsection{Codec Base Classes \label{codec-base-classes}}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	265
Fred Drake	9984e70	2005-10-20 17:52:05 +0000	[diff] [blame]	266	The \module{codecs} module defines a set of base classes which define the
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	267	interface and can also be used to easily write your own codecs for use
				268	in Python.
				269
				270	Each codec has to define four interfaces to make it usable as a codec in
				271	Python: stateless encoder, stateless decoder, stream reader and stream
				272	writer. The stream reader and writers typically reuse the stateless
				273	encoder/decoder to implement the file protocols.
				274
				275	The \class{Codec} class defines the interface for stateless
				276	encoders/decoders.
				277
				278	To simplify and standardize error handling, the \method{encode()} and
				279	\method{decode()} methods may implement different error handling
				280	schemes by providing the \var{errors} string argument. The following
				281	string values are defined and implemented by all standard Python
				282	codecs:
				283
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	284	\begin{tableii}{l\|l}{code}{Value}{Meaning}
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	285	\lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	286	this is the default.}
				287	\lineii{'ignore'}{Ignore the character and continue with the next.}
				288	\lineii{'replace'}{Replace with a suitable replacement character;
				289	Python will use the official U+FFFD REPLACEMENT
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	290	CHARACTER for the built-in Unicode codecs on
				291	decoding and '?' on encoding.}
				292	\lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
				293	character reference (only for encoding).}
				294	\lineii{'backslashreplace'}{Replace with backslashed escape sequences
				295	(only for encoding).}
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	296	\end{tableii}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	297
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	298	The set of allowed values can be extended via \function{register_error()}.
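The effect of each predefined value can be seen with the standard \code{ascii} and \code{utf-8} codecs. This is a sketch assuming Python 3, where text is \class{str} and encoded data is \class{bytes}.

```python
text = "a\u00f1b"  # the middle character cannot be encoded as ASCII

results = {
    "ignore": text.encode("ascii", "ignore"),
    "replace": text.encode("ascii", "replace"),
    "xmlcharrefreplace": text.encode("ascii", "xmlcharrefreplace"),
    "backslashreplace": text.encode("ascii", "backslashreplace"),
}

# 'strict' raises a UnicodeError subclass instead of producing output.
try:
    text.encode("ascii", "strict")
except UnicodeError:
    strict_raised = True
else:
    strict_raised = False

# On decoding, 'replace' inserts U+FFFD for the malformed byte.
decoded = b"a\xffb".decode("utf-8", "replace")
```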
				299
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	300
				301	\subsubsection{Codec Objects \label{codec-objects}}
				302
				303	The \class{Codec} class defines these methods which also define the
				304	function interfaces of the stateless encoder and decoder:
				305
				306	\begin{methoddesc}{encode}{input\optional{, errors}}
				307	Encodes the object \var{input} and returns a tuple (output object,
Skip Montanaro	6c7bc31	2002-04-16 15:12:10 +0000	[diff] [blame]	308	length consumed). While codecs are not restricted to use with Unicode, in
				309	a Unicode context, encoding converts a Unicode object to a plain string
				310	using a particular character set encoding (e.g., \code{cp1252} or
				311	\code{iso-8859-1}).
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	312
				313	\var{errors} defines the error handling to apply. It defaults to
				314	\code{'strict'} handling.
				315
				316	The method may not store state in the \class{Codec} instance. Use
				317	\class{StreamWriter} for codecs which have to keep state in order to
				318	make encoding/decoding efficient.
				319
				320	The encoder must be able to handle zero length input and return an
				321	empty object of the output object type in this situation.
				322	\end{methoddesc}
				323
				324	\begin{methoddesc}{decode}{input\optional{, errors}}
				325	Decodes the object \var{input} and returns a tuple (output object,
Skip Montanaro	6c7bc31	2002-04-16 15:12:10 +0000	[diff] [blame]	326	length consumed). In a Unicode context, decoding converts a plain string
				327	encoded using a particular character set encoding to a Unicode object.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	328
				329	\var{input} must be an object which provides the \code{bf_getreadbuf}
				330	buffer slot. Python strings, buffer objects and memory mapped files
				331	are examples of objects providing this slot.
				332
				333	\var{errors} defines the error handling to apply. It defaults to
				334	\code{'strict'} handling.
				335
				336	The method may not store state in the \class{Codec} instance. Use
				337	\class{StreamReader} for codecs which have to keep state in order to
				338	make encoding/decoding efficient.
				339
				340	The decoder must be able to handle zero length input and return an
				341	empty object of the output object type in this situation.
				342	\end{methoddesc}
				343
Thomas Wouters	a977329	2006-04-21 09:43:23 +0000	[diff] [blame^]	344	The \class{IncrementalEncoder} and \class{IncrementalDecoder} classes provide
				345	the basic interface for incremental encoding and decoding. Encoding/decoding the
				346	input isn't done with one call to the stateless encoder/decoder function,
				347	but with multiple calls to the \method{encode}/\method{decode} method of the
				348	incremental encoder/decoder. The incremental encoder/decoder keeps track of
				349	the encoding/decoding process during method calls.
				350
				351	The joined output of calls to the \method{encode}/\method{decode} method is the
				352	same as if all the single inputs were joined into one, and this input was
				353	encoded/decoded with the stateless encoder/decoder.
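The equivalence stated above can be checked with a stateful codec. The sketch below assumes Python 3 and uses the standard \code{utf-16} codec, whose incremental encoder emits the byte order mark only once.

```python
import codecs

pieces = ["Py", "th", "on"]

# Encode piece by piece; the encoder remembers that the BOM was written.
encoder = codecs.getincrementalencoder("utf-16")()
incremental = b"".join(encoder.encode(piece) for piece in pieces)
incremental += encoder.encode("", final=True)

# Encode the joined input in one call with the stateless encoder.
one_shot, consumed = codecs.getencoder("utf-16")("".join(pieces))
```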
				354
				355
				356	\subsubsection{IncrementalEncoder Objects \label{incremental-encoder-objects}}
				357
				358	The \class{IncrementalEncoder} class is used for encoding an input in multiple
				359	steps. It defines the following methods which every incremental encoder must
				360	define in order to be compatible with the Python codec registry.
				361
				362	\begin{classdesc}{IncrementalEncoder}{\optional{errors}}
				363	Constructor for an \class{IncrementalEncoder} instance.
				364
				365	All incremental encoders must provide this constructor interface. They are
				366	free to add additional keyword arguments, but only the ones defined
				367	here are used by the Python codec registry.
				368
				369	The \class{IncrementalEncoder} may implement different error handling
				370	schemes by providing the \var{errors} keyword argument. These
				371	parameters are predefined:
				372
				373	\begin{itemize}
				374	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
				375	this is the default.
				376	\item \code{'ignore'} Ignore the character and continue with the next.
				377	\item \code{'replace'} Replace with a suitable replacement character.
				378	\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
				379	character reference.
				380	\item \code{'backslashreplace'} Replace with backslashed escape sequences.
				381	\end{itemize}
				382
				383	The \var{errors} argument will be assigned to an attribute of the
				384	same name. Assigning to this attribute makes it possible to switch
				385	between different error handling strategies during the lifetime
				386	of the \class{IncrementalEncoder} object.
				387
				388	The set of allowed values for the \var{errors} argument can
				389	be extended with \function{register_error()}.
				390	\end{classdesc}
				391
				392	\begin{methoddesc}{encode}{object\optional{, final}}
				393	Encodes \var{object} (taking the current state of the encoder into account)
				394	and returns the resulting encoded object. If this is the last call to
				395	\method{encode}, \var{final} must be true (the default is false).
				396	\end{methoddesc}
				397
				398	\begin{methoddesc}{reset}{}
				399	Reset the encoder to the initial state.
				400	\end{methoddesc}
				401
				402
				403	\subsubsection{IncrementalDecoder Objects \label{incremental-decoder-objects}}
				404
				405	The \class{IncrementalDecoder} class is used for decoding an input in multiple
				406	steps. It defines the following methods which every incremental decoder must
				407	define in order to be compatible with the Python codec registry.
				408
				409	\begin{classdesc}{IncrementalDecoder}{\optional{errors}}
				410	Constructor for an \class{IncrementalDecoder} instance.
				411
				412	All incremental decoders must provide this constructor interface. They are
				413	free to add additional keyword arguments, but only the ones defined
				414	here are used by the Python codec registry.
				415
				416	The \class{IncrementalDecoder} may implement different error handling
				417	schemes by providing the \var{errors} keyword argument. These
				418	parameters are predefined:
				419
				420	\begin{itemize}
				421	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
				422	this is the default.
				423	\item \code{'ignore'} Ignore the character and continue with the next.
				424	\item \code{'replace'} Replace with a suitable replacement character.
				425	\end{itemize}
				426
				427	The \var{errors} argument will be assigned to an attribute of the
				428	same name. Assigning to this attribute makes it possible to switch
				429	between different error handling strategies during the lifetime
				430	of the \class{IncrementalDecoder} object.
				431
				432	The set of allowed values for the \var{errors} argument can
				433	be extended with \function{register_error()}.
				434	\end{classdesc}
				435
				436	\begin{methoddesc}{decode}{object\optional{, final}}
				437	Decodes \var{object} (taking the current state of the decoder into account)
				438	and returns the resulting decoded object. If this is the last call to
				439	\method{decode}, \var{final} must be true (the default is false).
				440	\end{methoddesc}
				441
				442	\begin{methoddesc}{reset}{}
				443	Reset the decoder to the initial state.
				444	\end{methoddesc}
				445
				446
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	447	The \class{StreamWriter} and \class{StreamReader} classes provide
				448	generic working interfaces which can be used to implement new
				449	encoding submodules very easily. See \module{encodings.utf_8} for an
				450	example on how this is done.
				451
				452
				453	\subsubsection{StreamWriter Objects \label{stream-writer-objects}}
				454
				455	The \class{StreamWriter} class is a subclass of \class{Codec} and
				456	defines the following methods which every stream writer must define in
				457	order to be compatible with the Python codec registry.
				458
				459	\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
				460	Constructor for a \class{StreamWriter} instance.
				461
				462	All stream writers must provide this constructor interface. They are
				463	free to add additional keyword arguments, but only the ones defined
				464	here are used by the Python codec registry.
				465
				466	\var{stream} must be a file-like object open for writing (binary)
				467	data.
				468
				469	The \class{StreamWriter} may implement different error handling
				470	schemes by providing the \var{errors} keyword argument. These
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	471	parameters are predefined:
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	472
				473	\begin{itemize}
				474	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
				475	this is the default.
				476	\item \code{'ignore'} Ignore the character and continue with the next.
				477	\item \code{'replace'} Replace with a suitable replacement character.
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	478	\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
				479	character reference.
				480	\item \code{'backslashreplace'} Replace with backslashed escape sequences.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	481	\end{itemize}
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	482
				483	The \var{errors} argument will be assigned to an attribute of the
				484	same name. Assigning to this attribute makes it possible to switch
				485	between different error handling strategies during the lifetime
				486	of the \class{StreamWriter} object.
				487
				488	The set of allowed values for the \var{errors} argument can
				489	be extended with \function{register_error()}.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	490	\end{classdesc}
				491
				492	\begin{methoddesc}{write}{object}
				493	Writes the object's contents encoded to the stream.
				494	\end{methoddesc}
				495
				496	\begin{methoddesc}{writelines}{list}
				497	Writes the concatenated list of strings to the stream (possibly by
				498	reusing the \method{write()} method).
				499	\end{methoddesc}
				500
				501	\begin{methoddesc}{reset}{}
				502	Flushes and resets the codec buffers used for keeping state.
				503
				504	Calling this method should ensure that the data on the output is put
				505	into a clean state, that allows appending of new fresh data without
				506	having to rescan the whole stream to recover state.
				507	\end{methoddesc}
				508
				509	In addition to the above methods, the \class{StreamWriter} must also
				510	inherit all other methods and attributes from the underlying stream.
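The sketch below, assuming Python 3 and the standard \code{ascii} codec, illustrates switching the error handling during the lifetime of a stream writer by assigning to its \code{errors} attribute.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw, errors="strict")

# With 'strict' handling, an unencodable character raises an exception.
try:
    writer.write("\u00f1")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# Switch strategy on the same object, then keep writing.
writer.errors = "replace"
writer.write("a\u00f1b")
writer.writelines(["c", "d"])

written = raw.getvalue()
```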
				511
				512
				513	\subsubsection{StreamReader Objects \label{stream-reader-objects}}
				514
				515	The \class{StreamReader} class is a subclass of \class{Codec} and
				516	defines the following methods which every stream reader must define in
				517	order to be compatible with the Python codec registry.
				518
				519	\begin{classdesc}{StreamReader}{stream\optional{, errors}}
				520	Constructor for a \class{StreamReader} instance.
				521
				522	All stream readers must provide this constructor interface. They are
				523	free to add additional keyword arguments, but only the ones defined
				524	here are used by the Python codec registry.
				525
				526	\var{stream} must be a file-like object open for reading (binary)
				527	data.
				528
				529	The \class{StreamReader} may implement different error handling
				530	schemes by providing the \var{errors} keyword argument. These
				531	parameters are defined:
				532
				533	\begin{itemize}
				534	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
				535	this is the default.
				536	\item \code{'ignore'} Ignore the character and continue with the next.
				537	\item \code{'replace'} Replace with a suitable replacement character.
				538	\end{itemize}
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	539
				540	The \var{errors} argument will be assigned to an attribute of the
				541	same name. Assigning to this attribute makes it possible to switch
				542	between different error handling strategies during the lifetime
				543	of the \class{StreamReader} object.
				544
				545	The set of allowed values for the \var{errors} argument can
				546	be extended with \function{register_error()}.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	547	\end{classdesc}
				548
Martin v. Löwis	56066d2	2005-08-24 07:38:12 +0000	[diff] [blame]	549	\begin{methoddesc}{read}{\optional{size\optional{, chars\optional{, firstline}}}}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	550	Decodes data from the stream and returns the resulting object.
				551
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	552	\var{chars} indicates the number of characters to read from the
Fred Drake	a2544ee	2004-09-10 01:16:49 +0000	[diff] [blame]	553	stream. \function{read()} will never return more than \var{chars}
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	554	characters, but it might return fewer if there are not enough
				555	characters available.
				556
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	557	\var{size} indicates the approximate maximum number of bytes to read
				558	from the stream for decoding purposes. The decoder can modify this
				559	setting as appropriate. The default value -1 indicates to read and
				560	decode as much as possible. \var{size} is intended to prevent having
				561	to decode huge files in one step.
				562
Martin v. Löwis	56066d2	2005-08-24 07:38:12 +0000	[diff] [blame]	563	\var{firstline} indicates that it would be sufficient to only return
				564	the first line, if there are decoding errors on later lines.
				565
The method should use a greedy read strategy, meaning that it should
read as much data as is allowed within the definition of the encoding
and the given size; e.g.\ if optional encoding endings or state
markers are available on the stream, these should be read too.
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	570
				571	\versionchanged[\var{chars} argument added]{2.4}
Martin v. Löwis	56066d2	2005-08-24 07:38:12 +0000	[diff] [blame]	572	\versionchanged[\var{firstline} argument added]{2.4.2}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	573	\end{methoddesc}
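The effect of \var{chars} can be sketched as follows. The example assumes a current Python 3 interpreter and an in-memory \code{io.BytesIO} stream; the sample text is made up for illustration.

```python
import codecs
import io

data = "h\u00e9llo w\u00f6rld".encode("utf-8")   # 13 bytes, 11 characters
reader = codecs.getreader("utf-8")(io.BytesIO(data))

head = reader.read(chars=5)   # at most five decoded characters
rest = reader.read()          # whatever is left on the stream
```

Note that \var{chars} counts decoded characters, not bytes: the first five characters occupy six bytes in this sample.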
				574
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	575	\begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	576	Read one line from the input stream and return the
				577	decoded data.
				578
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	579	\var{size}, if given, is passed as size argument to the stream's
				580	\method{readline()} method.
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	581
If \var{keepends} is false, line endings will be stripped from the
lines returned.
				584
				585	\versionchanged[\var{keepends} argument added]{2.4}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	586	\end{methoddesc}
				587
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	588	\begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
Read all lines available on the input stream and return them as a list
of lines.
				591
				592	Line breaks are implemented using the codec's decoder method and are
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	593	included in the list entries if \var{keepends} is true.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	594
				595	\var{sizehint}, if given, is passed as \var{size} argument to the
				596	stream's \method{read()} method.
				597	\end{methoddesc}
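A short sketch of \method{readline()} and \method{readlines()} with the \var{keepends} argument, assuming a current Python 3 interpreter and an in-memory stream with made-up content:

```python
import codecs
import io

raw = "first\nsecond\nthird\n".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(raw))

with_end = reader.readline()                    # line ending kept (default)
without_end = reader.readline(keepends=False)   # line ending stripped
remaining = reader.readlines()                  # list of the lines still unread
```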
				598
				599	\begin{methoddesc}{reset}{}
				600	Resets the codec buffers used for keeping state.
				601
				602	Note that no stream repositioning should take place. This method is
				603	primarily intended to be able to recover from decoding errors.
				604	\end{methoddesc}
				605
In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.
				608
The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.
				611
				612
				613	\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
				614
				615	The \class{StreamReaderWriter} allows wrapping streams which work in
				616	both read and write modes.
				617
				618	The design is such that one can use the factory functions returned by
				619	the \function{lookup()} function to construct the instance.
				620
				621	\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
				622	Creates a \class{StreamReaderWriter} instance.
				623	\var{stream} must be a file-like object.
\var{Reader} and \var{Writer} must be factory functions or classes
providing the \class{StreamReader} and \class{StreamWriter} interface,
respectively.
				627	Error handling is done in the same way as defined for the
				628	stream readers and writers.
				629	\end{classdesc}
				630
\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
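The construction from \function{lookup()} results can be sketched as follows. The example assumes a current Python 3 interpreter, where \function{lookup()} returns a \class{CodecInfo} object with \code{streamreader} and \code{streamwriter} attributes, and it uses an in-memory \code{io.BytesIO} object as the underlying stream.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()
stream = codecs.StreamReaderWriter(raw, info.streamreader,
                                   info.streamwriter, "strict")

stream.write("gr\u00fc\u00df")   # encoded by the StreamWriter
encoded = raw.getvalue()         # bytes that reached the underlying stream

raw.seek(0)                      # rewind the underlying stream
decoded = stream.read()          # decoded by the StreamReader
```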
				634
				635
				636	\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
				637
The \class{StreamRecoder} provides a frontend-backend view of
encoding data, which is sometimes useful when dealing with different
encoding environments.
				641
				642	The design is such that one can use the factory functions returned by
				643	the \function{lookup()} function to construct the instance.
				644
				645	\begin{classdesc}{StreamRecoder}{stream, encode, decode,
				646	Reader, Writer, errors}
				647	Creates a \class{StreamRecoder} instance which implements a two-way
				648	conversion: \var{encode} and \var{decode} work on the frontend (the
				649	input to \method{read()} and output of \method{write()}) while
				650	\var{Reader} and \var{Writer} work on the backend (reading and
				651	writing to the stream).
				652
				653	You can use these objects to do transparent direct recodings from
				654	e.g.\ Latin-1 to UTF-8 and back.
				655
				656	\var{stream} must be a file-like object.
				657
\var{encode} and \var{decode} must adhere to the \class{Codec}
interface; \var{Reader} and \var{Writer} must be factory functions or
classes providing objects of the \class{StreamReader} and
\class{StreamWriter} interface respectively.
				662
\var{encode} and \var{decode} are needed for the frontend
translation, \var{Reader} and \var{Writer} for the backend
translation. The intermediate format used is determined by the two
sets of codecs, e.g.\ the Unicode codecs will use Unicode as the
intermediate encoding.
				668
				669	Error handling is done in the same way as defined for the
				670	stream readers and writers.
				671	\end{classdesc}
				672
\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
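A minimal sketch of a Latin-1 frontend over a UTF-8 backend follows. It assumes a current Python 3 interpreter, takes all four callables from \function{lookup()} results, and stores the backend data in an in-memory \code{io.BytesIO} object; the variable names are chosen for illustration only.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend codec
utf8 = codecs.lookup("utf-8")       # backend codec
backend = io.BytesIO()

recoder = codecs.StreamRecoder(backend, latin1.encode, latin1.decode,
                               utf8.streamreader, utf8.streamwriter,
                               "strict")

recoder.write(b"caf\xe9")       # Latin-1 bytes go in at the frontend
stored = backend.getvalue()     # UTF-8 bytes arrive at the backend

backend.seek(0)
returned = recoder.read()       # Latin-1 bytes come back out
```

With Unicode codecs on both sides, the data passes through a Unicode string between the frontend and the backend, as described above.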
				676
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	677	\subsection{Encodings and Unicode\label{encodings-overview}}
				678
Unicode strings are stored internally as sequences of codepoints (to
be precise, as \ctype{Py_UNICODE} arrays). Depending on the way Python is
compiled (either via \longprogramopt{enable-unicode=ucs2} or
\longprogramopt{enable-unicode=ucs4}, with the former being the default),
\ctype{Py_UNICODE} is either a 16-bit or 32-bit data type. Once a Unicode
object is used outside of CPU and memory, CPU endianness and how these
arrays are stored as bytes become an issue. Transforming a Unicode object
into a sequence of bytes is called encoding, and recreating the Unicode
object from the sequence of bytes is known as decoding. There are many
different ways in which this transformation can be done (these methods
are also called encodings). The simplest method is to map the codepoints
0-255 to the bytes \code{0x0}-\code{0xff}. This means that a Unicode
object that contains codepoints above \code{U+00FF} can't be encoded with
this method (which is called \code{'latin-1'} or \code{'iso-8859-1'}).
\method{unicode.encode()} will raise a \exception{UnicodeEncodeError} that
looks like this: \samp{UnicodeEncodeError: 'latin-1' codec can't encode
character u'\e u1234' in position 3: ordinal not in range(256)}.
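The Latin-1 limitation can be reproduced with a short sketch. It assumes a current Python 3 interpreter, where text strings are Unicode by default and the \code{u} prefix shown in the message above is no longer printed.

```python
ok = "caf\u00e9".encode("latin-1")     # U+00E9 fits into a single byte

try:
    "abc\u1234".encode("latin-1")      # U+1234 lies above U+00FF
    error = None
except UnicodeEncodeError as exc:
    error = exc                        # keep the exception for inspection
```

The exception object records the failing position in its \code{start} attribute and the explanation in its \code{reason} attribute.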
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	697
There's another group of encodings (the so-called charmap encodings)
that choose a different subset of all Unicode code points and how
these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
To see how this is done, simply open e.g.\ \file{encodings/cp1252.py}
(which is an encoding that is used primarily on Windows).
There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
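The mapping can also be inspected from Python. The sketch below assumes a current Python 3 interpreter and that the 256-character constant is exposed as \code{decoding_table} in \module{encodings.cp1252}, which is an implementation detail rather than a documented interface.

```python
import encodings.cp1252

table = encodings.cp1252.decoding_table   # one character per byte value
euro_byte = "\u20ac".encode("cp1252")     # EURO SIGN as encoded by cp1252
as_latin1 = euro_byte.decode("latin-1")   # same byte, different character
```

The same byte value decodes to the EURO SIGN under cp1252 but to a control character under Latin-1, which is the difference between the two charmaps.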
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	705
All of these encodings can only encode 256 of the 65536 (or 1114112)
codepoints defined in Unicode. A simple and straightforward way to
store each Unicode code point is to store each codepoint as two
consecutive bytes. There are two possibilities: Store the bytes in big
endian or in little endian order. These two encodings are called
UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
e.g.\ you use UTF-16-BE on a little endian machine you will always have
to swap bytes on encoding and decoding. UTF-16 avoids this problem:
Bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, the bytes have to be swapped,
though. To be able to detect the endianness of a UTF-16 byte sequence,
there's the so-called BOM (the ``Byte Order Mark''). This is the Unicode
character \code{U+FEFF}. This character will be prepended to every UTF-16
byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
an illegal character that may not appear in a Unicode text. So when
the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE},
the bytes have to be swapped on decoding. Unfortunately, up to Unicode
4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
NO-BREAK SPACE}: A character that has no width and doesn't allow a
word to be split. It can e.g.\ be used to give hints to a ligature
algorithm. With Unicode 4.0, using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
this role). Nevertheless, Unicode software still must be able to handle
\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
layout of the encoded bytes, and vanishes once the byte sequence has
been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
it's a normal character that will be decoded like any other.
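The two roles of \code{U+FEFF} can be sketched with the constants \code{codecs.BOM_UTF16_BE} and \code{codecs.BOM_UTF16_LE}. The example assumes a current Python 3 interpreter; the sample text is arbitrary.

```python
import codecs

text = "hi"
big = text.encode("utf-16-be")      # no BOM, most significant byte first
little = text.encode("utf-16-le")   # no BOM, least significant byte first
native = text.encode("utf-16")      # BOM followed by native byte order

# A leading BOM lets the generic decoder pick the right byte order ...
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# ... whereas a fixed-order codec treats U+FEFF as an ordinary character.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")
```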
				733
There's another encoding that is able to encode the full range of
Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
there are no issues with byte order in UTF-8. Each byte in a UTF-8
byte sequence consists of two parts: Marker bits (the most significant
bits) and payload bits. The marker bits are a sequence of zero to six
1 bits followed by a 0 bit. Unicode characters are encoded like this
(with x being payload bits, which when concatenated give the Unicode
character):
				742
Walter Dörwald	b075fce	2006-02-21 18:51:32 +0000	[diff] [blame]	743	\begin{tableii}{l\|l}{textrm}{Range}{Encoding}
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	744	\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
				745	\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
				746	\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
				747	\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
				748	\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
				749	\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	750	\end{tableii}
				751
				752	The least significant bit of the Unicode character is the rightmost x
				753	bit.
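The bit layout in the table can be turned into a small encoder for a single code point. This is a sketch for illustration, assuming a current Python 3 interpreter; the function name \code{encode_code_point} is hypothetical, and the built-in codec should be used in real code.

```python
def encode_code_point(cp):
    """Encode one code point following the marker/payload table above."""
    if cp < 0x80:
        return bytes([cp])              # 0xxxxxxx
    if cp < 0x800:
        tail_len, lead = 1, 0xC0        # 110xxxxx
    elif cp < 0x10000:
        tail_len, lead = 2, 0xE0        # 1110xxxx
    elif cp < 0x200000:
        tail_len, lead = 3, 0xF0        # 11110xxx
    elif cp < 0x4000000:
        tail_len, lead = 4, 0xF8        # 111110xx
    elif cp < 0x80000000:
        tail_len, lead = 5, 0xFC        # 1111110x
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(tail_len):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx, lowest six bits first
        cp >>= 6
    # The remaining high bits go into the lead byte; the tail was collected
    # from least to most significant, so it is reversed.
    return bytes([lead | cp] + tail[::-1])

samples = [0x41, 0xE9, 0x20AC, 0x1F600]
encoded = [encode_code_point(cp) for cp in samples]
```

For these samples the result can be compared against \code{chr(cp).encode('utf-8')}; the built-in codec itself rejects values above \code{U+10FFFF}, so the two longest forms in the table cannot be checked that way.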
				754
As UTF-8 is an 8-bit encoding, no BOM is required and any \code{U+FEFF}
character in the decoded Unicode string (even if it's the first
character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	758
Without external information it's impossible to reliably determine
which encoding was used for encoding a Unicode string. Each charmap
encoding can decode any random byte sequence. However, that's not
possible with UTF-8, as UTF-8 byte sequences have a structure that
doesn't allow arbitrary byte sequences. To increase the reliability
with which a UTF-8 encoding can be detected, Microsoft invented a
variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
program: Before any of the Unicode characters is written to the file,
a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
charmap encoded file starts with these byte values (which would e.g.\ map to

   LATIN SMALL LETTER I WITH DIAERESIS \\
   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
   INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig
encoding can be correctly guessed from the byte sequence. So here the
BOM is not used to be able to determine the byte order used for
generating the byte sequence, but as a signature that helps in
guessing the encoding. On encoding, the utf-8-sig codec will write
\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
On decoding, utf-8-sig will skip those three bytes if they appear as the
first three bytes in the file.
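The signature handling can be sketched as follows, assuming a current Python 3 interpreter; \code{codecs.BOM_UTF8} holds the three signature bytes.

```python
import codecs

signed = "abc".encode("utf-8-sig")          # signature followed by UTF-8 data
plain = "abc".encode("utf-8")               # no signature

stripped = signed.decode("utf-8-sig")       # signature is skipped
tolerant = plain.decode("utf-8-sig")        # no signature: decoded as is
kept = signed.decode("utf-8")               # U+FEFF stays in the text
misread = signed[:3].decode("iso-8859-1")   # the three characters named above
```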
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	783
				784
Skip Montanaro	ecf7a52	2004-07-01 19:26:04 +0000	[diff] [blame]	785	\subsection{Standard Encodings\label{standard-encodings}}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	786
Python comes with a number of codecs built in, either implemented as C
functions, or with dictionaries as mapping tables. The following table
lists the codecs by name, together with a few common aliases, and the
languages for which the encoding is likely to be used. Neither the list
of aliases nor the list of languages is meant to be exhaustive. Notice
that spelling alternatives that only differ in case or use a hyphen
instead of an underscore are also valid aliases.
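The alias rule can be checked with \function{codecs.lookup()}. The sketch below assumes a current Python 3 interpreter and compares the canonical \code{name} reported for several spellings of the Latin-1 codec; the list of spellings is chosen for illustration.

```python
import codecs

spellings = ["latin_1", "latin-1", "Latin1", "ISO-8859-1",
             "iso8859-1", "L1", "cp819"]

# Every spelling should resolve to the same codec, so the set of
# canonical names collapses to a single entry.
names = {codecs.lookup(spelling).name for spelling in spellings}
```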
				794
				795	Many of the character sets support the same languages. They vary in
				796	individual characters (e.g. whether the EURO SIGN is supported or
				797	not), and in the assignment of characters to code positions. For the
				798	European languages in particular, the following variants typically
				799	exist:
				800
				801	\begin{itemize}
				802	\item an ISO 8859 codeset
\item a Microsoft Windows code page, which is typically derived from
an 8859 codeset, but replaces control characters with additional
graphic characters
				806	\item an IBM EBCDIC code page
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	807	\item an IBM PC code page, which is \ASCII{} compatible
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	808	\end{itemize}
				809
				810	\begin{longtableiii}{l\|l\|l}{textrm}{Codec}{Aliases}{Languages}
				811
				812	\lineiii{ascii}
				813	{646, us-ascii}
				814	{English}
				815
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	816	\lineiii{big5}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	817	{big5-tw, csbig5}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	818	{Traditional Chinese}
				819
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	820	\lineiii{big5hkscs}
				821	{big5-hkscs, hkscs}
				822	{Traditional Chinese}
				823
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	824	\lineiii{cp037}
				825	{IBM037, IBM039}
				826	{English}
				827
				828	\lineiii{cp424}
				829	{EBCDIC-CP-HE, IBM424}
				830	{Hebrew}
				831
				832	\lineiii{cp437}
				833	{437, IBM437}
				834	{English}
				835
				836	\lineiii{cp500}
				837	{EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
				838	{Western Europe}
				839
				840	\lineiii{cp737}
				841	{}
				842	{Greek}
				843
				844	\lineiii{cp775}
				845	{IBM775}
				846	{Baltic languages}
				847
				848	\lineiii{cp850}
				849	{850, IBM850}
				850	{Western Europe}
				851
				852	\lineiii{cp852}
				853	{852, IBM852}
				854	{Central and Eastern Europe}
				855
				856	\lineiii{cp855}
				857	{855, IBM855}
				858	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				859
				860	\lineiii{cp856}
				861	{}
				862	{Hebrew}
				863
				864	\lineiii{cp857}
				865	{857, IBM857}
				866	{Turkish}
				867
				868	\lineiii{cp860}
				869	{860, IBM860}
				870	{Portuguese}
				871
				872	\lineiii{cp861}
				873	{861, CP-IS, IBM861}
				874	{Icelandic}
				875
				876	\lineiii{cp862}
				877	{862, IBM862}
				878	{Hebrew}
				879
				880	\lineiii{cp863}
				881	{863, IBM863}
				882	{Canadian}
				883
				884	\lineiii{cp864}
				885	{IBM864}
				886	{Arabic}
				887
				888	\lineiii{cp865}
				889	{865, IBM865}
				890	{Danish, Norwegian}
				891
Skip Montanaro	78bace7	2004-07-02 02:14:34 +0000	[diff] [blame]	892	\lineiii{cp866}
				893	{866, IBM866}
				894	{Russian}
				895
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	896	\lineiii{cp869}
				897	{869, CP-GR, IBM869}
				898	{Greek}
				899
				900	\lineiii{cp874}
				901	{}
				902	{Thai}
				903
				904	\lineiii{cp875}
				905	{}
				906	{Greek}
				907
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	908	\lineiii{cp932}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	909	{932, ms932, mskanji, ms-kanji}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	910	{Japanese}
				911
				912	\lineiii{cp949}
				913	{949, ms949, uhc}
				914	{Korean}
				915
				916	\lineiii{cp950}
				917	{950, ms950}
				918	{Traditional Chinese}
				919
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	920	\lineiii{cp1006}
				921	{}
				922	{Urdu}
				923
				924	\lineiii{cp1026}
				925	{ibm1026}
				926	{Turkish}
				927
				928	\lineiii{cp1140}
				929	{ibm1140}
				930	{Western Europe}
				931
				932	\lineiii{cp1250}
				933	{windows-1250}
				934	{Central and Eastern Europe}
				935
				936	\lineiii{cp1251}
				937	{windows-1251}
				938	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				939
				940	\lineiii{cp1252}
				941	{windows-1252}
				942	{Western Europe}
				943
				944	\lineiii{cp1253}
				945	{windows-1253}
				946	{Greek}
				947
				948	\lineiii{cp1254}
				949	{windows-1254}
				950	{Turkish}
				951
				952	\lineiii{cp1255}
				953	{windows-1255}
				954	{Hebrew}
				955
				956	\lineiii{cp1256}
{windows-1256}
				958	{Arabic}
				959
				960	\lineiii{cp1257}
				961	{windows-1257}
				962	{Baltic languages}
				963
				964	\lineiii{cp1258}
				965	{windows-1258}
				966	{Vietnamese}
				967
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	968	\lineiii{euc_jp}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	969	{eucjp, ujis, u-jis}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	970	{Japanese}
				971
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	972	\lineiii{euc_jis_2004}
				973	{jisx0213, eucjis2004}
				974	{Japanese}
				975
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	976	\lineiii{euc_jisx0213}
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	977	{eucjisx0213}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	978	{Japanese}
				979
				980	\lineiii{euc_kr}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	981	{euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	982	{Korean}
				983
				984	\lineiii{gb2312}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	985	{chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
				986	gb2312-80, iso-ir-58}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	987	{Simplified Chinese}
				988
				989	\lineiii{gbk}
				990	{936, cp936, ms936}
				991	{Unified Chinese}
				992
				993	\lineiii{gb18030}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	994	{gb18030-2000}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	995	{Unified Chinese}
				996
				997	\lineiii{hz}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	998	{hzgb, hz-gb, hz-gb-2312}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	999	{Simplified Chinese}
				1000
				1001	\lineiii{iso2022_jp}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	1002	{csiso2022jp, iso2022jp, iso-2022-jp}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1003	{Japanese}
				1004
				1005	\lineiii{iso2022_jp_1}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	1006	{iso2022jp-1, iso-2022-jp-1}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1007	{Japanese}
				1008
				1009	\lineiii{iso2022_jp_2}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	1010	{iso2022jp-2, iso-2022-jp-2}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1011	{Japanese, Korean, Simplified Chinese, Western Europe, Greek}
				1012
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	1013	\lineiii{iso2022_jp_2004}
				1014	{iso2022jp-2004, iso-2022-jp-2004}
				1015	{Japanese}
				1016
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1017	\lineiii{iso2022_jp_3}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	1018	{iso2022jp-3, iso-2022-jp-3}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1019	{Japanese}
				1020
				1021	\lineiii{iso2022_jp_ext}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	1022	{iso2022jp-ext, iso-2022-jp-ext}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1023	{Japanese}
				1024
				1025	\lineiii{iso2022_kr}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	1026	{csiso2022kr, iso2022kr, iso-2022-kr}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1027	{Korean}
				1028
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1029	\lineiii{latin_1}
				1030	{iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
{Western Europe}
				1032
				1033	\lineiii{iso8859_2}
				1034	{iso-8859-2, latin2, L2}
				1035	{Central and Eastern Europe}
				1036
				1037	\lineiii{iso8859_3}
				1038	{iso-8859-3, latin3, L3}
				1039	{Esperanto, Maltese}
				1040
				1041	\lineiii{iso8859_4}
				1042	{iso-8859-4, latin4, L4}
{Baltic languages}
				1044
				1045	\lineiii{iso8859_5}
				1046	{iso-8859-5, cyrillic}
				1047	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				1048
				1049	\lineiii{iso8859_6}
				1050	{iso-8859-6, arabic}
				1051	{Arabic}
				1052
				1053	\lineiii{iso8859_7}
				1054	{iso-8859-7, greek, greek8}
				1055	{Greek}
				1056
				1057	\lineiii{iso8859_8}
				1058	{iso-8859-8, hebrew}
				1059	{Hebrew}
				1060
				1061	\lineiii{iso8859_9}
				1062	{iso-8859-9, latin5, L5}
				1063	{Turkish}
				1064
				1065	\lineiii{iso8859_10}
				1066	{iso-8859-10, latin6, L6}
				1067	{Nordic languages}
				1068
				1069	\lineiii{iso8859_13}
				1070	{iso-8859-13}
				1071	{Baltic languages}
				1072
				1073	\lineiii{iso8859_14}
				1074	{iso-8859-14, latin8, L8}
				1075	{Celtic languages}
				1076
				1077	\lineiii{iso8859_15}
				1078	{iso-8859-15}
				1079	{Western Europe}
				1080
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1081	\lineiii{johab}
				1082	{cp1361, ms1361}
				1083	{Korean}
				1084
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1085	\lineiii{koi8_r}
				1086	{}
				1087	{Russian}
				1088
				1089	\lineiii{koi8_u}
				1090	{}
				1091	{Ukrainian}
				1092
				1093	\lineiii{mac_cyrillic}
				1094	{maccyrillic}
				1095	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				1096
				1097	\lineiii{mac_greek}
				1098	{macgreek}
				1099	{Greek}
				1100
				1101	\lineiii{mac_iceland}
				1102	{maciceland}
				1103	{Icelandic}
				1104
				1105	\lineiii{mac_latin2}
				1106	{maclatin2, maccentraleurope}
				1107	{Central and Eastern Europe}
				1108
				1109	\lineiii{mac_roman}
				1110	{macroman}
				1111	{Western Europe}
				1112
				1113	\lineiii{mac_turkish}
				1114	{macturkish}
				1115	{Turkish}
				1116
Hye-Shik Chang	5c5316f	2004-03-19 08:06:07 +0000	[diff] [blame]	1117	\lineiii{ptcp154}
				1118	{csptcp154, pt154, cp154, cyrillic-asian}
				1119	{Kazakh}
				1120
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1121	\lineiii{shift_jis}
				1122	{csshiftjis, shiftjis, sjis, s_jis}
				1123	{Japanese}
				1124
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	1125	\lineiii{shift_jis_2004}
				1126	{shiftjis2004, sjis_2004, sjis2004}
				1127	{Japanese}
				1128
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	1129	\lineiii{shift_jisx0213}
				1130	{shiftjisx0213, sjisx0213, s_jisx0213}
				1131	{Japanese}
				1132
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1133	\lineiii{utf_16}
				1134	{U16, utf16}
				1135	{all languages}
				1136
				1137	\lineiii{utf_16_be}
				1138	{UTF-16BE}
				1139	{all languages (BMP only)}
				1140
				1141	\lineiii{utf_16_le}
				1142	{UTF-16LE}
				1143	{all languages (BMP only)}
				1144
				1145	\lineiii{utf_7}
Walter Dörwald	007f8df	2005-10-09 19:42:27 +0000	[diff] [blame]	1146	{U7, unicode-1-1-utf-7}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1147	{all languages}
				1148
				1149	\lineiii{utf_8}
				1150	{U8, UTF, utf8}
				1151	{all languages}
				1152
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	1153	\lineiii{utf_8_sig}
				1154	{}
				1155	{all languages}
				1156
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1157	\end{longtableiii}
				1158
				1159	A number of codecs are specific to Python, so their codec names have
				1160	no meaning outside Python. Some of them don't convert from Unicode
				1161	strings to byte strings, but instead use the property of the Python
				1162	codecs machinery that any bijective function with one argument can be
				1163	considered as an encoding.
				1164
				1165	For the codecs listed below, the result in the ``encoding'' direction
				1166	is always a byte string. The result of the ``decoding'' direction is
				1167	listed as operand type in the table.
				1168
				1169	\begin{tableiv}{l\|l\|l\|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}
				1170
				1171	\lineiv{base64_codec}
				1172	{base64, base-64}
				1173	{byte string}
				1174	{Convert operand to MIME base64}
				1175
Raymond Hettinger	9a80c5d	2003-09-23 20:21:01 +0000	[diff] [blame]	1176	\lineiv{bz2_codec}
				1177	{bz2}
				1178	{byte string}
				1179	{Compress the operand using bz2}
				1180
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1181	\lineiv{hex_codec}
				1182	{hex}
				1183	{byte string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1184	{Convert operand to hexadecimal representation, with two
				1185	digits per byte}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1186
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1187	\lineiv{idna}
				1188	{}
				1189	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1190	{Implements \rfc{3490}.
Raymond Hettinger	aa1178b	2003-09-01 23:13:04 +0000	[diff] [blame]	1191	\versionadded{2.3}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1192	See also \refmodule{encodings.idna}}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1193
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1194	\lineiv{mbcs}
				1195	{dbcs}
				1196	{Unicode string}
				1197	{Windows only: Encode operand according to the ANSI codepage (CP_ACP)}
				1198
				1199	\lineiv{palmos}
				1200	{}
				1201	{Unicode string}
				1202	{Encoding of PalmOS 3.5}
				1203
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1204	\lineiv{punycode}
				1205	{}
				1206	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1207	{Implements \rfc{3492}.
				1208	\versionadded{2.3}}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1209
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1210	\lineiv{quopri_codec}
				1211	{quopri, quoted-printable, quotedprintable}
				1212	{byte string}
				1213	{Convert operand to MIME quoted printable}
				1214
				1215	\lineiv{raw_unicode_escape}
				1216	{}
				1217	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1218	{Produce a string that is suitable as raw Unicode literal in
				1219	Python source code}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1220
				1221	\lineiv{rot_13}
				1222	{rot13}
				1223	{byte string}
				1224	{Returns the Caesar-cypher encryption of the operand}
				1225
				1226	\lineiv{string_escape}
				1227	{}
				1228	{byte string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1229	{Produce a string that is suitable as string literal in
				1230	Python source code}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1231
				1232	\lineiv{undefined}
				1233	{}
				1234	{any}
Georg Brandl	8f4b4db	2006-03-09 10:16:42 +0000	[diff] [blame]	1235	{Raise an exception for all conversions. Can be used as the
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1236	system encoding if no automatic coercion between byte and
				1237	Unicode strings is desired.}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1238
				1239	\lineiv{unicode_escape}
				1240	{}
				1241	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1242	{Produce a string that is suitable as Unicode literal in
				1243	Python source code}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1244
				1245	\lineiv{unicode_internal}
				1246	{}
				1247	{Unicode string}
Raymond Hettinger	6880431	2005-01-01 00:28:46 +0000	[diff] [blame]	1248	{Return the internal representation of the operand}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1249
				1250	\lineiv{uu_codec}
				1251	{uu}
				1252	{byte string}
				1253	{Convert the operand using uuencode}
				1254
				1255	\lineiv{zlib_codec}
				1256	{zip, zlib}
				1257	{byte string}
				1258	{Compress the operand using zlib}
				1259
				1260	\end{tableiv}
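
A few of the codecs in the table above can be exercised as in the
following rough sketch. It assumes a current Python 3 interpreter, where
byte-to-byte and text-to-text codecs are reached through
\code{codecs.encode()} and \code{codecs.decode()} rather than through
the string \code{encode()} method. The sample strings are illustrative
only.

```python
import codecs
import zlib

# punycode: text in, ASCII bytes out (RFC 3492).
puny = "b\u00fccher".encode("punycode")

# unicode_escape: text in, bytes usable inside a Python string literal.
escaped = "caf\u00e9".encode("unicode_escape")

# zlib_codec: bytes in, zlib-compressed bytes out, and back again.
payload = b"spam " * 20
packed = codecs.encode(payload, "zlib_codec")
unpacked = codecs.decode(packed, "zlib_codec")

# rot_13: in Python 3 this is a text-to-text codec.
rotated = codecs.encode("Hello", "rot_13")

print(puny, escaped, len(packed), rotated)
```

The output of \code{zlib_codec} can be read back with
\code{zlib.decompress()}, which is a quick way to check what a codec
actually produces.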
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1261
				1262	\subsection{\module{encodings.idna} ---
				1263	Internationalized Domain Names in Applications}
				1264
				1265	\declaremodule{standard}{encodings.idna}
				1266	\modulesynopsis{Internationalized Domain Names implementation}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1267	% XXX The next line triggers a formatting bug, so it's commented out
				1268	% until that can be fixed.
				1269	%\moduleauthor{Martin v. L\"owis}
				1270
				1271	\versionadded{2.3}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1272
				1273	This module implements \rfc{3490} (Internationalized Domain Names in
				1274	Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
				1275	Internationalized Domain Names (IDN)). It builds upon the
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1276	\code{punycode} encoding and \refmodule{stringprep}.
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1277
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1278	These RFCs together define a protocol to support non-\ASCII{} characters
				1279	in domain names. A domain name containing non-\ASCII{} characters (such
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1280	as ``www.Alliancefran\c caise.nu'') is converted into an
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1281	\ASCII-compatible encoding (ACE, such as
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1282	``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
				1283	is then used in all places where arbitrary characters are not allowed
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1284	by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1285	on. This conversion is carried out in the application, if possible
				1286	invisibly to the user: the application should transparently convert
				1287	Unicode domain labels to their ACE form on the wire, and convert ACE
				1288	labels back to Unicode before presenting them to the user.
				1289
				1290	Python supports this conversion in several ways. The \code{idna} codec
				1291	converts between Unicode and the ACE form. Furthermore, the
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1292	\refmodule{socket} module transparently converts Unicode host names to
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1293	ACE, so that applications need not be concerned about converting host
				1294	names themselves when they pass them to the socket module. On top of
				1295	that, modules that have host names as function parameters, such as
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1296	\refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
				1297	(\refmodule{httplib} then also transparently sends an IDNA hostname in
				1298	the \mailheader{Host} field if it sends that field at all).
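
As a minimal sketch of the codec route, the host name from the example
above can be converted to its ACE form with the \code{idna} codec. The
snippet assumes a current Python 3 interpreter, where the result of
encoding is a bytes object.

```python
# Host name taken from the example in the text above.
host = "www.Alliancefran\u00e7aise.nu"

# Each dot-separated label is converted on its own; a label with
# non-ASCII characters is nameprepped, Punycode-encoded, and given
# the "xn--" ACE prefix.
ace = host.encode("idna")

print(ace)
```

Labels that are already pure \ASCII{} (here \code{www} and \code{nu})
pass through unchanged.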
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1299
				1300	When receiving host names from the wire (such as in reverse name
				1301	lookup), no automatic conversion to Unicode is performed: Applications
				1302	wishing to present such host names to the user should decode them to
				1303	Unicode.
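
One way an application might do this decoding is sketched below. The
helper name \code{display\_hostname} and its fallback behaviour are
assumptions made for illustration, not part of any module. The sketch
targets Python 3, where names read from the wire are bytes.

```python
def display_hostname(wire_name):
    """Return a host name suitable for display (hypothetical helper).

    wire_name is the ACE form as bytes, e.g. from a reverse lookup.
    """
    try:
        # The idna codec turns "xn--" labels back into Unicode.
        return wire_name.decode("idna")
    except UnicodeError:
        # Assumed policy: show malformed names in a lossy ASCII form
        # instead of failing.
        return wire_name.decode("ascii", "replace")


print(display_hostname(b"www.xn--alliancefranaise-npb.nu"))
print(display_hostname(b"www.python.org"))
```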
				1304
				1305	The module \module{encodings.idna} also implements the nameprep
				1306	procedure, which performs certain normalizations on host names, to
				1307	achieve case-insensitivity of international domain names, and to unify
				1308	similar characters. The nameprep functions can be used directly if
				1309	desired.
				1310
				1311	\begin{funcdesc}{nameprep}{label}
				1312	Return the nameprepped version of \var{label}. The implementation
				1313	currently assumes query strings, so \code{AllowUnassigned} is
				1314	true.
				1315	\end{funcdesc}
				1316
Raymond Hettinger	b5155e3	2003-06-18 01:58:31 +0000	[diff] [blame]	1317	\begin{funcdesc}{ToASCII}{label}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1318	Convert a label to \ASCII, as specified in \rfc{3490}.
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1319	\code{UseSTD3ASCIIRules} is assumed to be false.
				1320	\end{funcdesc}
				1321
				1322	\begin{funcdesc}{ToUnicode}{label}
				1323	Convert a label to Unicode, as specified in \rfc{3490}.
				1324	\end{funcdesc}
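
The three functions above operate on single labels, not on complete
host names. The following sketch shows them on one label, assuming a
current Python 3 interpreter (where \function{ToASCII} returns bytes).
The label is only an example.

```python
from encodings import idna

label = "Alliancefran\u00e7aise"

# nameprep normalizes the label; among other things it folds case.
prepped = idna.nameprep(label)

# ToASCII yields the ACE form of the label, including the "xn--" prefix.
ace_label = idna.ToASCII(label)

# ToUnicode reverses the conversion and returns the nameprepped label.
restored = idna.ToUnicode(ace_label)

print(prepped, ace_label, restored)
```

Note that the round trip returns the nameprepped (lower-case) label,
not the original spelling.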
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	1325
				1326	\subsection{\module{encodings.utf_8_sig} ---
				1327	UTF-8 codec with BOM signature}
				1328	\declaremodule{standard}{encodings.utf-8-sig} % XXX utf_8_sig gives TeX errors
				1329	\modulesynopsis{UTF-8 codec with BOM signature}
				1330	\moduleauthor{Walter D\"orwald}
				1331
				1332	\versionadded{2.5}
				1333
				1334	This module implements a variant of the UTF-8 codec. On encoding, a
				1335	UTF-8 encoded BOM is prepended to the UTF-8 encoded bytes. For
				1336	the stateful encoder this is done only once (on the first write to the
				1337	byte stream). On decoding, an optional UTF-8 encoded BOM at the start
				1338	of the data is skipped.
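
The behaviour described above can be observed with a short sketch,
assuming a current Python 3 interpreter. The sample text and the
in-memory byte stream are illustrative choices.

```python
import codecs
import io

text = "caf\u00e9"

# Stateless encoding: the BOM is prepended to the UTF-8 bytes.
encoded = text.encode("utf-8-sig")

# Decoding skips the BOM if present and accepts data without one.
decoded_with_bom = encoded.decode("utf-8-sig")
decoded_without_bom = text.encode("utf-8").decode("utf-8-sig")

# Stateful encoding: the stream writer emits the BOM on the first
# write only.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buffer)
writer.write("first ")
writer.write("second")
stream_bytes = buffer.getvalue()

print(encoded, stream_bytes)
```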