Blame - Doc/lib/libcodecs.tex - platform/external/python/cpython3

Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	1	\section{\module{codecs} ---
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	2	Codec registry and base classes}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	3
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	4	\declaremodule{standard}{codecs}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	5	\modulesynopsis{Encode and decode data and streams.}
				6	\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
				7	\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	8	\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	9
				10	\index{Unicode}
				11	\index{Codecs}
				12	\indexii{Codecs}{encode}
				13	\indexii{Codecs}{decode}
				14	\index{streams}
				15	\indexii{stackable}{streams}
				16
				17
				18	This module defines base classes for standard Python codecs (encoders
				19	and decoders) and provides access to the internal Python codec
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	20	registry which manages the codec and error handling lookup process.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	21
				22	It defines the following functions:
				23
				24	\begin{funcdesc}{register}{search_function}
				25	Register a codec search function. Search functions are expected to
				26	take one argument, the encoding name in all lower case letters, and
				27	return a tuple of functions \code{(\var{encoder}, \var{decoder}, \var{stream_reader},
				28	\var{stream_writer})} taking the following arguments:
				29
				30	\var{encoder} and \var{decoder}: These must be functions or methods
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	31	which have the same interface as the
				32	\method{encode()}/\method{decode()} methods of Codec instances (see
				33	Codec Interface). The functions/methods are expected to work in a
				34	stateless mode.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	35
				36	\var{stream_reader} and \var{stream_writer}: These have to be
				37	factory functions providing the following interface:
				38
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	39	\code{factory(\var{stream}, \var{errors}='strict')}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	40
				41	The factory functions must return objects providing the interfaces
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	42	defined by the base classes \class{StreamWriter} and
				43	\class{StreamReader}, respectively. Stream codecs can maintain
				44	state.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	45
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	46	Possible values for errors are \code{'strict'} (raise an exception
				47	in case of an encoding error), \code{'replace'} (replace malformed
Walter Dörwald	72f8616	2002-11-19 21:51:35 +0000	[diff] [blame]	48	data with a suitable replacement marker, such as \character{?}),
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	49	\code{'ignore'} (ignore malformed data and continue without further
Walter Dörwald	72f8616	2002-11-19 21:51:35 +0000	[diff] [blame]	50	notice), \code{'xmlcharrefreplace'} (replace with the appropriate XML
				51	character reference (for encoding only)) and \code{'backslashreplace'}
				52	(replace with backslashed escape sequences (for encoding only)) as
				53	well as any other error handling name defined via
				54	\function{register_error()}.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	55
				56	In case a search function cannot find a given encoding, it should
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	57	return \code{None}.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	58	\end{funcdesc}
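The search-function contract can be sketched as follows. This is an illustration only: the encoding name \code{demo_latin1} and the helper \function{demo_search()} are made up for the example, and the code assumes a current Python 3 interpreter, where the object returned by \function{lookup()} is a tuple subclass rather than a bare 4-tuple.

```python
import codecs

def demo_search(name):
    """Hypothetical search function; `name` arrives in lower case."""
    if name == "demo_latin1":
        # Reuse an existing codec instead of writing new encoder and
        # decoder functions; the registry only needs the codec tuple.
        return codecs.lookup("latin-1")
    # Unknown encoding: return None so the registry keeps looking.
    return None

codecs.register(demo_search)

encoded = "caf\u00e9".encode("demo_latin1")
decoded = encoded.decode("demo_latin1")
```

Returning \code{None} for every other name matters: the registry consults each registered search function in turn, so a function that raised for unknown names would break unrelated lookups.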
				59
				60	\begin{funcdesc}{lookup}{encoding}
				61	Looks up a codec tuple in the Python codec registry and returns the
				62	function tuple as defined above.
				63
				64	Encodings are first looked up in the registry's cache. If not found,
				65	the list of registered search functions is scanned. If no codecs tuple
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	66	is found, a \exception{LookupError} is raised. Otherwise, the codecs
				67	tuple is stored in the cache and returned to the caller.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	68	\end{funcdesc}
				69
Skip Montanaro	b02ea65	2002-04-17 19:33:06 +0000	[diff] [blame]	70	To simplify access to the various codecs, the module provides these
Marc-André Lemburg	494f2ae	2001-09-19 11:33:31 +0000	[diff] [blame]	71	additional functions which use \function{lookup()} for the codec
				72	lookup:
				73
				74	\begin{funcdesc}{getencoder}{encoding}
Look up the codec for the given encoding and return its encoder
function.
				77
				78	Raises a \exception{LookupError} in case the encoding cannot be found.
				79	\end{funcdesc}
				80
				81	\begin{funcdesc}{getdecoder}{encoding}
Look up the codec for the given encoding and return its decoder
function.
				84
				85	Raises a \exception{LookupError} in case the encoding cannot be found.
				86	\end{funcdesc}
				87
				88	\begin{funcdesc}{getreader}{encoding}
Look up the codec for the given encoding and return its StreamReader
class or factory function.
				91
				92	Raises a \exception{LookupError} in case the encoding cannot be found.
				93	\end{funcdesc}
				94
				95	\begin{funcdesc}{getwriter}{encoding}
Look up the codec for the given encoding and return its StreamWriter
class or factory function.
				98
				99	Raises a \exception{LookupError} in case the encoding cannot be found.
				100	\end{funcdesc}
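A short sketch of the four accessor functions, assuming a Python 3 interpreter (where the encoded form is a \code{bytes} object) and using \code{io.BytesIO} as a stand-in for a real binary stream:

```python
import codecs
import io

encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# Both functions return a tuple (output object, length consumed).
data, chars_consumed = encode("h\u00e9")
text, bytes_consumed = decode(data)

# The reader and writer accessors return factories that wrap a stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("h\u00e9")
reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()

# An unknown encoding name raises LookupError.
try:
    codecs.getencoder("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```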
				101
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	102	\begin{funcdesc}{register_error}{name, error_handler}
				103	Register the error handling function \var{error_handler} under the
Raymond Hettinger	8a64d40	2002-09-08 22:26:13 +0000	[diff] [blame]	104	name \var{name}. \var{error_handler} will be called during encoding
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	105	and decoding in case of an error, when \var{name} is specified as the
Walter Dörwald	2e0b18a	2003-01-31 17:19:08 +0000	[diff] [blame]	106	errors parameter.
				107
				108	For encoding \var{error_handler} will be called with a
				109	\exception{UnicodeEncodeError} instance, which contains information about
				110	the location of the error. The error handler must either raise this or
				111	a different exception or return a tuple with a replacement for the
				112	unencodable part of the input and a position where encoding should
				113	continue. The encoder will encode the replacement and continue encoding
the original input at the specified position. Negative position values
will be treated as being relative to the end of the input string. If the
resulting position is out of bounds, an \exception{IndexError} will be raised.
				117
Decoding and translating work similarly, except that
\exception{UnicodeDecodeError} or \exception{UnicodeTranslateError} will be
passed to the handler and that the replacement from the error handler will
be put into the output directly.
Walter Dörwald	3aeb632	2002-09-02 13:14:32 +0000	[diff] [blame]	122	\end{funcdesc}
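A minimal sketch of a custom error handler, assuming a Python 3 interpreter. The handler name \code{demo_underscore} and the function \function{underscore_errors()} are invented for this example; only \function{register_error()} and the exception attributes \code{start} and \code{end} come from the library.

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement and the position where encoding resumes.
        return ("_" * (exc.end - exc.start), exc.end)
    # Decoding and translating errors are not handled in this sketch.
    raise exc

codecs.register_error("demo_underscore", underscore_errors)

result = "a\u20acb".encode("ascii", "demo_underscore")
```

The replacement string is itself passed through the encoder, so it has to be encodable in the target encoding.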
				123
				124	\begin{funcdesc}{lookup_error}{name}
Return the error handler previously registered under the name \var{name}.
				126
				127	Raises a \exception{LookupError} in case the handler cannot be found.
				128	\end{funcdesc}
				129
				130	\begin{funcdesc}{strict_errors}{exception}
				131	Implements the \code{strict} error handling.
				132	\end{funcdesc}
				133
				134	\begin{funcdesc}{replace_errors}{exception}
				135	Implements the \code{replace} error handling.
				136	\end{funcdesc}
				137
				138	\begin{funcdesc}{ignore_errors}{exception}
				139	Implements the \code{ignore} error handling.
				140	\end{funcdesc}
				141
\begin{funcdesc}{xmlcharrefreplace_errors}{exception}
				143	Implements the \code{xmlcharrefreplace} error handling.
				144	\end{funcdesc}
				145
\begin{funcdesc}{backslashreplace_errors}{exception}
				147	Implements the \code{backslashreplace} error handling.
				148	\end{funcdesc}
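The predefined handlers are ordinary callables and can be invoked directly with an exception instance, which makes their return values easy to inspect. The sketch below assumes Python 3 and builds a \exception{UnicodeEncodeError} by hand; the reason string \code{"demo failure"} is arbitrary.

```python
import codecs

# Pretend the character at index 1 of "a\u20ac" could not be encoded.
exc = UnicodeEncodeError("ascii", "a\u20ac", 1, 2, "demo failure")

# Each handler returns (replacement, position at which to resume).
replaced = codecs.replace_errors(exc)
ignored = codecs.ignore_errors(exc)
xml = codecs.xmlcharrefreplace_errors(exc)
escaped = codecs.backslashreplace_errors(exc)

# The strict handler simply raises the exception it was given.
try:
    codecs.strict_errors(exc)
    raised = False
except UnicodeEncodeError:
    raised = True

# lookup_error() returns the handler registered under a given name.
handler = codecs.lookup_error("replace")
```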
				149
To simplify working with encoded files or streams, the module
also defines these utility functions:
				152
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	153	\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
				154	errors\optional{, buffering}}}}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	155	Open an encoded file using the given \var{mode} and return
				156	a wrapped version providing transparent encoding/decoding.
				157
Fred Drake	0aa811c	2001-10-20 04:24:09 +0000	[diff] [blame]	158	\note{The wrapped version will only accept the object format
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	159	defined by the codecs, i.e.\ Unicode objects for most built-in
				160	codecs. Output is also codec-dependent and will usually be Unicode as
Fred Drake	0aa811c	2001-10-20 04:24:09 +0000	[diff] [blame]	161	well.}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	162
				163	\var{encoding} specifies the encoding which is to be used for the
Raymond Hettinger	7e43110	2003-09-22 15:00:55 +0000	[diff] [blame]	164	file.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	165
				166	\var{errors} may be given to define the error handling. It defaults
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	167	to \code{'strict'} which causes a \exception{ValueError} to be raised
				168	in case an encoding error occurs.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	169
Fred Drake	69ca950	2000-04-06 16:09:59 +0000	[diff] [blame]	170	\var{buffering} has the same meaning as for the built-in
				171	\function{open()} function. It defaults to line buffered.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	172	\end{funcdesc}
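A small round trip through \function{open()}, sketched for Python 3 and written to a throwaway temporary directory. Note that recent Python releases generally recommend the built-in \function{open()} with an \var{encoding} argument for this job, so the example is meant to illustrate the wrapper's behaviour rather than to suggest a preferred style.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # throwaway location

# Text written through the wrapper is encoded before it reaches the file.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("gr\u00fc\u00df\n")

# Reading through the wrapper decodes the bytes again.
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# The bytes on disk are the UTF-8 form of the text.
with open(path, "rb") as f:
    raw = f.read()
```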
				173
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	174	\begin{funcdesc}{EncodedFile}{file, input\optional{,
				175	output\optional{, errors}}}
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	176	Return a wrapped version of file which provides transparent
				177	encoding translation.
				178
				179	Strings written to the wrapped file are interpreted according to the
				180	given \var{input} encoding and then written to the original file as
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	181	strings using the \var{output} encoding. The intermediate encoding will
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	182	usually be Unicode but depends on the specified codecs.
				183
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	184	If \var{output} is not given, it defaults to \var{input}.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	185
				186	\var{errors} may be given to define the error handling. It defaults to
Fred Drake	e1b304d	2000-07-24 19:35:52 +0000	[diff] [blame]	187	\code{'strict'}, which causes \exception{ValueError} to be raised in case
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	188	an encoding error occurs.
				189	\end{funcdesc}
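The translation performed by \function{EncodedFile()} can be sketched as follows, assuming Python 3 and an in-memory \code{io.BytesIO} object in place of a real file. The two encoding arguments are passed positionally because their keyword names differ between Python versions.

```python
import codecs
import io

raw = io.BytesIO()

# Callers hand over Latin-1 bytes; the underlying file receives UTF-8.
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading reverses the translation: UTF-8 in the file, Latin-1 to the caller.
reader = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8")
returned = reader.read()
```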
				190
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	191	The module also provides the following constants which are useful
				192	for reading and writing to platform dependent files:
				193
				194	\begin{datadesc}{BOM}
				195	\dataline{BOM_BE}
				196	\dataline{BOM_LE}
Walter Dörwald	474458d	2002-06-04 15:16:29 +0000	[diff] [blame]	197	\dataline{BOM_UTF8}
				198	\dataline{BOM_UTF16}
				199	\dataline{BOM_UTF16_BE}
				200	\dataline{BOM_UTF16_LE}
				201	\dataline{BOM_UTF32}
				202	\dataline{BOM_UTF32_BE}
				203	\dataline{BOM_UTF32_LE}
				204	These constants define various encodings of the Unicode byte order mark
				205	(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
				206	used in the stream or file and in UTF-8 as a Unicode signature.
				207	\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
				208	\constant{BOM_UTF16_LE} depending on the platform's native byte order,
				209	\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
				210	for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
				211	The others represent the BOM in UTF-8 and UTF-32 encodings.
Fred Drake	b7979c7	2000-04-06 14:21:58 +0000	[diff] [blame]	212	\end{datadesc}
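One use of these constants is guessing an encoding from the first bytes of a file. The helper \function{sniff_bom()} below is hypothetical and only a heuristic: a UTF-16-LE stream whose first real character is U+0000 would be mistaken for UTF-32-LE.

```python
import codecs
import sys

def sniff_bom(data):
    """Hypothetical helper: name the encoding implied by a leading BOM."""
    # Check UTF-32 before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE.
    candidates = (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

# BOM_UTF16 follows the byte order of the machine running the code.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE
```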
				213
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	214
Walter Dörwald	d4bfe2c	2005-11-25 17:17:12 +0000	[diff] [blame]	215	\subsection{Codec Base Classes \label{codec-base-classes}}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	216
The \module{codecs} module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use
in Python.
				220
Each codec has to define four interfaces to make it usable as a codec in
Python: stateless encoder, stateless decoder, stream reader and stream
writer. The stream readers and writers typically reuse the stateless
encoder/decoder to implement the file protocols.
				225
				226	The \class{Codec} class defines the interface for stateless
				227	encoders/decoders.
				228
				229	To simplify and standardize error handling, the \method{encode()} and
				230	\method{decode()} methods may implement different error handling
				231	schemes by providing the \var{errors} string argument. The following
				232	string values are defined and implemented by all standard Python
				233	codecs:
				234
\begin{tableii}{l|l}{code}{Value}{Meaning}
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	236	\lineii{'strict'}{Raise \exception{UnicodeError} (or a subclass);
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	237	this is the default.}
				238	\lineii{'ignore'}{Ignore the character and continue with the next.}
				239	\lineii{'replace'}{Replace with a suitable replacement character;
				240	Python will use the official U+FFFD REPLACEMENT
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	241	CHARACTER for the built-in Unicode codecs on
				242	decoding and '?' on encoding.}
				243	\lineii{'xmlcharrefreplace'}{Replace with the appropriate XML
				244	character reference (only for encoding).}
				245	\lineii{'backslashreplace'}{Replace with backslashed escape sequences
				246	(only for encoding).}
Fred Drake	dc40ac0	2001-01-22 20:17:54 +0000	[diff] [blame]	247	\end{tableii}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	248
The set of allowed values can be extended via \function{register_error()}.
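The effect of each predefined value can be seen by encoding the same string to ASCII several times. The sketch assumes Python 3 string and bytes semantics.

```python
text = "a\u20acb"  # the middle character has no ASCII representation

results = {
    name: text.encode("ascii", name)
    for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

# On decoding, 'replace' inserts U+FFFD for each undecodable byte.
decoded = b"a\xffb".decode("ascii", "replace")

# The default, 'strict', raises a UnicodeError subclass.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeError:
    strict_failed = True
```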
				250
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	251
				252	\subsubsection{Codec Objects \label{codec-objects}}
				253
				254	The \class{Codec} class defines these methods which also define the
				255	function interfaces of the stateless encoder and decoder:
				256
				257	\begin{methoddesc}{encode}{input\optional{, errors}}
				258	Encodes the object \var{input} and returns a tuple (output object,
Skip Montanaro	6c7bc31	2002-04-16 15:12:10 +0000	[diff] [blame]	259	length consumed). While codecs are not restricted to use with Unicode, in
				260	a Unicode context, encoding converts a Unicode object to a plain string
				261	using a particular character set encoding (e.g., \code{cp1252} or
				262	\code{iso-8859-1}).
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	263
				264	\var{errors} defines the error handling to apply. It defaults to
				265	\code{'strict'} handling.
				266
The method must not store state in the \class{Codec} instance. Use
\class{StreamWriter} for codecs which have to keep state in order to
make encoding efficient.
				270
				271	The encoder must be able to handle zero length input and return an
				272	empty object of the output object type in this situation.
				273	\end{methoddesc}
				274
				275	\begin{methoddesc}{decode}{input\optional{, errors}}
				276	Decodes the object \var{input} and returns a tuple (output object,
Skip Montanaro	6c7bc31	2002-04-16 15:12:10 +0000	[diff] [blame]	277	length consumed). In a Unicode context, decoding converts a plain string
				278	encoded using a particular character set encoding to a Unicode object.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	279
				280	\var{input} must be an object which provides the \code{bf_getreadbuf}
				281	buffer slot. Python strings, buffer objects and memory mapped files
				282	are examples of objects providing this slot.
				283
				284	\var{errors} defines the error handling to apply. It defaults to
				285	\code{'strict'} handling.
				286
The method must not store state in the \class{Codec} instance. Use
\class{StreamReader} for codecs which have to keep state in order to
make decoding efficient.
				290
				291	The decoder must be able to handle zero length input and return an
				292	empty object of the output object type in this situation.
				293	\end{methoddesc}
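A stateless codec can be sketched by subclassing \class{Codec} and delegating to existing encoder and decoder functions. The class name \class{DemoCodec} is hypothetical, and the example assumes Python 3, where the encoded form is a \code{bytes} object.

```python
import codecs

class DemoCodec(codecs.Codec):
    """Hypothetical stateless codec that delegates to the Latin-1 functions."""

    def encode(self, input, errors="strict"):
        # Returns (encoded bytes, number of characters consumed).
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        # Returns (decoded text, number of bytes consumed).
        return codecs.latin_1_decode(input, errors)

codec = DemoCodec()
encoded, chars_consumed = codec.encode("caf\u00e9")
decoded, bytes_consumed = codec.decode(encoded)

# Zero-length input yields an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```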
				294
The \class{StreamWriter} and \class{StreamReader} classes provide
generic working interfaces which can be used to implement new
encoding submodules very easily. See \module{encodings.utf_8} for an
example of how this is done.
				299
				300
				301	\subsubsection{StreamWriter Objects \label{stream-writer-objects}}
				302
The \class{StreamWriter} class is a subclass of \class{Codec} and
defines the following methods which every stream writer must define in
order to be compatible with the Python codec registry.
				306
				307	\begin{classdesc}{StreamWriter}{stream\optional{, errors}}
				308	Constructor for a \class{StreamWriter} instance.
				309
				310	All stream writers must provide this constructor interface. They are
				311	free to add additional keyword arguments, but only the ones defined
				312	here are used by the Python codec registry.
				313
				314	\var{stream} must be a file-like object open for writing (binary)
				315	data.
				316
				317	The \class{StreamWriter} may implement different error handling
				318	schemes by providing the \var{errors} keyword argument. These
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	319	parameters are predefined:
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	320
				321	\begin{itemize}
				322	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
				323	this is the default.
				324	\item \code{'ignore'} Ignore the character and continue with the next.
\item \code{'replace'} Replace with a suitable replacement character.
\item \code{'xmlcharrefreplace'} Replace with the appropriate XML
character reference.
\item \code{'backslashreplace'} Replace with backslashed escape sequences.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	329	\end{itemize}
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	330
				331	The \var{errors} argument will be assigned to an attribute of the
				332	same name. Assigning to this attribute makes it possible to switch
				333	between different error handling strategies during the lifetime
				334	of the \class{StreamWriter} object.
				335
				336	The set of allowed values for the \var{errors} argument can
				337	be extended with \function{register_error()}.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	338	\end{classdesc}
				339
				340	\begin{methoddesc}{write}{object}
				341	Writes the object's contents encoded to the stream.
				342	\end{methoddesc}
				343
				344	\begin{methoddesc}{writelines}{list}
				345	Writes the concatenated list of strings to the stream (possibly by
				346	reusing the \method{write()} method).
				347	\end{methoddesc}
				348
				349	\begin{methoddesc}{reset}{}
				350	Flushes and resets the codec buffers used for keeping state.
				351
				352	Calling this method should ensure that the data on the output is put
				353	into a clean state, that allows appending of new fresh data without
				354	having to rescan the whole stream to recover state.
				355	\end{methoddesc}
				356
In addition to the above methods, the \class{StreamWriter} must also
inherit all other methods and attributes from the underlying stream.
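The following sketch exercises \method{write()}, \method{writelines()} and the \code{errors} attribute of a stream writer. It assumes Python 3 and uses \code{io.BytesIO} as the underlying binary stream.

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("ascii")(buffer, errors="replace")

# With 'replace', the unencodable character becomes '?'.
writer.write("a\u20ac")

# The error handling strategy can be switched on a live writer.
writer.errors = "xmlcharrefreplace"
writer.write("\u20ac")

# writelines() writes the concatenation of the given strings.
writer.writelines(["x", "y"])

output = buffer.getvalue()
```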
				359
				360
				361	\subsubsection{StreamReader Objects \label{stream-reader-objects}}
				362
The \class{StreamReader} class is a subclass of \class{Codec} and
defines the following methods which every stream reader must define in
order to be compatible with the Python codec registry.
				366
				367	\begin{classdesc}{StreamReader}{stream\optional{, errors}}
				368	Constructor for a \class{StreamReader} instance.
				369
				370	All stream readers must provide this constructor interface. They are
				371	free to add additional keyword arguments, but only the ones defined
				372	here are used by the Python codec registry.
				373
				374	\var{stream} must be a file-like object open for reading (binary)
				375	data.
				376
				377	The \class{StreamReader} may implement different error handling
				378	schemes by providing the \var{errors} keyword argument. These
				379	parameters are defined:
				380
				381	\begin{itemize}
				382	\item \code{'strict'} Raise \exception{ValueError} (or a subclass);
				383	this is the default.
				384	\item \code{'ignore'} Ignore the character and continue with the next.
				385	\item \code{'replace'} Replace with a suitable replacement character.
				386	\end{itemize}
Walter Dörwald	430b156	2002-11-07 22:33:17 +0000	[diff] [blame]	387
				388	The \var{errors} argument will be assigned to an attribute of the
				389	same name. Assigning to this attribute makes it possible to switch
				390	between different error handling strategies during the lifetime
				391	of the \class{StreamReader} object.
				392
				393	The set of allowed values for the \var{errors} argument can
				394	be extended with \function{register_error()}.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	395	\end{classdesc}
				396
\begin{methoddesc}{read}{\optional{size\optional{, chars\optional{, firstline}}}}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	398	Decodes data from the stream and returns the resulting object.
				399
\var{chars} indicates the number of characters to read from the
stream. \function{read()} will never return more than \var{chars}
characters, but it might return fewer if there are not enough
characters available.
				404
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	405	\var{size} indicates the approximate maximum number of bytes to read
				406	from the stream for decoding purposes. The decoder can modify this
				407	setting as appropriate. The default value -1 indicates to read and
				408	decode as much as possible. \var{size} is intended to prevent having
				409	to decode huge files in one step.
				410
Martin v. Löwis	56066d2	2005-08-24 07:38:12 +0000	[diff] [blame]	411	\var{firstline} indicates that it would be sufficient to only return
				412	the first line, if there are decoding errors on later lines.
				413
The method should use a greedy read strategy, meaning that it should
read as much data as is allowed within the definition of the encoding
and the given size, e.g.\ if optional encoding endings or state
markers are available on the stream, these should be read too.
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	418
				419	\versionchanged[\var{chars} argument added]{2.4}
Martin v. Löwis	56066d2	2005-08-24 07:38:12 +0000	[diff] [blame]	420	\versionchanged[\var{firstline} argument added]{2.4.2}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	421	\end{methoddesc}
				422
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	423	\begin{methoddesc}{readline}{\optional{size\optional{, keepends}}}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	424	Read one line from the input stream and return the
				425	decoded data.
				426
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	427	\var{size}, if given, is passed as size argument to the stream's
				428	\method{readline()} method.
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	429
If \var{keepends} is false, line endings will be stripped from the
lines returned.
				432
				433	\versionchanged[\var{keepends} argument added]{2.4}
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	434	\end{methoddesc}
				435
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	436	\begin{methoddesc}{readlines}{\optional{sizehint\optional{, keepends}}}
Read all lines available on the input stream and return them as a list
of lines.
				439
				440	Line breaks are implemented using the codec's decoder method and are
Walter Dörwald	6965203	2004-09-07 20:24:22 +0000	[diff] [blame]	441	included in the list entries if \var{keepends} is true.
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	442
				443	\var{sizehint}, if given, is passed as \var{size} argument to the
				444	stream's \method{read()} method.
				445	\end{methoddesc}
				446
				447	\begin{methoddesc}{reset}{}
				448	Resets the codec buffers used for keeping state.
				449
				450	Note that no stream repositioning should take place. This method is
				451	primarily intended to be able to recover from decoding errors.
				452	\end{methoddesc}
				453
In addition to the above methods, the \class{StreamReader} must also
inherit all other methods and attributes from the underlying stream.
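The reading methods can be tried out on an in-memory stream. This sketch assumes Python 3 and the reader returned by \function{getreader()} for UTF-8; the sample data is arbitrary.

```python
import codecs
import io

payload = "one\ntwo\nthree\n".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(payload))

first = reader.readline()                 # line ending kept by default
second = reader.readline(keepends=False)  # line ending stripped
rest = reader.readlines()                 # whatever is left, as a list

# chars limits the number of decoded characters returned by read().
chars_reader = codecs.getreader("utf-8")(io.BytesIO(b"h\xc3\xa9llo"))
head = chars_reader.read(chars=2)
tail = chars_reader.read()
```

The \var{chars} limit counts decoded characters, not bytes: the two characters returned here occupy three bytes in the stream.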
				456
The next two base classes are included for convenience. They are not
needed by the codec registry, but may prove useful in practice.
				459
				460
				461	\subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
				462
				463	The \class{StreamReaderWriter} allows wrapping streams which work in
				464	both read and write modes.
				465
				466	The design is such that one can use the factory functions returned by
				467	the \function{lookup()} function to construct the instance.
				468
				469	\begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
				470	Creates a \class{StreamReaderWriter} instance.
				471	\var{stream} must be a file-like object.
\var{Reader} and \var{Writer} must be factory functions or classes
providing the \class{StreamReader} and \class{StreamWriter} interface
respectively.
				475	Error handling is done in the same way as defined for the
				476	stream readers and writers.
				477	\end{classdesc}
				478
\class{StreamReaderWriter} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
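A \class{StreamReaderWriter} can be assembled by hand from the reader and writer factories of a codec. The sketch assumes Python 3 and rewinds the underlying \code{io.BytesIO} object directly before reading back what was written.

```python
import codecs
import io

stream = io.BytesIO()
wrapper = codecs.StreamReaderWriter(
    stream,
    codecs.getreader("utf-8"),
    codecs.getwriter("utf-8"),
    "strict",
)

wrapper.write("caf\u00e9\n")

# Rewind the underlying stream, then read through the same wrapper.
stream.seek(0)
line = wrapper.readline()
raw = stream.getvalue()
```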
				482
				483
				484	\subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
				485
The \class{StreamRecoder} provides a frontend-backend view of
encoding data which is sometimes useful when dealing with different
encoding environments.
				489
				490	The design is such that one can use the factory functions returned by
				491	the \function{lookup()} function to construct the instance.
				492
				493	\begin{classdesc}{StreamRecoder}{stream, encode, decode,
				494	Reader, Writer, errors}
Creates a \class{StreamRecoder} instance which implements a two-way
conversion: \var{encode} and \var{decode} work on the frontend (the
data seen by code calling \method{read()} and \method{write()}) while
\var{Reader} and \var{Writer} work on the backend (reading from and
writing to the stream).
				500
				501	You can use these objects to do transparent direct recodings from
				502	e.g.\ Latin-1 to UTF-8 and back.
				503
				504	\var{stream} must be a file-like object.
				505
				506	\var{encode}, \var{decode} must adhere to the \class{Codec}
				507	interface, \var{Reader}, \var{Writer} must be factory functions or
Raymond Hettinger	f17d65d	2003-08-12 00:01:16 +0000	[diff] [blame]	508	classes providing objects of the \class{StreamReader} and
Fred Drake	602aa77	2000-10-12 20:50:55 +0000	[diff] [blame]	509	\class{StreamWriter} interface respectively.
				510
				511	\var{encode} and \var{decode} are needed for the frontend
				512	translation, \var{Reader} and \var{Writer} for the backend
				513	translation. The intermediate format used is determined by the two
				514	sets of codecs, e.g. the Unicode codecs will use Unicode as
				515	intermediate encoding.
				516
				517	Error handling is done in the same way as defined for the
				518	stream readers and writers.
				519	\end{classdesc}
				520
\class{StreamRecoder} instances define the combined interfaces of
\class{StreamReader} and \class{StreamWriter} classes. They inherit
all other methods and attributes from the underlying stream.
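The frontend and backend roles can be sketched with a recoder whose callers speak UTF-8 while the stream holds Latin-1. The example assumes Python 3 and relies on the codec tuple returned by \function{lookup()} unpacking into encoder, decoder, reader and writer, in that order.

```python
import codecs
import io

# Frontend codec functions (UTF-8) and backend factories (Latin-1).
utf8_encode, utf8_decode, _, _ = codecs.lookup("utf-8")
_, _, Latin1Reader, Latin1Writer = codecs.lookup("latin-1")

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend, utf8_encode, utf8_decode, Latin1Reader, Latin1Writer, "strict"
)

# UTF-8 bytes go in at the frontend; Latin-1 bytes land in the stream.
recoder.write("caf\u00e9".encode("utf-8"))
stored = backend.getvalue()

# Reading converts the Latin-1 bytes in the stream back to UTF-8.
backend.seek(0)
returned = recoder.read()
```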
				524
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	525	\subsection{Encodings and Unicode\label{encodings-overview}}
				526
Unicode strings are stored internally as sequences of codepoints (to
be precise, as \ctype{Py_UNICODE} arrays). Depending on the way Python is
compiled (either via \longprogramopt{enable-unicode=ucs2} or
\longprogramopt{enable-unicode=ucs4}, with the former being the default),
\ctype{Py_UNICODE} is either a 16-bit or
32-bit data type. Once a Unicode object is used outside of CPU and
memory, CPU endianness and how these arrays are stored as bytes become
an issue. Transforming a Unicode object into a sequence of bytes is
called encoding, and recreating the Unicode object from the sequence of
bytes is known as decoding. There are many different ways this
transformation can be done (these methods are also called encodings).
The simplest method is to map the codepoints 0-255 to the bytes
\code{0x0}-\code{0xff}. This means that a Unicode object that contains
codepoints above \code{U+00FF} can't be encoded with this method (which
is called \code{'latin-1'} or \code{'iso-8859-1'}).
\method{unicode.encode()} will raise a \exception{UnicodeEncodeError}
that looks like this: \samp{UnicodeEncodeError:
'latin-1' codec can't encode character u'\e u1234' in position 3: ordinal
not in range(256)}.
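The Latin-1 limitation is easy to reproduce. The sketch below assumes Python 3, where every string is a Unicode object and the exact wording of the error message differs slightly from the one quoted above.

```python
# Codepoints up to U+00FF map directly to single bytes.
ok = "caf\u00e9".encode("latin-1")

# A codepoint above U+00FF cannot be represented.
try:
    "abc\u1234".encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # keep a reference for inspection after the handler ends
```

The exception records where the problem occurred: here \code{error.start} is 3, the index of the offending character.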
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	545
				546	There's another group of encodings (the so-called charmap encodings)
				547	that choose a different subset of all Unicode code points and how
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	548	these code points are mapped to the bytes \code{0x0}-\code{0xff}.
				549	To see how this is done simply open e.g. \file{encodings/cp1252.py}
				550	(which is an encoding that is used primarily on Windows).
				551	There's a string constant with 256 characters that shows you which
				552	character is mapped to which byte value.
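
Instead of opening the file, the table can also be inspected from
Python. The sketch below assumes that the installed
\module{encodings.cp1252} module exposes the string constant under the
name \code{decoding_table}; that name is an assumption about the
generated codec module and may differ between Python versions.

```python
import encodings.cp1252

# The module-level decoding table holds one character per byte value.
table = encodings.cp1252.decoding_table
table_size = len(table)

# Byte 0x80 is a control character in latin-1 but the EURO SIGN in cp1252.
as_cp1252 = b'\x80'.decode('cp1252')
as_latin1 = b'\x80'.decode('latin-1')
euro_from_table = table[0x80]
```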
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	553
				554	All of these encodings can only encode 256 of the 65536 (or 1114112)
				555	code points defined in Unicode. A simple and straightforward way to
				556	store every Unicode code point is to write each one as two
				557	consecutive bytes. There are two possibilities: Store the bytes in big
				558	endian or in little endian order. These two encodings are called
				559	UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
				560	e.g. you use UTF-16-BE on a little endian machine you will always have
				561	to swap bytes on encoding and decoding. UTF-16 avoids this problem:
				562	Bytes will always be in natural endianness. When these bytes are read
				563	by a CPU with a different endianness, then bytes have to be swapped
				564	though. To be able to detect the endianness of a UTF-16 byte sequence,
				565	there's the so-called BOM (the ``Byte Order Mark''). This is the Unicode
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	566	character \code{U+FEFF}. This character will be prepended to every UTF-16
				567	byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	568	an illegal character that may not appear in a Unicode text. So when
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	569	the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE},
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	570	the bytes have to be swapped on decoding. Unfortunately, up to Unicode
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	571	4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
				572	NO-BREAK SPACE}: A character that has no width and doesn't allow a
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	573	word to be split. It can e.g. be used to give hints to a ligature
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	574	algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
				575	SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
				576	this role). Nevertheless Unicode software still must be able to handle
				577	\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	578	layout of the encoded bytes, and vanishes once the byte sequence has
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	579	been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	580	it's a normal character that will be decoded like any other.
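
The difference between the three UTF-16 codecs, and the two roles of
\code{U+FEFF}, can be observed directly. This is a minimal sketch; the
sample string and variable names are chosen for illustration.

```python
import codecs

text = u'hi'

big = text.encode('utf-16-be')      # no BOM, most significant byte first
little = text.encode('utf-16-le')   # no BOM, least significant byte first
native = text.encode('utf-16')      # BOM first, then native byte order

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
decoded = (codecs.BOM_UTF16_BE + big).decode('utf-16')

# A codec with a fixed byte order treats U+FEFF as an ordinary character.
kept = (codecs.BOM_UTF16_BE + big).decode('utf-16-be')
```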
				581
				582	There's another encoding that is able to encode the full range of
				583	Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
				584	there are no issues with byte order in UTF-8. Each byte in a UTF-8
				585	byte sequence consists of two parts: Marker bits (the most significant
				586	bits) and payload bits. The marker bits are a sequence of zero to six
				587	1 bits followed by a 0 bit. Unicode characters are encoded like this
Walter Dörwald	b754fe4	2006-01-09 12:45:01 +0000	[diff] [blame]	588	(with x being payload bits, which when concatenated give the Unicode
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	589	character):
				590
				591	\begin{tableii}{l|l}{textrm}{}{Range}{Encoding}
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	592	\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
				593	\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
				594	\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
				595	\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
				596	\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
				597	\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	598	\end{tableii}
				599
				600	The least significant bit of the Unicode character is the rightmost x
				601	bit.
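
To make the bit layout concrete, the following sketch encodes a single
code point by hand using the first four rows of the table and can be
compared with the built-in codec. The function name
\code{utf8_encode_codepoint} is hypothetical, and the sketch stops at
\code{U+10FFFF} and does not reject surrogates, so it is an
illustration rather than a complete encoder.

```python
def utf8_encode_codepoint(cp):
    """Encode one code point by hand, following the first four table rows."""
    if cp < 0x80:
        lead, n_cont = cp, 0                   # 0xxxxxxx
    elif cp < 0x800:
        lead, n_cont = 0xC0 | (cp >> 6), 1     # 110xxxxx
    elif cp < 0x10000:
        lead, n_cont = 0xE0 | (cp >> 12), 2    # 1110xxxx
    elif cp < 0x110000:
        lead, n_cont = 0xF0 | (cp >> 18), 3    # 11110xxx
    else:
        raise ValueError('code point out of range')
    out = [lead]
    # Each continuation byte is 10xxxxxx and carries six payload bits,
    # most significant group first.
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(bytearray(out))


euro_by_hand = utf8_encode_codepoint(0x20AC)
euro_builtin = u'\u20ac'.encode('utf-8')
```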
				602
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	603	As UTF-8 is an 8-bit encoding, no BOM is required and any \code{U+FEFF}
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	604	character in the decoded Unicode string (even if it's the first
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	605	character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	606
				607	Without external information it's impossible to reliably determine
				608	which encoding was used for encoding a Unicode string. Each charmap
				609	encoding can decode any random byte sequence. However that's not
				610	possible with UTF-8, as UTF-8 byte sequences have a structure that
				611	doesn't allow arbitrary byte sequences. To increase the reliability
Walter Dörwald	b754fe4	2006-01-09 12:45:01 +0000	[diff] [blame]	612	with which a UTF-8 encoding can be detected, Microsoft invented a
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	613	variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	614	program: Before any of the Unicode characters is written to the file,
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	615	a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
				616	\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
				617	charmap encoded file starts with these byte values (which would e.g. map to
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	618
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	619	LATIN SMALL LETTER I WITH DIAERESIS \\
				620	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	621	INVERTED QUESTION MARK
				622
				623	in iso-8859-1), this increases the probability that a utf-8-sig
				624	encoding can be correctly guessed from the byte sequence. So here the
				625	BOM is not used to be able to determine the byte order used for
				626	generating the byte sequence, but as a signature that helps in
				627	guessing the encoding. On encoding the utf-8-sig codec will write
Georg Brandl	131e4f7	2006-01-23 21:33:48 +0000	[diff] [blame]	628	\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
				629	On decoding utf-8-sig will skip those three bytes if they appear as the
				630	first three bytes in the file.
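
The behaviour of the two codecs can be compared side by side. This is
a small sketch with an arbitrary sample string:

```python
import codecs

text = u'spam'

plain = text.encode('utf-8')
signed = text.encode('utf-8-sig')        # signature bytes come first

from_sig = signed.decode('utf-8-sig')    # signature is skipped
from_plain = plain.decode('utf-8-sig')   # signature is optional on input
from_utf8 = signed.decode('utf-8')       # U+FEFF survives as a character
```

Decoding signed data with the plain \code{utf-8} codec leaves a leading
\code{U+FEFF} in the result, which is why the signature-aware codec is
the safer choice when reading files of unknown origin.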
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	631
				632
Skip Montanaro	ecf7a52	2004-07-01 19:26:04 +0000	[diff] [blame]	633	\subsection{Standard Encodings\label{standard-encodings}}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	634
				635	Python comes with a number of built-in codecs, implemented either as C
				636	functions or with dictionaries as mapping tables. The following table
				637	lists the codecs by name, together with a few common aliases, and the
				638	languages for which the encoding is likely used. Neither the list of
				639	aliases nor the list of languages is meant to be exhaustive. Notice
				640	that spelling alternatives that only differ in case or use a hyphen
				641	instead of an underscore are also valid aliases.
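
As a quick check of the alias rules, the sketch below looks up several
spellings of one codec and collects the canonical name that each lookup
reports. It assumes that \function{codecs.lookup()} returns an object
with a \code{name} attribute, which holds for Python 2.5 and later; the
list of spellings is illustrative.

```python
import codecs

spellings = ['latin_1', 'Latin-1', 'LATIN1', 'iso-8859-1', 'ISO8859-1', 'L1']

# Every spelling should resolve to the same codec, hence the same name.
canonical_names = set(codecs.lookup(name).name for name in spellings)
```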
				642
				643	Many of the character sets support the same languages. They vary in
				644	individual characters (e.g. whether the EURO SIGN is supported or
				645	not), and in the assignment of characters to code positions. For the
				646	European languages in particular, the following variants typically
				647	exist:
				648
				649	\begin{itemize}
				650	\item an ISO 8859 codeset
				651	\item a Microsoft Windows code page, which is typically derived from
				652	an 8859 codeset, but replaces control characters with additional
				653	graphic characters
				654	\item an IBM EBCDIC code page
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	655	\item an IBM PC code page, which is \ASCII{} compatible
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	656	\end{itemize}
				657
				658	\begin{longtableiii}{l|l|l}{textrm}{Codec}{Aliases}{Languages}
				659
				660	\lineiii{ascii}
				661	{646, us-ascii}
				662	{English}
				663
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	664	\lineiii{big5}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	665	{big5-tw, csbig5}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	666	{Traditional Chinese}
				667
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	668	\lineiii{big5hkscs}
				669	{big5-hkscs, hkscs}
				670	{Traditional Chinese}
				671
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	672	\lineiii{cp037}
				673	{IBM037, IBM039}
				674	{English}
				675
				676	\lineiii{cp424}
				677	{EBCDIC-CP-HE, IBM424}
				678	{Hebrew}
				679
				680	\lineiii{cp437}
				681	{437, IBM437}
				682	{English}
				683
				684	\lineiii{cp500}
				685	{EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500}
				686	{Western Europe}
				687
				688	\lineiii{cp737}
				689	{}
				690	{Greek}
				691
				692	\lineiii{cp775}
				693	{IBM775}
				694	{Baltic languages}
				695
				696	\lineiii{cp850}
				697	{850, IBM850}
				698	{Western Europe}
				699
				700	\lineiii{cp852}
				701	{852, IBM852}
				702	{Central and Eastern Europe}
				703
				704	\lineiii{cp855}
				705	{855, IBM855}
				706	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				707
				708	\lineiii{cp856}
				709	{}
				710	{Hebrew}
				711
				712	\lineiii{cp857}
				713	{857, IBM857}
				714	{Turkish}
				715
				716	\lineiii{cp860}
				717	{860, IBM860}
				718	{Portuguese}
				719
				720	\lineiii{cp861}
				721	{861, CP-IS, IBM861}
				722	{Icelandic}
				723
				724	\lineiii{cp862}
				725	{862, IBM862}
				726	{Hebrew}
				727
				728	\lineiii{cp863}
				729	{863, IBM863}
				730	{Canadian}
				731
				732	\lineiii{cp864}
				733	{IBM864}
				734	{Arabic}
				735
				736	\lineiii{cp865}
				737	{865, IBM865}
				738	{Danish, Norwegian}
				739
Skip Montanaro	78bace7	2004-07-02 02:14:34 +0000	[diff] [blame]	740	\lineiii{cp866}
				741	{866, IBM866}
				742	{Russian}
				743
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	744	\lineiii{cp869}
				745	{869, CP-GR, IBM869}
				746	{Greek}
				747
				748	\lineiii{cp874}
				749	{}
				750	{Thai}
				751
				752	\lineiii{cp875}
				753	{}
				754	{Greek}
				755
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	756	\lineiii{cp932}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	757	{932, ms932, mskanji, ms-kanji}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	758	{Japanese}
				759
				760	\lineiii{cp949}
				761	{949, ms949, uhc}
				762	{Korean}
				763
				764	\lineiii{cp950}
				765	{950, ms950}
				766	{Traditional Chinese}
				767
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	768	\lineiii{cp1006}
				769	{}
				770	{Urdu}
				771
				772	\lineiii{cp1026}
				773	{ibm1026}
				774	{Turkish}
				775
				776	\lineiii{cp1140}
				777	{ibm1140}
				778	{Western Europe}
				779
				780	\lineiii{cp1250}
				781	{windows-1250}
				782	{Central and Eastern Europe}
				783
				784	\lineiii{cp1251}
				785	{windows-1251}
				786	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				787
				788	\lineiii{cp1252}
				789	{windows-1252}
				790	{Western Europe}
				791
				792	\lineiii{cp1253}
				793	{windows-1253}
				794	{Greek}
				795
				796	\lineiii{cp1254}
				797	{windows-1254}
				798	{Turkish}
				799
				800	\lineiii{cp1255}
				801	{windows-1255}
				802	{Hebrew}
				803
				804	\lineiii{cp1256}
				805	{windows-1256}
				806	{Arabic}
				807
				808	\lineiii{cp1257}
				809	{windows-1257}
				810	{Baltic languages}
				811
				812	\lineiii{cp1258}
				813	{windows-1258}
				814	{Vietnamese}
				815
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	816	\lineiii{euc_jp}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	817	{eucjp, ujis, u-jis}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	818	{Japanese}
				819
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	820	\lineiii{euc_jis_2004}
				821	{jisx0213, eucjis2004}
				822	{Japanese}
				823
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	824	\lineiii{euc_jisx0213}
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	825	{eucjisx0213}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	826	{Japanese}
				827
				828	\lineiii{euc_kr}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	829	{euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	830	{Korean}
				831
				832	\lineiii{gb2312}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	833	{chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980,
				834	gb2312-80, iso-ir-58}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	835	{Simplified Chinese}
				836
				837	\lineiii{gbk}
				838	{936, cp936, ms936}
				839	{Unified Chinese}
				840
				841	\lineiii{gb18030}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	842	{gb18030-2000}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	843	{Unified Chinese}
				844
				845	\lineiii{hz}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	846	{hzgb, hz-gb, hz-gb-2312}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	847	{Simplified Chinese}
				848
				849	\lineiii{iso2022_jp}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	850	{csiso2022jp, iso2022jp, iso-2022-jp}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	851	{Japanese}
				852
				853	\lineiii{iso2022_jp_1}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	854	{iso2022jp-1, iso-2022-jp-1}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	855	{Japanese}
				856
				857	\lineiii{iso2022_jp_2}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	858	{iso2022jp-2, iso-2022-jp-2}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	859	{Japanese, Korean, Simplified Chinese, Western Europe, Greek}
				860
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	861	\lineiii{iso2022_jp_2004}
				862	{iso2022jp-2004, iso-2022-jp-2004}
				863	{Japanese}
				864
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	865	\lineiii{iso2022_jp_3}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	866	{iso2022jp-3, iso-2022-jp-3}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	867	{Japanese}
				868
				869	\lineiii{iso2022_jp_ext}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	870	{iso2022jp-ext, iso-2022-jp-ext}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	871	{Japanese}
				872
				873	\lineiii{iso2022_kr}
Hye-Shik Chang	910d8f1	2004-07-17 14:44:43 +0000	[diff] [blame]	874	{csiso2022kr, iso2022kr, iso-2022-kr}
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	875	{Korean}
				876
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	877	\lineiii{latin_1}
				878	{iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1}
				879	{Western Europe}
				880
				881	\lineiii{iso8859_2}
				882	{iso-8859-2, latin2, L2}
				883	{Central and Eastern Europe}
				884
				885	\lineiii{iso8859_3}
				886	{iso-8859-3, latin3, L3}
				887	{Esperanto, Maltese}
				888
				889	\lineiii{iso8859_4}
				890	{iso-8859-4, latin4, L4}
				891	{Baltic languages}
				892
				893	\lineiii{iso8859_5}
				894	{iso-8859-5, cyrillic}
				895	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				896
				897	\lineiii{iso8859_6}
				898	{iso-8859-6, arabic}
				899	{Arabic}
				900
				901	\lineiii{iso8859_7}
				902	{iso-8859-7, greek, greek8}
				903	{Greek}
				904
				905	\lineiii{iso8859_8}
				906	{iso-8859-8, hebrew}
				907	{Hebrew}
				908
				909	\lineiii{iso8859_9}
				910	{iso-8859-9, latin5, L5}
				911	{Turkish}
				912
				913	\lineiii{iso8859_10}
				914	{iso-8859-10, latin6, L6}
				915	{Nordic languages}
				916
				917	\lineiii{iso8859_13}
				918	{iso-8859-13}
				919	{Baltic languages}
				920
				921	\lineiii{iso8859_14}
				922	{iso-8859-14, latin8, L8}
				923	{Celtic languages}
				924
				925	\lineiii{iso8859_15}
				926	{iso-8859-15}
				927	{Western Europe}
				928
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	929	\lineiii{johab}
				930	{cp1361, ms1361}
				931	{Korean}
				932
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	933	\lineiii{koi8_r}
				934	{}
				935	{Russian}
				936
				937	\lineiii{koi8_u}
				938	{}
				939	{Ukrainian}
				940
				941	\lineiii{mac_cyrillic}
				942	{maccyrillic}
				943	{Bulgarian, Byelorussian, Macedonian, Russian, Serbian}
				944
				945	\lineiii{mac_greek}
				946	{macgreek}
				947	{Greek}
				948
				949	\lineiii{mac_iceland}
				950	{maciceland}
				951	{Icelandic}
				952
				953	\lineiii{mac_latin2}
				954	{maclatin2, maccentraleurope}
				955	{Central and Eastern Europe}
				956
				957	\lineiii{mac_roman}
				958	{macroman}
				959	{Western Europe}
				960
				961	\lineiii{mac_turkish}
				962	{macturkish}
				963	{Turkish}
				964
Hye-Shik Chang	5c5316f	2004-03-19 08:06:07 +0000	[diff] [blame]	965	\lineiii{ptcp154}
				966	{csptcp154, pt154, cp154, cyrillic-asian}
				967	{Kazakh}
				968
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	969	\lineiii{shift_jis}
				970	{csshiftjis, shiftjis, sjis, s_jis}
				971	{Japanese}
				972
Hye-Shik Chang	2bb146f	2004-07-18 03:06:29 +0000	[diff] [blame]	973	\lineiii{shift_jis_2004}
				974	{shiftjis2004, sjis_2004, sjis2004}
				975	{Japanese}
				976
Hye-Shik Chang	3e2a306	2004-01-17 14:29:29 +0000	[diff] [blame]	977	\lineiii{shift_jisx0213}
				978	{shiftjisx0213, sjisx0213, s_jisx0213}
				979	{Japanese}
				980
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	981	\lineiii{utf_16}
				982	{U16, utf16}
				983	{all languages}
				984
				985	\lineiii{utf_16_be}
				986	{UTF-16BE}
				987	{all languages (BMP only)}
				988
				989	\lineiii{utf_16_le}
				990	{UTF-16LE}
				991	{all languages (BMP only)}
				992
				993	\lineiii{utf_7}
Walter Dörwald	007f8df	2005-10-09 19:42:27 +0000	[diff] [blame]	994	{U7, unicode-1-1-utf-7}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	995	{all languages}
				996
				997	\lineiii{utf_8}
				998	{U8, UTF, utf8}
				999	{all languages}
				1000
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	1001	\lineiii{utf_8_sig}
				1002	{}
				1003	{all languages}
				1004
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1005	\end{longtableiii}
				1006
				1007	A number of codecs are specific to Python, so their codec names have
				1008	no meaning outside Python. Some of them don't convert from Unicode
				1009	strings to byte strings, but instead use the property of the Python
				1010	codecs machinery that any bijective function with one argument can be
				1011	considered as an encoding.
				1012
				1013	For the codecs listed below, the result in the ``encoding'' direction
				1014	is always a byte string. The result of the ``decoding'' direction is
				1015	listed as operand type in the table.
				1016
				1017	\begin{tableiv}{l|l|l|l}{textrm}{Codec}{Aliases}{Operand type}{Purpose}
				1018
				1019	\lineiv{base64_codec}
				1020	{base64, base-64}
				1021	{byte string}
				1022	{Convert operand to MIME base64}
				1023
Raymond Hettinger	9a80c5d	2003-09-23 20:21:01 +0000	[diff] [blame]	1024	\lineiv{bz2_codec}
				1025	{bz2}
				1026	{byte string}
				1027	{Compress the operand using bz2}
				1028
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1029	\lineiv{hex_codec}
				1030	{hex}
				1031	{byte string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1032	{Convert operand to hexadecimal representation, with two
				1033	digits per byte}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1034
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1035	\lineiv{idna}
				1036	{}
				1037	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1038	{Implements \rfc{3490}.
Raymond Hettinger	aa1178b	2003-09-01 23:13:04 +0000	[diff] [blame]	1039	\versionadded{2.3}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1040	See also \refmodule{encodings.idna}}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1041
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1042	\lineiv{mbcs}
				1043	{dbcs}
				1044	{Unicode string}
				1045	{Windows only: Encode operand according to the ANSI codepage (CP_ACP)}
				1046
				1047	\lineiv{palmos}
				1048	{}
				1049	{Unicode string}
				1050	{Encoding of PalmOS 3.5}
				1051
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1052	\lineiv{punycode}
				1053	{}
				1054	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1055	{Implements \rfc{3492}.
				1056	\versionadded{2.3}}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1057
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1058	\lineiv{quopri_codec}
				1059	{quopri, quoted-printable, quotedprintable}
				1060	{byte string}
				1061	{Convert operand to MIME quoted printable}
				1062
				1063	\lineiv{raw_unicode_escape}
				1064	{}
				1065	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1066	{Produce a string that is suitable as raw Unicode literal in
				1067	Python source code}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1068
				1069	\lineiv{rot_13}
				1070	{rot13}
				1071	{byte string}
				1072	{Returns the Caesar-cypher encryption of the operand}
				1073
				1074	\lineiv{string_escape}
				1075	{}
				1076	{byte string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1077	{Produce a string that is suitable as string literal in
				1078	Python source code}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1079
				1080	\lineiv{undefined}
				1081	{}
				1082	{any}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1083	{Raise an exception for all conversions. Can be used as the
				1084	system encoding if no automatic coercion between byte and
				1085	Unicode strings is desired.}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1086
				1087	\lineiv{unicode_escape}
				1088	{}
				1089	{Unicode string}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1090	{Produce a string that is suitable as Unicode literal in
				1091	Python source code}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1092
				1093	\lineiv{unicode_internal}
				1094	{}
				1095	{Unicode string}
Raymond Hettinger	6880431	2005-01-01 00:28:46 +0000	[diff] [blame]	1096	{Return the internal representation of the operand}
Martin v. Löwis	5c37a77	2002-12-31 12:39:07 +0000	[diff] [blame]	1097
				1098	\lineiv{uu_codec}
				1099	{uu}
				1100	{byte string}
				1101	{Convert the operand using uuencode}
				1102
				1103	\lineiv{zlib_codec}
				1104	{zip, zlib}
				1105	{byte string}
				1106	{Compress the operand using zlib}
				1107
				1108	\end{tableiv}
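
A few of the byte-string codecs from the table can be exercised as
follows. The sketch uses the \function{codecs.encode()} and
\function{codecs.decode()} functions and the full codec names rather
than string methods and short aliases, on the assumption that this
spelling is the most portable across Python versions; not every codec
in the table is available in every version.

```python
import codecs

payload = b'hello world'

as_hex = codecs.encode(payload, 'hex_codec')        # two digits per byte
as_base64 = codecs.encode(payload, 'base64_codec')  # MIME base64, ends in newline
compressed = codecs.encode(payload, 'zlib_codec')

# Decoding reverses each transformation.
from_hex = codecs.decode(as_hex, 'hex_codec')
restored = codecs.decode(compressed, 'zlib_codec')
```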
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1109
				1110	\subsection{\module{encodings.idna} ---
				1111	Internationalized Domain Names in Applications}
				1112
				1113	\declaremodule{standard}{encodings.idna}
				1114	\modulesynopsis{Internationalized Domain Names implementation}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1115	% XXX The next line triggers a formatting bug, so it's commented out
				1116	% until that can be fixed.
				1117	%\moduleauthor{Martin v. L\"owis}
				1118
				1119	\versionadded{2.3}
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1120
				1121	This module implements \rfc{3490} (Internationalized Domain Names in
				1122	Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
				1123	Internationalized Domain Names (IDN)). It builds upon the
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1124	\code{punycode} encoding and \refmodule{stringprep}.
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1125
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1126	These RFCs together define a protocol to support non-\ASCII{} characters
				1127	in domain names. A domain name containing non-\ASCII{} characters (such
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1128	as ``www.Alliancefran\c caise.nu'') is converted into an
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1129	\ASCII-compatible encoding (ACE, such as
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1130	``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
				1131	is then used in all places where arbitrary characters are not allowed
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1132	by the protocol, such as DNS queries, HTTP \mailheader{Host} fields, and so
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1133	on. This conversion is carried out in the application; if possible
				1134	invisible to the user: The application should transparently convert
				1135	Unicode domain labels to IDNA on the wire, and convert ACE labels back
				1136	to Unicode before presenting them to the user.
				1137
				1138	Python supports this conversion in several ways: The \code{idna} codec
				1139	converts between Unicode and the ACE. Furthermore, the
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1140	\refmodule{socket} module transparently converts Unicode host names to
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1141	ACE, so that applications need not be concerned about converting host
				1142	names themselves when they pass them to the socket module. On top of
				1143	that, modules that have host names as function parameters, such as
Fred Drake	d24c767	2003-07-16 05:17:23 +0000	[diff] [blame]	1144	\refmodule{httplib} and \refmodule{ftplib}, accept Unicode host names
				1145	(\refmodule{httplib} then also transparently sends an IDNA hostname in
				1146	the \mailheader{Host} field if it sends that field at all).
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1147
				1148	When receiving host names from the wire (such as in reverse name
				1149	lookup), no automatic conversion to Unicode is performed: Applications
				1150	wishing to present such host names to the user should decode them to
				1151	Unicode.
				1152
				1153	The module \module{encodings.idna} also implements the nameprep
				1154	procedure, which performs certain normalizations on host names, to
				1155	achieve case-insensitivity of international domain names, and to unify
				1156	similar characters. The nameprep functions can be used directly if
				1157	desired.
				1158
				1159	\begin{funcdesc}{nameprep}{label}
				1160	Return the nameprepped version of \var{label}. The implementation
				1161	currently assumes query strings, so \code{AllowUnassigned} is
				1162	true.
				1163	\end{funcdesc}
				1164
Raymond Hettinger	b5155e3	2003-06-18 01:58:31 +0000	[diff] [blame]	1165	\begin{funcdesc}{ToASCII}{label}
Fred Drake	d4be747	2003-04-30 15:02:07 +0000	[diff] [blame]	1166	Convert a label to \ASCII, as specified in \rfc{3490}.
Martin v. Löwis	2548c73	2003-04-18 10:39:54 +0000	[diff] [blame]	1167	\code{UseSTD3ASCIIRules} is assumed to be false.
				1168	\end{funcdesc}
				1169
				1170	\begin{funcdesc}{ToUnicode}{label}
				1171	Convert a label to Unicode, as specified in \rfc{3490}.
				1172	\end{funcdesc}
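
The three functions and the \code{idna} codec can be combined as in the
following sketch, which reuses the domain name from the introduction of
this section. The variable names are illustrative, and the expected ACE
form is the one quoted earlier in this section.

```python
import encodings.idna

label = u'Alliancefran\xe7aise'

# nameprep folds case and normalizes; ToASCII then produces the ACE label.
prepared = encodings.idna.nameprep(label)
ace_label = encodings.idna.ToASCII(prepared)
round_trip = encodings.idna.ToUnicode(ace_label)

# The codec applies the same steps to each dot-separated label.
ace_name = u'www.alliancefran\xe7aise.nu'.encode('idna')
unicode_name = ace_name.decode('idna')
```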
Martin v. Löwis	412ed3b	2006-01-08 10:45:39 +0000	[diff] [blame]	1173
				1174	\subsection{\module{encodings.utf_8_sig} ---
				1175	UTF-8 codec with BOM signature}
				1176	\declaremodule{standard}{encodings.utf-8-sig} % XXX utf_8_sig gives TeX errors
				1177	\modulesynopsis{UTF-8 codec with BOM signature}
				1178	\moduleauthor{Walter D\"orwald}
				1179
				1180	\versionadded{2.5}
				1181
				1182	This module implements a variant of the UTF-8 codec: On encoding a
				1183	UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
				1184	the stateful encoder this is only done once (on the first write to the
				1185	byte stream). For decoding an optional UTF-8 encoded BOM at the start
				1186	of the data will be skipped.
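
The stateful behaviour can be seen with a stream writer. This is a
minimal sketch that writes to an in-memory buffer; the use of
\class{io.BytesIO} as the underlying stream is an assumption made for
illustration (it requires Python 2.6 or later).

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter('utf-8-sig')(buffer)

writer.write(u'first ')    # the signature is emitted before this data
writer.write(u'second')    # no second signature

written = buffer.getvalue()

# On input the signature is optional, so both forms decode identically.
with_sig = written.decode('utf-8-sig')
without_sig = b'first second'.decode('utf-8-sig')
```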