Blame - Doc/lib/emailheaders.tex - platform/external/python/cpython2

Barry Warsaw	5b9da89	2002-10-01 01:05:52 +0000	1	\declaremodule{standard}{email.Header}
				2	\modulesynopsis{Representing non-ASCII headers}
				3
				4	\rfc{2822} is the base standard that describes the format of email
				5	messages. It derives from the older \rfc{822} standard which came
				6	into widespread use at a time when most email was composed of \ASCII{}
				7	characters only. \rfc{2822} is a specification written assuming email
				8	contains only 7-bit \ASCII{} characters.
				9
				10	Of course, as email has been deployed worldwide, it has become
				11	internationalized, such that language specific character sets can now
				12	be used in email messages. The base standard still requires email
				13	messages to be transferred using only 7-bit \ASCII{} characters, so a
				14	slew of RFCs have been written describing how to encode email
				15	containing non-\ASCII{} characters into \rfc{2822}-compliant format.
				16	These RFCs include \rfc{2045}, \rfc{2046}, \rfc{2047}, and \rfc{2231}.
				17	The \module{email} package supports these standards in its
				18	\module{email.Header} and \module{email.Charset} modules.
				19
				20	If you want to include non-\ASCII{} characters in your email headers,
				21	say in the \mailheader{Subject} or \mailheader{To} fields, you should
				22	use the \class{Header} class (in module \module{email.Header}) and
				23	assign the field in the \class{Message} object to an instance of
				24	\class{Header} instead of using a string for the header value. For
				25	example:
				26
				27	\begin{verbatim}
				28	>>> from email.Message import Message
				29	>>> from email.Header import Header
				30	>>> msg = Message()
				31	>>> h = Header('p\xf6stal', 'iso-8859-1')
				32	>>> msg['Subject'] = h
				33	>>> print msg.as_string()
				34	Subject: =?iso-8859-1?q?p=F6stal?=
				35
				36
				37	\end{verbatim}
				38
				39	In this example, we wanted the \mailheader{Subject} field to contain a
				40	non-\ASCII{} character. We did this by creating a \class{Header}
				41	instance and passing in the character set that the byte string was
				42	encoded in. When the subsequent \class{Message} instance was
				43	flattened, the \mailheader{Subject} field was properly \rfc{2047}
				44	encoded. MIME-aware mail readers would show this header using the
				45	embedded ISO-8859-1 character.
				46
				47	\versionadded{2.2.2}
				48
				49	Here is the \class{Header} class description:
				50
				51	\begin{classdesc}{Header}{\optional{s\optional{, charset\optional{,
				52	maxlinelen\optional{, header_name\optional{, continuation_ws}}}}}}
				53	Create a MIME-compliant header that can contain many character sets.
				54
				55	Optional \var{s} is the initial header value. If \code{None} (the
				56	default), the initial header value is not set. You can later append
				57	to the header with \method{append()} method calls. \var{s} may be a
				58	byte string or a Unicode string, but see the \method{append()}
				59	documentation for semantics.
				60
				61	Optional \var{charset} serves two purposes: it has the same meaning as
				62	the \var{charset} argument to the \method{append()} method. It also
				63	sets the default character set for all subsequent \method{append()}
				64	calls that omit the \var{charset} argument. If \var{charset} is not
				65	provided in the constructor (the default), the \code{us-ascii}
				66	character set is used both as \var{s}'s initial charset and as the
				67	default for subsequent \method{append()} calls.
				68
				69	The maximum line length can be specified explicitly via
				70	\var{maxlinelen}. For splitting the first line to a shorter value (to
				71	account for the field header which isn't included in \var{s},
				72	e.g. \mailheader{Subject}) pass in the name of the field in
				73	\var{header_name}. The default \var{maxlinelen} is 76, and the
				74	default value for \var{header_name} is \code{None}, meaning it is not
				75	taken into account for the first line of a long, split header.
				76
				77	Optional \var{continuation_ws} must be \rfc{2822}-compliant folding
				78	whitespace, and is usually either a space or a hard tab character.
				79	This character will be prepended to continuation lines.
				80	\end{classdesc}
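
As a rough sketch of how \var{maxlinelen} and \var{header_name}
interact, the following assumes a Python 3 interpreter, where this
module is spelled \module{email.header} (lower case) and the default
line length differs, so 76 is passed explicitly. The sample text is
invented for illustration.

```python
from email.header import Header

# Illustrative value: 30 short words, roughly 150 characters in total.
text = ' '.join(['word'] * 30)

# header_name reserves room on the first line for the "Subject: " prefix,
# which is not part of the header value itself.
h = Header(text, maxlinelen=76, header_name='Subject', continuation_ws=' ')

# encode() returns the folded value; split it to inspect each physical line.
lines = h.encode().split('\n')
for line in lines:
    print(len(line), repr(line))
```

The first line comes out shorter than the rest because space is left
for the field name, and each continuation line begins with folding
whitespace.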
				81
				82	\begin{methoddesc}[Header]{append}{s\optional{, charset}}
				83	Append the string \var{s} to the MIME header.
				84
				85	Optional \var{charset}, if given, should be a \class{Charset} instance
				86	(see \refmodule{email.Charset}) or the name of a character set, which
				87	will be converted to a \class{Charset} instance. A value of
				88	\code{None} (the default) means that the \var{charset} given in the
				89	constructor is used.
				90
				91	\var{s} may be a byte string or a Unicode string. If it is a byte
				92	string (i.e. \code{isinstance(s, StringType)} is true), then
				93	\var{charset} is the encoding of that byte string, and a
				94	\exception{UnicodeError} will be raised if the string cannot be
				95	decoded with that character set.
				96
				97	If \var{s} is a Unicode string, then \var{charset} is a hint
				98	specifying the character set of the characters in the string. In this
				99	case, when producing an \rfc{2822}-compliant header using \rfc{2047}
				100	rules, the Unicode string will be encoded using the following charsets
				101	in order: \code{us-ascii}, the \var{charset} hint, \code{utf-8}. The
				102	first character set to not provoke a \exception{UnicodeError} is used.
				103	\end{methoddesc}
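
The following sketch shows \method{append()} combining parts in
different character sets. It assumes Python 3, where the module is
named \module{email.header} and all strings are Unicode, so the
byte-string rules described above do not apply directly. The header
text is made up for the example.

```python
from email.header import Header

# Start with a plain ASCII part, then append a Latin-1 part.
h = Header('Hello', 'us-ascii')
h.append('p\xf6stal', 'iso-8859-1')

# Only the non-ASCII part is wrapped in an RFC 2047 encoded word.
encoded = h.encode()
print(encoded)
```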
				104
				105	\begin{methoddesc}[Header]{encode}{}
				106	Encode a message header into an RFC-compliant format, possibly
				107	wrapping long lines and encapsulating non-\ASCII{} parts in base64 or
				108	quoted-printable encodings.
				109	\end{methoddesc}
				110
				111	The \class{Header} class also provides a number of methods to support
				112	standard operators and built-in functions.
				113
				114	\begin{methoddesc}[Header]{__str__}{}
				115	A synonym for \method{Header.encode()}. Useful for
				116	\code{str(aHeader)} calls.
				117	\end{methoddesc}
				118
				119	\begin{methoddesc}[Header]{__unicode__}{}
				120	A helper for the built-in \function{unicode()} function. Returns the
				121	header as a Unicode string.
				122	\end{methoddesc}
				123
				124	\begin{methoddesc}[Header]{__eq__}{other}
				125	This method allows you to compare two \class{Header} instances for equality.
				126	\end{methoddesc}
				127
				128	\begin{methoddesc}[Header]{__ne__}{other}
				129	This method allows you to compare two \class{Header} instances for inequality.
				130	\end{methoddesc}
				131
				132	The \module{email.Header} module also provides the following
				133	convenient functions.
				134
				135	\begin{funcdesc}{decode_header}{header}
				136	Decode a message header value without converting the character set.
				137	The header value is in \var{header}.
				138
				139	This function returns a list of \code{(decoded_string, charset)} pairs
				140	containing each of the decoded parts of the header. \var{charset} is
				141	\code{None} for non-encoded parts of the header, otherwise a lower
				142	case string containing the name of the character set specified in the
				143	encoded string.
				144
				145	Here's an example:
				146
				147	\begin{verbatim}
				148	>>> from email.Header import decode_header
				149	>>> decode_header('=?iso-8859-1?q?p=F6stal?=')
				150	[('p\xf6stal', 'iso-8859-1')]
				151	\end{verbatim}
				152	\end{funcdesc}
				153
				154	\begin{funcdesc}{make_header}{decoded_seq\optional{, maxlinelen\optional{,
				155	header_name\optional{, continuation_ws}}}}
				156	Create a \class{Header} instance from a sequence of pairs as returned
				157	by \function{decode_header()}.
				158
				159	\function{decode_header()} takes a header value string and returns a
				160	sequence of pairs of the format \code{(decoded_string, charset)} where
				161	\var{charset} is the name of the character set.
				162
				163	This function takes one of those sequences of pairs and returns a
				164	\class{Header} instance. Optional \var{maxlinelen},
				165	\var{header_name}, and \var{continuation_ws} are as in the
				166	\class{Header} constructor.
				167	\end{funcdesc}
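
A round trip through \function{decode_header()} and
\function{make_header()} might look like the sketch below. It assumes
Python 3, where the module is \module{email.header}, decoded parts of
encoded words come back as \code{bytes}, and \code{str()} on a
\class{Header} yields the decoded text rather than the encoded form.

```python
from email.header import decode_header, make_header

raw = '=?iso-8859-1?q?p=F6stal?='

# Split the header value into (decoded_string, charset) pairs.
pairs = decode_header(raw)

# Rebuild a Header instance from those pairs and encode it again.
h = make_header(pairs, header_name='Subject')
reencoded = h.encode()

print(pairs)
print(reencoded)
```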
				168
				169	\declaremodule{standard}{email.Charset}
				170	\modulesynopsis{Character Sets}
				171
				172	This module provides a class \class{Charset} for representing
				173	character sets and character set conversions in email messages, as
				174	well as a character set registry and several convenience methods for
				175	manipulating this registry. Instances of \class{Charset} are used in
				176	several other modules within the \module{email} package.
				177
				178	\versionadded{2.2.2}
				179
				180	\begin{classdesc}{Charset}{\optional{input_charset}}
				181	Map character sets to their email properties.
				182
				183	This class provides information about the requirements imposed on
				184	email for a specific character set. It also provides convenience
				185	routines for converting between character sets, given the availability
				186	of the applicable codecs. Given a character set, it will do its best
				187	to provide information on how to use that character set in an email
				188	message in an RFC-compliant way.
				189
				190	Certain character sets must be encoded with quoted-printable or base64
				191	when used in email headers or bodies. Certain character sets must be
				192	converted outright, and are not allowed in email.
				193
				194	Optional \var{input_charset} is as described below. After being alias
				195	normalized it is also used as a lookup into the registry of character
				196	sets to find out the header encoding, body encoding, and output
				197	conversion codec to be used for the character set. For example, if
				198	\var{input_charset} is \code{iso-8859-1}, then headers and bodies will
				199	be encoded using quoted-printable and no output conversion codec is
				200	necessary. If \var{input_charset} is \code{euc-jp}, then headers will
				201	be encoded with base64, bodies will not be encoded, but output text
				202	will be converted from the \code{euc-jp} character set to the
				203	\code{iso-2022-jp} character set.
				204	\end{classdesc}
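
The registry lookups described above can be inspected directly. This
sketch assumes Python 3, where the module is \module{email.charset}
and the encoding constants are importable from it.

```python
from email.charset import Charset, QP, BASE64

latin1 = Charset('iso-8859-1')
eucjp = Charset('euc-jp')

# iso-8859-1: quoted-printable for both headers and bodies.
print(latin1.header_encoding == QP, latin1.body_encoding == QP)

# euc-jp: base64 headers, unencoded bodies, output converted to iso-2022-jp.
print(eucjp.header_encoding == BASE64, eucjp.body_encoding, eucjp.output_charset)

# Aliases are normalized to the official email name.
alias = Charset('latin_1')
print(alias.input_charset)
```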
				205
				206	\class{Charset} instances have the following data attributes:
				207
				208	\begin{datadesc}{input_charset}
				209	The initial character set specified. Common aliases are converted to
				210	their \emph{official} email names (e.g. \code{latin_1} is converted to
				211	\code{iso-8859-1}). Defaults to 7-bit \code{us-ascii}.
				212	\end{datadesc}
				213
				214	\begin{datadesc}{header_encoding}
				215	If the character set must be encoded before it can be used in an
				216	email header, this attribute will be set to \code{Charset.QP} (for
				217	quoted-printable), \code{Charset.BASE64} (for base64 encoding), or
				218	\code{Charset.SHORTEST} for the shortest of QP or BASE64 encoding.
				219	Otherwise, it will be \code{None}.
				220	\end{datadesc}
				221
				222	\begin{datadesc}{body_encoding}
				223	Same as \var{header_encoding}, but describes the encoding for the
				224	mail message's body, which indeed may be different than the header
				225	encoding. \code{Charset.SHORTEST} is not allowed for
				226	\var{body_encoding}.
				227	\end{datadesc}
				228
				229	\begin{datadesc}{output_charset}
				230	Some character sets must be converted before they can be used in
				231	email headers or bodies. If the \var{input_charset} is one of
				232	them, this attribute will contain the name of the character set
				233	output will be converted to. Otherwise, it will be \code{None}.
				234	\end{datadesc}
				235
				236	\begin{datadesc}{input_codec}
				237	The name of the Python codec used to convert the \var{input_charset} to
				238	Unicode. If no conversion codec is necessary, this attribute will be
				239	\code{None}.
				240	\end{datadesc}
				241
				242	\begin{datadesc}{output_codec}
				243	The name of the Python codec used to convert Unicode to the
				244	\var{output_charset}. If no conversion codec is necessary, this
				245	attribute will have the same value as the \var{input_codec}.
				246	\end{datadesc}
				247
				248	\class{Charset} instances also have the following methods:
				249
				250	\begin{methoddesc}[Charset]{get_body_encoding}{}
				251	Return the content transfer encoding used for body encoding.
				252
				253	This is either the string \samp{quoted-printable} or \samp{base64}
				254	depending on the encoding used, or it is a function, in which case you
				255	should call the function with a single argument, the Message object
				256	being encoded. The function should then set the
				257	\mailheader{Content-Transfer-Encoding} header itself to whatever is
				258	appropriate.
				259
				260	Returns the string \samp{quoted-printable} if
				261	\var{body_encoding} is \code{QP}, returns the string
				262	\samp{base64} if \var{body_encoding} is \code{BASE64}, and returns the
				263	string \samp{7bit} otherwise.
				264	\end{methoddesc}
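
A short sketch of the two string results, assuming Python 3 and its
\module{email.charset} module. Only the quoted-printable and base64
cases are shown, since the return value for unencoded bodies may not
be a string.

```python
from email.charset import Charset

# iso-8859-1 bodies are registered as quoted-printable.
qp_name = Charset('iso-8859-1').get_body_encoding()

# utf-8 bodies are registered as base64.
b64_name = Charset('utf-8').get_body_encoding()

print(qp_name, b64_name)
```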
				265
				266	\begin{methoddesc}{convert}{s}
				267	Convert the string \var{s} from the \var{input_codec} to the
				268	\var{output_codec}.
				269	\end{methoddesc}
				270
				271	\begin{methoddesc}{to_splittable}{s}
				272	Convert a possibly multibyte string to a safely splittable format.
				273	\var{s} is the string to split.
				274
				275	Uses the \var{input_codec} to try and convert the string to Unicode,
				276	so it can be safely split on character boundaries (even for multibyte
				277	characters).
				278
				279	Returns the string as-is if it isn't known how to convert \var{s} to
				280	Unicode with the \var{input_charset}.
				281
				282	Characters that could not be converted to Unicode will be replaced
				283	with the Unicode replacement character \character{U+FFFD}.
				284	\end{methoddesc}
				285
				286	\begin{methoddesc}{from_splittable}{ustr\optional{, to_output}}
				287	Convert a splittable string back into an encoded string. \var{ustr}
				288	is a Unicode string to ``unsplit''.
				289
				290	This method uses the proper codec to try and convert the string from
				291	Unicode back into an encoded format. Return the string as-is if it is
				292	not Unicode, or if it could not be converted from Unicode.
				293
				294	Characters that could not be converted from Unicode will be replaced
				295	with an appropriate character (usually \character{?}).
				296
				297	If \var{to_output} is \code{True} (the default), uses
				298	\var{output_codec} to convert to an
				299	encoded format. If \var{to_output} is \code{False}, it uses
				300	\var{input_codec}.
				301	\end{methoddesc}
				302
				303	\begin{methoddesc}{get_output_charset}{}
				304	Return the output character set.
				305
				306	This is the \var{output_charset} attribute if that is not \code{None},
				307	otherwise it is \var{input_charset}.
				308	\end{methoddesc}
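
For illustration, and assuming Python 3's \module{email.charset}, the
fallback to \var{input_charset} can be seen by comparing a character
set that has an output conversion with one that does not:

```python
from email.charset import Charset

# euc-jp is converted on output, so the output charset differs from the input.
converted = Charset('euc-jp').get_output_charset()

# iso-8859-1 needs no conversion, so the input charset is returned.
unconverted = Charset('iso-8859-1').get_output_charset()

print(converted, unconverted)
```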
				309
				310	\begin{methoddesc}{encoded_header_len}{}
				311	Return the length of the encoded header string, properly calculating
				312	for quoted-printable or base64 encoding.
				313	\end{methoddesc}
				314
				315	\begin{methoddesc}{header_encode}{s\optional{, convert}}
				316	Header-encode the string \var{s}.
				317
				318	If \var{convert} is \code{True}, the string will be converted from the
				319	input charset to the output charset automatically. This is not useful
				320	for multibyte character sets, which have line length issues (multibyte
				321	characters must be split on a character, not a byte boundary); use the
				322	higher-level \class{Header} class to deal with these issues (see
				323	\refmodule{email.Header}). \var{convert} defaults to \code{False}.
				324
				325	The type of encoding (base64 or quoted-printable) will be based on
				326	the \var{header_encoding} attribute.
				327	\end{methoddesc}
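
A minimal sketch of header encoding with a single-byte character set,
assuming Python 3, where the module is \module{email.charset} and
\method{header_encode()} takes only the string (the \var{convert}
argument described above is not passed here).

```python
from email.charset import Charset

latin1 = Charset('iso-8859-1')

# header_encoding for iso-8859-1 is quoted-printable, so the result is
# an RFC 2047 "q" encoded word.
encoded_word = latin1.header_encode('p\xf6stal')
print(encoded_word)
```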
				328
				329	\begin{methoddesc}{body_encode}{s\optional{, convert}}
				330	Body-encode the string \var{s}.
				331
				332	If \var{convert} is \code{True} (the default), the string will be
				333	converted from the input charset to output charset automatically.
				334	Unlike \method{header_encode()}, there are no issues with byte
				335	boundaries and multibyte charsets in email bodies, so this is usually
				336	pretty safe.
				337
				338	The type of encoding (base64 or quoted-printable) will be based on
				339	the \var{body_encoding} attribute.
				340	\end{methoddesc}
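
The sketch below body-encodes short strings under both encodings. It
assumes Python 3's \module{email.charset}, where
\method{body_encode()} takes only the string; the sample strings are
illustrative.

```python
from email.charset import Charset

# iso-8859-1 bodies use quoted-printable: the non-ASCII byte becomes =F6.
qp_body = Charset('iso-8859-1').body_encode('p\xf6stal')

# utf-8 bodies use base64; strip the trailing line ending for display.
b64_body = Charset('utf-8').body_encode('hello').strip()

print(qp_body)
print(b64_body)
```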
				341
				342	The \class{Charset} class also provides a number of methods to support
				343	standard operations and built-in functions.
				344
				345	\begin{methoddesc}[Charset]{__str__}{}
				346	Returns \var{input_charset} as a string coerced to lower case.
				347	\end{methoddesc}
				348
				349	\begin{methoddesc}[Charset]{__eq__}{other}
				350	This method allows you to compare two \class{Charset} instances for equality.
				351	\end{methoddesc}
				352
				353	\begin{methoddesc}[Charset]{__ne__}{other}
				354	This method allows you to compare two \class{Charset} instances for inequality.
				355	\end{methoddesc}
				356
				357	The \module{email.Charset} module also provides the following
				358	functions for adding new entries to the global character set, alias,
				359	and codec registries:
				360
				361	\begin{funcdesc}{add_charset}{charset\optional{, header_enc\optional{,
				362	body_enc\optional{, output_charset}}}}
				363	Add character properties to the global registry.
				364
				365	\var{charset} is the input character set, and must be the canonical
				366	name of a character set.
				367
				368	Optional \var{header_enc} and \var{body_enc} are each either
				369	\code{Charset.QP} for quoted-printable, \code{Charset.BASE64} for
				370	base64 encoding, \code{Charset.SHORTEST} for the shortest of QP or
				371	base64 encoding, or \code{None} for no encoding. \code{SHORTEST} is
				372	only valid for \var{header_enc}. They describe how message headers and
				373	message bodies in the input charset are to be encoded. The default is no
				374	encoding.
				375
				376	Optional \var{output_charset} is the character set that the output
				377	should be in. Conversions will proceed from input charset, to
				378	Unicode, to the output charset when the method
				379	\method{Charset.convert()} is called. The default is to output in the
				380	same character set as the input.
				381
				382	Both \var{charset} and \var{output_charset} must have Unicode
				383	codec entries in the module's character set-to-codec mapping; use
				384	\function{add_codec(charset, codecname)} to add codecs the module does
				385	not know about. See the \refmodule{codecs} module's documentation for
				386	more information.
				387
				388	The global character set registry is kept in the module global
				389	dictionary \code{CHARSETS}.
				390	\end{funcdesc}
				391
				392	\begin{funcdesc}{add_alias}{alias, canonical}
				393	Add a character set alias. \var{alias} is the alias name,
				394	e.g. \code{latin-1}. \var{canonical} is the character set's canonical
				395	name, e.g. \code{iso-8859-1}.
				396
				397	The global charset alias registry is kept in the module global
				398	dictionary \code{ALIASES}.
				399	\end{funcdesc}
				400
				401	\begin{funcdesc}{add_codec}{charset, codecname}
				402	Add a codec that maps characters in the given character set to and from
				403	Unicode.
				404
				405	\var{charset} is the canonical name of a character set.
				406	\var{codecname} is the name of a Python codec, as appropriate for the
				407	second argument to the \function{unicode()} built-in, or to the
				408	\method{encode()} method of a Unicode string.
				409	\end{funcdesc}
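
The three registry functions can be combined as in the sketch below.
It assumes Python 3, where the module is \module{email.charset}. The
names \code{x-demo} and \code{x-demo-alias} are hypothetical and exist
only for this example. Note that the registries are module globals, so
these calls affect the whole process.

```python
from email.charset import Charset, QP, add_alias, add_charset, add_codec

# Register a hypothetical character set that uses quoted-printable
# for both headers and bodies, with no output conversion.
add_charset('x-demo', QP, QP)

# Map the hypothetical charset to an existing Python codec.
add_codec('x-demo', 'latin-1')

# Let a second (also hypothetical) name resolve to the canonical one.
add_alias('x-demo-alias', 'x-demo')

cs = Charset('x-demo-alias')
print(cs.input_charset, cs.header_encoding == QP, cs.input_codec)
```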