Blame - Doc/library/email.charset.rst - platform/external/python/cpython2

blob: 01529a0a01c5aed99971b7ba821642ea509ad18b

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`email`: Representing character sets
				2	-----------------------------------------
				3
				4	.. module:: email.charset
				5	:synopsis: Character Sets
				6
				7
				8	This module provides a class :class:`Charset` for representing character sets
				9	and character set conversions in email messages, as well as a character set
				10	registry and several convenience functions for manipulating this registry.
				11	Instances of :class:`Charset` are used in several other modules within the
				12	:mod:`email` package.
				13
				14	Import this class from the :mod:`email.charset` module.
				15
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	16
				17	.. class:: Charset([input_charset])
				18
				19	Map character sets to their email properties.
				20
				21	This class provides information about the requirements imposed on email for a
				22	specific character set. It also provides convenience routines for converting
				23	between character sets, given the availability of the applicable codecs. Given
				24	a character set, it will do its best to provide information on how to use that
				25	character set in an email message in an RFC-compliant way.
				26
				27	Certain character sets must be encoded with quoted-printable or base64 when used
				28	in email headers or bodies. Certain character sets must be converted outright,
				29	and are not allowed in email.
				30
				31	Optional input_charset is as described below; it is always coerced to lower
				32	case. After being alias normalized it is also used as a lookup into the
				33	registry of character sets to find out the header encoding, body encoding, and
				34	output conversion codec to be used for the character set. For example, if
				35	input_charset is ``iso-8859-1``, then headers and bodies will be encoded using
				36	quoted-printable and no output conversion codec is necessary. If
				37	input_charset is ``euc-jp``, then headers will be encoded with base64, bodies
				38	will not be encoded, but output text will be converted from the ``euc-jp``
				39	character set to the ``iso-2022-jp`` character set.
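As a rough illustration of the lookup described above, the following sketch inspects the registry entries for the two character sets mentioned. It assumes the encoding constants are importable from :mod:`email.charset` as the module-level names ``QP`` and ``BASE64``; the variable names are illustrative only.

```python
from email.charset import Charset, QP, BASE64

latin1 = Charset('iso-8859-1')
eucjp = Charset('euc-jp')

# iso-8859-1: quoted-printable for both headers and bodies.
latin1_uses_qp = (latin1.header_encoding == QP and
                  latin1.body_encoding == QP)

# euc-jp: base64 headers, unencoded bodies, output converted to iso-2022-jp.
eucjp_summary = (eucjp.header_encoding == BASE64,
                 eucjp.body_encoding,
                 eucjp.output_charset)
```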
				40
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	41	:class:`Charset` instances have the following data attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	42
				43
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	44	.. attribute:: input_charset
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	45
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	46	The initial character set specified. Common aliases are converted to
				47	their official email names (e.g. ``latin_1`` is converted to
				48	``iso-8859-1``). Defaults to 7-bit ``us-ascii``.
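A minimal sketch of the normalization described for ``input_charset``, using only the constructor and the attribute itself (the variable names are illustrative):

```python
from email.charset import Charset

default_name = Charset().input_charset            # no argument given
alias_name = Charset('latin_1').input_charset     # alias is normalized
upper_name = Charset('ISO-8859-1').input_charset  # coerced to lower case
```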
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	49
				50
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	51	.. attribute:: header_encoding
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	52
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	53	If the character set must be encoded before it can be used in an email
				54	header, this attribute will be set to ``Charset.QP`` (for
				55	quoted-printable), ``Charset.BASE64`` (for base64 encoding), or
				56	``Charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
				57	it will be ``None``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	58
				59
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	60	.. attribute:: body_encoding
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	61
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	62	Same as header_encoding, but describes the encoding for the mail
				63	message's body, which may differ from the header encoding.
				64	``Charset.SHORTEST`` is not allowed for body_encoding.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	65
				66
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	67	.. attribute:: output_charset
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	68
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	69	Some character sets must be converted before they can be used in email headers
				70	or bodies. If the input_charset is one of them, this attribute will
				71	contain the name of the character set output will be converted to. Otherwise, it will
				72	be ``None``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	73
				74
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	75	.. attribute:: input_codec
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	76
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	77	The name of the Python codec used to convert the input_charset to
				78	Unicode. If no conversion codec is necessary, this attribute will be
				79	``None``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	80
				81
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	82	.. attribute:: output_codec
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	83
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	84	The name of the Python codec used to convert Unicode to the
				85	output_charset. If no conversion codec is necessary, this attribute
				86	will have the same value as the input_codec.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	87
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	88	:class:`Charset` instances also have the following methods:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	89
				90
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	91	.. method:: get_body_encoding()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	92
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	93	Return the content transfer encoding used for body encoding.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	94
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	95	This is either the string ``quoted-printable`` or ``base64`` depending on
				96	the encoding used, or it is a function, in which case you should call the
				97	function with a single argument, the Message object being encoded. The
				98	function should then set the :mailheader:`Content-Transfer-Encoding`
				99	header itself to whatever is appropriate.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	100
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	101	Returns the string ``quoted-printable`` if body_encoding is ``QP``,
				102	returns the string ``base64`` if body_encoding is ``BASE64``, and
				103	returns the function described above otherwise.
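Because the return value may be either a string or a function, calling code can check for both. The helper below is a hypothetical sketch (``body_cte`` is not part of the module): when a function is returned, it is called with a :class:`~email.message.Message` and the resulting :mailheader:`Content-Transfer-Encoding` header is read back.

```python
from email.charset import Charset
from email.message import Message


def body_cte(charset_name, sample_text='plain ascii text'):
    """Return the Content-Transfer-Encoding chosen for charset_name."""
    encoding = Charset(charset_name).get_body_encoding()
    if callable(encoding):
        # A function was returned: let it set the header on a message.
        msg = Message()
        msg.set_payload(sample_text)
        encoding(msg)
        return msg['Content-Transfer-Encoding']
    # Otherwise the encoding name was returned directly as a string.
    return encoding
```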
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	104
				105
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	106	.. method:: convert(s)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	107
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	108	Convert the string s from the input_codec to the output_codec.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	109
				110
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	111	.. method:: to_splittable(s)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	112
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	113	Convert a possibly multibyte string to a safely splittable format. s is
				114	the string to split.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	115
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	116	Uses the input_codec to try to convert the string to Unicode, so it can
				117	be safely split on character boundaries (even for multibyte characters).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	118
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	119	Returns the string as-is if it isn't known how to convert s to Unicode
				120	with the input_charset.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	121
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	122	Characters that could not be converted to Unicode will be replaced with
				123	the Unicode replacement character ``'U+FFFD'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	124
				125
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	126	.. method:: from_splittable(ustr[, to_output])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	127
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	128	Convert a splittable string back into an encoded string. ustr is a
				129	Unicode string to "unsplit".
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	130
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	131	This method uses the proper codec to try to convert the string from
				132	Unicode back into an encoded format. Returns the string as-is if it is not
				133	Unicode, or if it could not be converted from Unicode.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	134
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	135	Characters that could not be converted from Unicode will be replaced with
				136	an appropriate character (usually ``'?'``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	137
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	138	If to_output is ``True`` (the default), uses output_codec to convert
				139	to an encoded format. If to_output is ``False``, it uses input_codec.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	140
				141
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	142	.. method:: get_output_charset()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	143
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	144	Return the output character set.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	145
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	146	This is the output_charset attribute if that is not ``None``, otherwise
				147	it is input_charset.
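A short sketch of the fallback described above, comparing a character set that has an output conversion with one that does not (variable names are illustrative):

```python
from email.charset import Charset

# euc-jp is registered with an output conversion to iso-2022-jp.
converted = Charset('euc-jp').get_output_charset()

# iso-8859-1 has no output conversion, so the input charset is reported.
unconverted = Charset('iso-8859-1').get_output_charset()
```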
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	148
				149
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	150	.. method:: encoded_header_len()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	151
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	152	Return the length of the encoded header string, properly calculating for
				153	quoted-printable or base64 encoding.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	154
				155
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	156	.. method:: header_encode(s[, convert])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	157
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	158	Header-encode the string s.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	159
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	160	If convert is ``True``, the string will be converted from the input
				161	charset to the output charset automatically. This is not useful for
				162	multibyte character sets, which have line length issues (multibyte
				163	characters must be split on a character, not a byte boundary); use the
				164	higher-level :class:`Header` class to deal with these issues (see
				165	:mod:`email.header`). convert defaults to ``False``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	166
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	167	The type of encoding (base64 or quoted-printable) will be based on the
				168	header_encoding attribute.
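The following sketch header-encodes a short single-byte string, calling the method with one argument only. Since ``iso-8859-1`` uses quoted-printable for headers, the result is expected to be an RFC 2047 encoded word; the exact text of the encoded word is not asserted here.

```python
from email.charset import Charset

latin1 = Charset('iso-8859-1')

# 'Caf\xe9' is "Cafe" with an acute accent on the final letter; the escape
# keeps this source pure ASCII.
encoded = latin1.header_encode('Caf\xe9')

# An encoded word has the form =?charset?encoding?text?=
looks_like_encoded_word = (encoded.lower().startswith('=?iso-8859-1?q?') and
                           encoded.endswith('?='))
```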
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	169
				170
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	171	.. method:: body_encode(s[, convert])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	172
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	173	Body-encode the string s.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	174
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	175	If convert is ``True`` (the default), the string will be converted from
				176	the input charset to output charset automatically. Unlike
				177	:meth:`header_encode`, there are no issues with byte boundaries and
				178	multibyte charsets in email bodies, so this conversion is usually safe.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	179
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	180	The type of encoding (base64 or quoted-printable) will be based on the
				181	body_encoding attribute.
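A minimal sketch of body encoding, again with a single argument. With ``iso-8859-1`` the body encoding is quoted-printable, so the non-ASCII character is expected to appear as an ``=XX`` escape.

```python
from email.charset import Charset

latin1 = Charset('iso-8859-1')

# The character \xe9 (e with acute accent) cannot appear raw in a 7-bit body.
body = latin1.body_encode('Caf\xe9 au lait')
```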
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	182
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	183	The :class:`Charset` class also provides a number of methods to support
				184	standard operations and built-in functions.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	185
				186
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	187	.. method:: __str__()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	188
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	189	Returns input_charset as a string coerced to lower
				190	case. :meth:`__repr__` is an alias for :meth:`__str__`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	191
				192
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	193	.. method:: __eq__(other)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	194
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	195	This method allows you to compare two :class:`Charset` instances for
				196	equality.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	197
				198
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	199	.. method:: __ne__(other)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	200
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	201	This method allows you to compare two :class:`Charset` instances for
				202	inequality.
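These special methods can be exercised with ordinary built-ins and operators. The sketch below assumes nothing beyond the behaviour described above.

```python
from email.charset import Charset

a = Charset('LATIN_1')      # alias in upper case
b = Charset('iso-8859-1')   # canonical name

name = str(a)                           # normalized, lower-case name
same = (a == b)                         # both name the same charset
different = (a != Charset('euc-jp'))    # different charsets compare unequal
```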
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	203
				204	The :mod:`email.charset` module also provides the following functions for adding
				205	new entries to the global character set, alias, and codec registries:
				206
				207
				208	.. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])
				209
				210	Add character properties to the global registry.
				211
				212	charset is the input character set, and must be the canonical name of a
				213	character set.
				214
				215	Optional header_enc and body_enc are each either ``Charset.QP`` for
				216	quoted-printable, ``Charset.BASE64`` for base64 encoding,
				217	``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
				218	or ``None`` for no encoding. ``SHORTEST`` is only valid for
				219	header_enc. The default is ``None`` for no encoding.
				220
				221	Optional output_charset is the character set that the output should be in.
				222	Conversions will proceed from input charset, to Unicode, to the output charset
				223	when the method :meth:`Charset.convert` is called. The default is to output in
				224	the same character set as the input.
				225
				226	Both charset and output_charset must have Unicode codec entries in the
				227	module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
				228	module does not know about. See the :mod:`codecs` module's documentation for
				229	more information.
				230
				231	The global character set registry is kept in the module global dictionary
				232	``CHARSETS``.
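As a sketch of registering a new entry, the example below adds a made-up character set name, ``x-demo-charset``, which is purely hypothetical and has no real codec behind it. It assumes the constants ``QP`` and ``BASE64`` are importable from :mod:`email.charset`. Note that the call modifies the process-wide registry.

```python
from email.charset import Charset, add_charset, CHARSETS, QP, BASE64

# Hypothetical charset: QP headers, base64 bodies, output converted to utf-8.
add_charset('x-demo-charset', QP, BASE64, 'utf-8')

demo = Charset('x-demo-charset')
registered = 'x-demo-charset' in CHARSETS
```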
				233
				234
				235	.. function:: add_alias(alias, canonical)
				236
				237	Add a character set alias. alias is the alias name, e.g. ``latin-1``.
				238	canonical is the character set's canonical name, e.g. ``iso-8859-1``.
				239
				240	The global charset alias registry is kept in the module global dictionary
				241	``ALIASES``.
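A small sketch of adding an alias. The alias name ``x-demo-alias`` is hypothetical; it is given in lower case because the constructor lower-cases its argument before the alias lookup.

```python
from email.charset import Charset, add_alias, ALIASES

# Hypothetical alias pointing at a canonical name already in the registry.
add_alias('x-demo-alias', 'iso-8859-1')

resolved = Charset('x-demo-alias').input_charset
```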
				242
				243
				244	.. function:: add_codec(charset, codecname)
				245
				246	Add a codec that maps characters in the given character set to and from Unicode.
				247
				248	charset is the canonical name of a character set. codecname is the name of a
Georg Brandl	f694518	2008-02-01 11:56:49 +0000	[diff] [blame]	249	Python codec, as appropriate for the encoding argument to the :class:`str`
				250	:meth:`decode` method.
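The sketch below maps a hypothetical character set name, ``x-demo-latin``, to the standard ``latin_1`` Python codec and then uses the resulting ``input_codec`` to decode a byte string. Only the character set name is invented; ``latin_1`` is a real codec.

```python
from email.charset import Charset, add_codec

# Tell the module which Python codec handles the hypothetical charset.
add_codec('x-demo-latin', 'latin_1')

cs = Charset('x-demo-latin')
codec_name = cs.input_codec

# The codec name can be passed straight to decode().
text = b'Caf\xe9'.decode(codec_name)
```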
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	251