Blame - Doc/library/email.charset.rst - platform/external/python/cpython3

blob: 19a69532edabe84200d29547beb52ef5ed35d640

R David Murray	79cf3ba	2012-05-27 17:10:36 -0400	[diff] [blame]	1	:mod:`email.charset`: Representing character sets
				2	-------------------------------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	3
				4	.. module:: email.charset
				5	:synopsis: Character Sets
				6
				7
				8	This module provides a class :class:`Charset` for representing character sets
				9	and character set conversions in email messages, as well as a character set
				10	registry and several convenience methods for manipulating this registry.
				11	Instances of :class:`Charset` are used in several other modules within the
				12	:mod:`email` package.
				13
				14	Import this class from the :mod:`email.charset` module.
				15
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	16
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	17	.. class:: Charset(input_charset=DEFAULT_CHARSET)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19	Map character sets to their email properties.
				20
				21	This class provides information about the requirements imposed on email for a
				22	specific character set. It also provides convenience routines for converting
				23	between character sets, given the availability of the applicable codecs. Given
				24	a character set, it will do its best to provide information on how to use that
				25	character set in an email message in an RFC-compliant way.
				26
				27	Certain character sets must be encoded with quoted-printable or base64 when used
				28	in email headers or bodies. Certain character sets must be converted outright,
				29	and are not allowed in email.
				30
				31	Optional input_charset is as described below; it is always coerced to lower
				32	case. After being alias normalized it is also used as a lookup into the
				33	registry of character sets to find out the header encoding, body encoding, and
				34	output conversion codec to be used for the character set. For example, if
				35	input_charset is ``iso-8859-1``, then headers and bodies will be encoded using
				36	quoted-printable and no output conversion codec is necessary. If
				37	input_charset is ``euc-jp``, then headers will be encoded with base64, bodies
				38	will not be encoded, but output text will be converted from the ``euc-jp``
				39	character set to the ``iso-2022-jp`` character set.
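
The lookup described above can be checked directly. The following is a minimal
sketch that assumes the registry still holds its stock entries for
``iso-8859-1`` and ``euc-jp``; ``QP`` and ``BASE64`` are the module-level
constants of :mod:`email.charset`.

.. code-block:: python

   from email.charset import BASE64, QP, Charset

   # iso-8859-1: quoted-printable for both headers and bodies.
   latin1 = Charset('iso-8859-1')
   print(latin1.header_encoding == QP, latin1.body_encoding == QP)

   # euc-jp: base64 headers, unencoded bodies, output converted to iso-2022-jp.
   euc_jp = Charset('euc-jp')
   print(euc_jp.header_encoding == BASE64, euc_jp.body_encoding)
   print(euc_jp.output_charset)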
				40
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	41	:class:`Charset` instances have the following data attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	42
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	43	.. attribute:: input_charset
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	44
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	45	The initial character set specified. Common aliases are converted to
				46	their official email names (e.g. ``latin_1`` is converted to
				47	``iso-8859-1``). Defaults to 7-bit ``us-ascii``.
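
As a small illustration of this normalization (a sketch, assuming the stock
alias table), an alias, an upper-case name, and the default all resolve to
lower-case official names:

.. code-block:: python

   from email.charset import Charset

   # An alias and a mixed-case spelling both map to the official email name.
   alias = Charset('latin_1')
   upper = Charset('ISO-8859-1')
   # With no argument, the default character set is used.
   default = Charset()

   print(alias.input_charset, upper.input_charset, default.input_charset)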
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	48
				49
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	50	.. attribute:: header_encoding
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	51
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	52	If the character set must be encoded before it can be used in an email
				53	header, this attribute will be set to ``Charset.QP`` (for
				54	quoted-printable), ``Charset.BASE64`` (for base64 encoding), or
				55	``Charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
				56	it will be ``None``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	57
				58
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	59	.. attribute:: body_encoding
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	60
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	61	Same as header_encoding, but describes the encoding for the mail
				62	message's body, which may differ from the header encoding.
				63	``Charset.SHORTEST`` is not allowed for body_encoding.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	64
				65
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	66	.. attribute:: output_charset
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	67
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	68	Some character sets must be converted before they can be used in email
				69	headers or bodies. If the input_charset is one of them, this attribute
				70	will contain the name of the character set output will be converted to.
				71	Otherwise, it will be ``None``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	72
				73
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	74	.. attribute:: input_codec
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	75
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	76	The name of the Python codec used to convert the input_charset to
				77	Unicode. If no conversion codec is necessary, this attribute will be
				78	``None``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	79
				80
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	81	.. attribute:: output_codec
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	82
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	83	The name of the Python codec used to convert Unicode to the
				84	output_charset. If no conversion codec is necessary, this attribute
				85	will have the same value as the input_codec.
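
The codec attributes can be inspected in the same way. This sketch assumes the
stock registry and codec map shipped with current CPython; the values may
differ if entries have been added or changed at run time.

.. code-block:: python

   from email.charset import Charset

   euc_jp = Charset('euc-jp')
   # Codec used to turn euc-jp input into Unicode.
   print(euc_jp.input_codec)
   # Codec used to turn Unicode into the output character set.
   print(euc_jp.output_codec)

   # us-ascii needs no conversion codec at all.
   ascii_cs = Charset('us-ascii')
   print(ascii_cs.input_codec)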
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	86
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	87
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	88	:class:`Charset` instances also have the following methods:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	89
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	90	.. method:: get_body_encoding()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	91
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	92	Return the content transfer encoding used for body encoding.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	93
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	94	This is either the string ``quoted-printable`` or ``base64`` depending on
				95	the encoding used, or it is a function, in which case you should call the
				96	function with a single argument, the Message object being encoded. The
				97	function should then set the :mailheader:`Content-Transfer-Encoding`
				98	header itself to whatever is appropriate.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	99
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	100	Returns the string ``quoted-printable`` if body_encoding is ``QP``,
				101	returns the string ``base64`` if body_encoding is ``BASE64``, and
				102	returns the string ``7bit`` otherwise.
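
A short sketch of the two string results, assuming the stock registry (where
``iso-8859-1`` bodies use quoted-printable and ``utf-8`` bodies use base64):

.. code-block:: python

   from email.charset import Charset

   # body_encoding is QP for iso-8859-1 and BASE64 for utf-8.
   qp_body = Charset('iso-8859-1').get_body_encoding()
   b64_body = Charset('utf-8').get_body_encoding()

   print(qp_body, b64_body)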
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	103
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	104	.. XXX to_splittable and from_splittable are not there anymore!
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	105
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	106	.. method to_splittable(s)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	107
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	108	Convert a possibly multibyte string to a safely splittable format. s is
				109	the string to split.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	110
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	111	Uses the input_codec to try and convert the string to Unicode, so it can
				112	be safely split on character boundaries (even for multibyte characters).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	113
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	114	Returns the string as-is if it isn't known how to convert s to Unicode
				115	with the input_charset.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	116
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	117	Characters that could not be converted to Unicode will be replaced with
				118	the Unicode replacement character ``'U+FFFD'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	119
				120
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	121	.. method from_splittable(ustr[, to_output])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	122
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	123	Convert a splittable string back into an encoded string. ustr is a
				124	Unicode string to "unsplit".
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	125
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	126	This method uses the proper codec to try and convert the string from
				127	Unicode back into an encoded format. Return the string as-is if it is not
				128	Unicode, or if it could not be converted from Unicode.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	129
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	130	Characters that could not be converted from Unicode will be replaced with
				131	an appropriate character (usually ``'?'``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	132
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	133	If to_output is ``True`` (the default), uses output_codec to convert
				134	to an encoded format. If to_output is ``False``, it uses input_codec.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	135
				136
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	137	.. method:: get_output_charset()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	138
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	139	Return the output character set.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	140
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	141	This is the output_charset attribute if that is not ``None``, otherwise
				142	it is input_charset.
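
For example (a sketch assuming the stock registry), a character set with an
output conversion reports the converted name, while one without a conversion
reports its own name:

.. code-block:: python

   from email.charset import Charset

   # euc-jp output is converted to iso-2022-jp.
   converted = Charset('euc-jp').get_output_charset()
   # iso-8859-1 has no output conversion.
   unchanged = Charset('iso-8859-1').get_output_charset()

   print(converted, unchanged)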
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	143
				144
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	145	.. method:: header_encode(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	146
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	147	Header-encode the string *string*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	148
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	149	The type of encoding (base64 or quoted-printable) will be based on the
				150	header_encoding attribute.
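
The sketch below header-encodes a short string and decodes it again with
:func:`email.header.decode_header`. It assumes the stock registry, in which
``iso-8859-1`` headers use quoted-printable and ``us-ascii`` headers are not
encoded.

.. code-block:: python

   from email.charset import Charset
   from email.header import decode_header

   latin1 = Charset('iso-8859-1')
   # The result is an RFC 2047 encoded word.
   encoded = latin1.header_encode('caf\u00e9')
   print(encoded)

   # decode_header() recovers the raw bytes and the charset label.
   raw, label = decode_header(encoded)[0]
   print(raw.decode(label))

   # us-ascii has no header encoding, so the text is returned unchanged.
   plain = Charset('us-ascii').header_encode('hello')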
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	151
				152
Georg Brandl	b30f330	2011-01-06 09:23:56 +0000	[diff] [blame]	153	.. method:: header_encode_lines(string, maxlengths)
				154
				155	Header-encode a string by converting it first to bytes.
				156
				157	This is similar to :meth:`header_encode` except that the string is fit
				158	into maximum line lengths as given by the argument maxlengths, which
				159	must be an iterator: each element returned from this iterator will provide
				160	the next maximum line length.
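
A minimal sketch of splitting a long header value, assuming the behavior of
current CPython, where the method returns a list of encoded lines. The limit of
30 characters is arbitrary and chosen only to force several lines.

.. code-block:: python

   from itertools import repeat

   from email.charset import Charset
   from email.header import decode_header

   latin1 = Charset('iso-8859-1')
   text = '\u00e9' * 30

   # repeat(30) is an iterator that yields the same maximum for every line.
   lines = latin1.header_encode_lines(text, repeat(30))

   # Decode each piece and stitch the original text back together.
   decoded = ''.join(
       decode_header(line)[0][0].decode('iso-8859-1')
       for line in lines if line
   )
   print(len(lines), decoded == text)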
				161
				162
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	163	.. method:: body_encode(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	164
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	165	Body-encode the string *string*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	166
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	167	The type of encoding (base64 or quoted-printable) will be based on the
				168	body_encoding attribute.
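
The following sketch body-encodes the same text with a base64 character set and
a quoted-printable one, then decodes the results with the standard
:mod:`base64` and :mod:`quopri` modules. It assumes the stock registry entries
for ``utf-8`` and ``iso-8859-1``.

.. code-block:: python

   import base64
   import quopri

   from email.charset import Charset

   text = 'caf\u00e9'

   # utf-8 bodies are base64-encoded in the stock registry.
   b64_text = Charset('utf-8').body_encode(text)
   # iso-8859-1 bodies are quoted-printable.
   qp_text = Charset('iso-8859-1').body_encode(text)

   print(base64.b64decode(b64_text).decode('utf-8'))
   print(quopri.decodestring(qp_text.encode('ascii')).decode('iso-8859-1'))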
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	169
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	170	The :class:`Charset` class also provides a number of methods to support
				171	standard operations and built-in functions.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	172
				173
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	174	.. method:: __str__()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	175
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	176	Returns input_charset as a string coerced to lower
				177	case. :meth:`__repr__` is an alias for :meth:`__str__`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	178
				179
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	180	.. method:: __eq__(other)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	181
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	182	This method allows you to compare two :class:`Charset` instances for
				183	equality.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	184
				185
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	186	.. method:: __ne__(other)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	187
Benjamin Peterson	e41251e	2008-04-25 01:59:09 +0000	[diff] [blame]	188	This method allows you to compare two :class:`Charset` instances for
				189	inequality.
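
A brief sketch of these operations, assuming the stock alias table: two
instances created from different spellings of the same character set compare
equal, and :func:`str` yields the normalized name.

.. code-block:: python

   from email.charset import Charset

   a = Charset('latin_1')
   b = Charset('ISO-8859-1')

   # str() gives the normalized, lower-case character set name.
   name = str(a)
   # Both spellings normalize to the same character set.
   same = (a == b)
   # A different character set compares unequal.
   different = (a != Charset('utf-8'))

   print(name, same, different)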
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	190
				191	The :mod:`email.charset` module also provides the following functions for adding
				192	new entries to the global character set, alias, and codec registries:
				193
				194
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	195	.. function:: add_charset(charset, header_enc=None, body_enc=None, output_charset=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	196
				197	Add character properties to the global registry.
				198
				199	charset is the input character set, and must be the canonical name of a
				200	character set.
				201
				202	Optional header_enc and body_enc are each one of ``Charset.QP`` for
				203	quoted-printable, ``Charset.BASE64`` for base64 encoding,
				204	``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
				205	or ``None`` for no encoding. ``SHORTEST`` is only valid for
				206	header_enc. The default is ``None`` for no encoding.
				207
				208	Optional output_charset is the character set that the output should be in.
				209	Conversions will proceed from input charset, to Unicode, to the output charset
				210	when the method :meth:`Charset.convert` is called. The default is to output in
				211	the same character set as the input.
				212
				213	Both input_charset and output_charset must have Unicode codec entries in the
				214	module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
				215	module does not know about. See the :mod:`codecs` module's documentation for
				216	more information.
				217
				218	The global character set registry is kept in the module global dictionary
				219	``CHARSETS``.
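
A minimal sketch of registering a character set. The name ``windows-1252`` is
chosen only for illustration, and the :exc:`ValueError` for a ``SHORTEST`` body
encoding reflects the behavior of current CPython rather than anything stated
above.

.. code-block:: python

   from email.charset import QP, SHORTEST, Charset, add_charset

   # Register windows-1252 as quoted-printable in headers and bodies,
   # with no output conversion.
   add_charset('windows-1252', header_enc=QP, body_enc=QP)

   cp1252 = Charset('windows-1252')
   print(cp1252.header_encoding == QP, cp1252.body_encoding == QP)

   # SHORTEST is only valid for header_enc, so this call is rejected.
   error = None
   try:
       add_charset('windows-1252', body_enc=SHORTEST)
   except ValueError as exc:
       error = exc
   print(type(error).__name__)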
				220
				221
				222	.. function:: add_alias(alias, canonical)
				223
				224	Add a character set alias. alias is the alias name, e.g. ``latin-1``.
				225	canonical is the character set's canonical name, e.g. ``iso-8859-1``.
				226
				227	The global charset alias registry is kept in the module global dictionary
				228	``ALIASES``.
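
For example, the sketch below registers ``my-latin1``, a made-up alias used
only for this illustration, and then creates a :class:`Charset` from it:

.. code-block:: python

   from email.charset import ALIASES, Charset, add_alias

   # 'my-latin1' is a hypothetical alias, not part of the stock table.
   add_alias('my-latin1', 'iso-8859-1')

   cs = Charset('my-latin1')
   print(cs.input_charset)
   print(ALIASES['my-latin1'])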
				229
				230
				231	.. function:: add_codec(charset, codecname)
				232
				233	Add a codec that maps characters in the given character set to and from Unicode.
				234
				235	charset is the canonical name of a character set. codecname is the name of a
Georg Brandl	f694518	2008-02-01 11:56:49 +0000	[diff] [blame]	236	Python codec, as appropriate for the first (encoding) argument to the :class:`str`'s
Serhiy Storchaka	e0f0cf4	2013-08-19 09:59:18 +0300	[diff] [blame]	237	:meth:`~str.encode` method.
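
A small sketch of adding a codec mapping. The pairing of the character set name
``windows-1252`` with Python's ``cp1252`` codec is chosen only for
illustration.

.. code-block:: python

   from email.charset import Charset, add_codec

   # Map the character set name windows-1252 to Python's cp1252 codec.
   add_codec('windows-1252', 'cp1252')

   cs = Charset('windows-1252')
   print(cs.input_codec, cs.output_codec)

   # The codec name is a valid encoding name for str.encode().
   data = 'caf\u00e9'.encode(cs.output_codec)
   print(data)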
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	238