:mod:`email.charset`: Representing character sets
-------------------------------------------------

.. module:: email.charset
   :synopsis: Character Sets


This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience functions for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.


.. class:: Charset(input_charset=DEFAULT_CHARSET)

   Map character sets to their email properties.

   This class provides information about the requirements imposed on email for a
   specific character set. It also provides convenience routines for converting
   between character sets, given the availability of the applicable codecs. Given
   a character set, it will do its best to provide information on how to use that
   character set in an email message in an RFC-compliant way.

   Certain character sets must be encoded with quoted-printable or base64 when used
   in email headers or bodies. Certain character sets must be converted outright,
   and are not allowed in email.

   Optional *input_charset* is as described below; it is always coerced to lower
   case. After being alias normalized it is also used as a lookup into the
   registry of character sets to find out the header encoding, body encoding, and
   output conversion codec to be used for the character set. For example, if
   *input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
   quoted-printable and no output conversion codec is necessary. If
   *input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
   will not be encoded, but output text will be converted from the ``euc-jp``
   character set to the ``iso-2022-jp`` character set.
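
   The two lookups described above can be sketched as follows. This is only an
   illustration that assumes the default registry has not been modified; the
   variable names are not part of the module.

   .. code-block:: python

      from email import charset

      latin1 = charset.Charset('iso-8859-1')
      # Headers and bodies both use quoted-printable.
      latin1_header_is_qp = latin1.header_encoding == charset.QP
      latin1_body_is_qp = latin1.body_encoding == charset.QP

      eucjp = charset.Charset('euc-jp')
      # Headers use base64, bodies are not encoded, output is converted.
      eucjp_header_is_b64 = eucjp.header_encoding == charset.BASE64
      eucjp_body = eucjp.body_encoding        # None
      eucjp_output = eucjp.output_charset     # 'iso-2022-jp'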

   :class:`Charset` instances have the following data attributes:

   .. attribute:: input_charset

      The initial character set specified. Common aliases are converted to
      their official email names (e.g. ``latin_1`` is converted to
      ``iso-8859-1``). Defaults to 7-bit ``us-ascii``.


   .. attribute:: header_encoding

      If the character set must be encoded before it can be used in an email
      header, this attribute will be set to ``charset.QP`` (for
      quoted-printable), ``charset.BASE64`` (for base64 encoding), or
      ``charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
      it will be ``None``.


   .. attribute:: body_encoding

      Same as *header_encoding*, but describes the encoding for the mail
      message's body, which may differ from the header encoding.
      ``charset.SHORTEST`` is not allowed for *body_encoding*.


   .. attribute:: output_charset

      Some character sets must be converted before they can be used in email
      headers or bodies. If the *input_charset* is one of them, this attribute
      will contain the name of the character set output will be converted to.
      Otherwise, it will be ``None``.


   .. attribute:: input_codec

      The name of the Python codec used to convert the *input_charset* to
      Unicode. If no conversion codec is necessary, this attribute will be
      ``None``.


   .. attribute:: output_codec

      The name of the Python codec used to convert Unicode to the
      *output_charset*. If no conversion codec is necessary, this attribute
      will have the same value as the *input_codec*.


   :class:`Charset` instances also have the following methods:

   .. method:: get_body_encoding()

      Return the content transfer encoding used for body encoding.

      This is either the string ``quoted-printable`` or ``base64`` depending on
      the encoding used, or it is a function, in which case you should call the
      function with a single argument, the Message object being encoded. The
      function should then set the :mailheader:`Content-Transfer-Encoding`
      header itself to whatever is appropriate.

      Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
      returns the string ``base64`` if *body_encoding* is ``BASE64``, and
      returns the string ``7bit`` otherwise.
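
      A short sketch of the two string results, assuming the default registry.
      Because the description above allows either a string or a function for
      the remaining case, the example treats that result as possibly callable
      rather than assuming one form.

      .. code-block:: python

         from email.charset import Charset

         qp_name = Charset('iso-8859-1').get_body_encoding()   # 'quoted-printable'
         b64_name = Charset('utf-8').get_body_encoding()       # 'base64'

         # For a charset without a body encoding the result may be a string
         # or a callable, so check before using it.
         other = Charset('us-ascii').get_body_encoding()
         other_is_callable = callable(other)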


   .. XXX to_splittable and from_splittable are not there anymore!

   .. to_splittable(s)

      Convert a possibly multibyte string to a safely splittable format. *s* is
      the string to split.

      Uses the *input_codec* to try and convert the string to Unicode, so it can
      be safely split on character boundaries (even for multibyte characters).

      Returns the string as-is if it isn't known how to convert *s* to Unicode
      with the *input_charset*.

      Characters that could not be converted to Unicode will be replaced with
      the Unicode replacement character ``'U+FFFD'``.


   .. from_splittable(ustr[, to_output])

      Convert a splittable string back into an encoded string. *ustr* is a
      Unicode string to "unsplit".

      This method uses the proper codec to try and convert the string from
      Unicode back into an encoded format. Return the string as-is if it is not
      Unicode, or if it could not be converted from Unicode.

      Characters that could not be converted from Unicode will be replaced with
      an appropriate character (usually ``'?'``).

      If *to_output* is ``True`` (the default), uses *output_codec* to convert
      to an encoded format. If *to_output* is ``False``, it uses *input_codec*.


   .. method:: get_output_charset()

      Return the output character set.

      This is the *output_charset* attribute if that is not ``None``, otherwise
      it is *input_charset*.
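
      For illustration, assuming the default registry, a charset that is
      converted on output reports the target character set, while one that is
      not converted reports its own name:

      .. code-block:: python

         from email.charset import Charset

         converted = Charset('euc-jp').get_output_charset()       # 'iso-2022-jp'
         unchanged = Charset('iso-8859-1').get_output_charset()   # 'iso-8859-1'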


   .. method:: header_encode(string)

      Header-encode the string *string*.

      The type of encoding (base64 or quoted-printable) will be based on the
      *header_encoding* attribute.
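
      A minimal sketch, assuming the default registry entry for ``iso-8859-1``
      (quoted-printable headers). It decodes the result with
      :func:`email.header.decode_header` only to confirm the round trip; the
      sample text is arbitrary.

      .. code-block:: python

         from email.charset import Charset
         from email.header import decode_header

         cs = Charset('iso-8859-1')
         encoded = cs.header_encode('Caf\u00e9')   # an RFC 2047 encoded word

         # Decode the encoded word again to check that nothing was lost.
         raw, name = decode_header(encoded)[0]
         decoded = raw.decode(name)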


   .. method:: header_encode_lines(string, maxlengths)

      Header-encode a *string* by converting it first to bytes.

      This is similar to :meth:`header_encode` except that the string is fit
      into maximum line lengths as given by the argument *maxlengths*, which
      must be an iterator: each element returned from this iterator will provide
      the next maximum line length.


   .. method:: body_encode(string)

      Body-encode the string *string*.

      The type of encoding (base64 or quoted-printable) will be based on the
      *body_encoding* attribute.
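
      As a sketch, assuming the default registry entry for ``utf-8`` (base64
      bodies), the encoded body can be decoded again with the standard
      :mod:`base64` module. The sample text is arbitrary.

      .. code-block:: python

         import base64
         from email.charset import Charset

         cs = Charset('utf-8')
         text = 'Gr\u00fc\u00dfe'
         encoded = cs.body_encode(text)

         # Decoding the base64 payload recovers the original text.
         decoded = base64.b64decode(encoded).decode('utf-8')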

   The :class:`Charset` class also provides a number of methods to support
   standard operations and built-in functions.


   .. method:: __str__()

      Returns *input_charset* as a string coerced to lower
      case. :meth:`__repr__` is an alias for :meth:`__str__`.


   .. method:: __eq__(other)

      This method allows you to compare two :class:`Charset` instances for
      equality.


   .. method:: __ne__(other)

      This method allows you to compare two :class:`Charset` instances for
      inequality.
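
   These operations might be used as in the following sketch, which relies on
   the ``latin_1`` alias mentioned earlier being present in the default alias
   registry:

   .. code-block:: python

      from email.charset import Charset

      # The alias is resolved and the name is lower-cased.
      name = str(Charset('LATIN_1'))                          # 'iso-8859-1'
      same = Charset('latin_1') == Charset('iso-8859-1')      # True
      different = Charset('utf-8') != Charset('us-ascii')     # True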

The :mod:`email.charset` module also provides the following functions for adding
new entries to the global character set, alias, and codec registries:


.. function:: add_charset(charset, header_enc=None, body_enc=None, output_charset=None)

   Add character properties to the global registry.

   *charset* is the input character set, and must be the canonical name of a
   character set.

   Optional *header_enc* and *body_enc* are each either ``charset.QP`` for
   quoted-printable, ``charset.BASE64`` for base64 encoding,
   ``charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
   or ``None`` for no encoding. ``SHORTEST`` is only valid for
   *header_enc*. The default is ``None`` for no encoding.

   Optional *output_charset* is the character set that the output should be in.
   Conversions will proceed from input charset, to Unicode, to the output charset
   when the method :meth:`Charset.convert` is called. The default is to output in
   the same character set as the input.

   Both *input_charset* and *output_charset* must have Unicode codec entries in the
   module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
   module does not know about. See the :mod:`codecs` module's documentation for
   more information.

   The global character set registry is kept in the module global dictionary
   ``CHARSETS``.
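
   A hedged sketch of registering properties for a character set. The choice
   of ``koi8-u`` and of its encodings is arbitrary and only for illustration.
   The call changes the process-wide registry, so it affects every
   :class:`Charset` created afterwards.

   .. code-block:: python

      from email import charset

      # Illustrative choice: quoted-printable headers, base64 bodies,
      # no output conversion.
      charset.add_charset('koi8-u', charset.QP, charset.BASE64)

      cs = charset.Charset('koi8-u')
      header_enc = cs.header_encoding   # charset.QP
      body_enc = cs.body_encoding       # charset.BASE64

      # SHORTEST is only valid for header_enc, so this call is rejected.
      try:
          charset.add_charset('koi8-u', charset.QP, charset.SHORTEST)
          rejected = False
      except ValueError:
          rejected = True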


.. function:: add_alias(alias, canonical)

   Add a character set alias. *alias* is the alias name, e.g. ``latin-1``.
   *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

   The global charset alias registry is kept in the module global dictionary
   ``ALIASES``.
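
   For example, with a made-up alias name (``x-example-latin`` is hypothetical
   and used only for illustration):

   .. code-block:: python

      from email import charset

      charset.add_alias('x-example-latin', 'iso-8859-1')

      # The alias is resolved to the canonical name when a Charset is created.
      resolved = charset.Charset('x-example-latin').input_charset
      registered = charset.ALIASES['x-example-latin']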


.. function:: add_codec(charset, codecname)

   Add a codec that maps characters in the given character set to and from Unicode.

   *charset* is the canonical name of a character set. *codecname* is the name of a
   Python codec, as appropriate for the *encoding* argument to the
   :class:`str` type's :meth:`~str.encode` method.
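
   A minimal sketch, assuming a made-up character set name
   (``x-example-cyrillic`` is hypothetical) mapped to the ``koi8_u`` codec
   that ships with Python:

   .. code-block:: python

      from email import charset

      charset.add_codec('x-example-cyrillic', 'koi8_u')

      cs = charset.Charset('x-example-cyrillic')
      input_codec = cs.input_codec     # 'koi8_u'
      output_codec = cs.output_codec   # 'koi8_u'

      # The registered name is usable as a str.encode() encoding.
      sample = '\u0456'.encode(input_codec)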