:mod:`email.charset`: Representing character sets
-------------------------------------------------

.. module:: email.charset
   :synopsis: Character Sets

Source code: :source:`Lib/email/charset.py`

--------------

This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience functions for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.
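
As a minimal sketch of what a :class:`Charset` instance carries, the following
looks up two of the character sets discussed in this document. The helper
``describe`` is a hypothetical name introduced only for this illustration, and
the encoding constants are assumed to be reachable as module-level names such
as ``email.charset.QP``.

```python
from email import charset
from email.charset import Charset


def describe(name):
    """Return the registry-derived properties of a character set."""
    cs = Charset(name)
    return {
        "input_charset": cs.input_charset,
        "header_encoding": cs.header_encoding,
        "body_encoding": cs.body_encoding,
        "output_charset": cs.get_output_charset(),
    }


# An alias is normalized to its canonical email name.
latin = describe("latin_1")
# The name is coerced to lower case before the registry lookup.
eucjp = describe("EUC-JP")

print(latin["input_charset"])                      # iso-8859-1
print(latin["header_encoding"] == charset.QP)      # True
print(eucjp["header_encoding"] == charset.BASE64)  # True
print(eucjp["body_encoding"])                      # None
print(eucjp["output_charset"])                     # iso-2022-jp
```

The sketch reads ``get_output_charset()`` rather than the raw attribute so that
the result is a usable character set name whether or not a conversion applies.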
				18
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	19
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	20	.. class:: Charset(input_charset=DEFAULT_CHARSET)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	21
				22	Map character sets to their email properties.
				23
				24	This class provides information about the requirements imposed on email for a
				25	specific character set. It also provides convenience routines for converting
				26	between character sets, given the availability of the applicable codecs. Given
				27	a character set, it will do its best to provide information on how to use that
				28	character set in an email message in an RFC-compliant way.
				29
				30	Certain character sets must be encoded with quoted-printable or base64 when used
				31	in email headers or bodies. Certain character sets must be converted outright,
				32	and are not allowed in email.
				33
				34	Optional input_charset is as described below; it is always coerced to lower
				35	case. After being alias normalized it is also used as a lookup into the
				36	registry of character sets to find out the header encoding, body encoding, and
				37	output conversion codec to be used for the character set. For example, if
				38	input_charset is ``iso-8859-1``, then headers and bodies will be encoded using
				39	quoted-printable and no output conversion codec is necessary. If
				40	input_charset is ``euc-jp``, then headers will be encoded with base64, bodies
				41	will not be encoded, but output text will be converted from the ``euc-jp``
				42	character set to the ``iso-2022-jp`` character set.

   :class:`Charset` instances have the following data attributes:

   .. attribute:: input_charset

      The initial character set specified. Common aliases are converted to
      their official email names (e.g. ``latin_1`` is converted to
      ``iso-8859-1``). Defaults to 7-bit ``us-ascii``.


   .. attribute:: header_encoding

      If the character set must be encoded before it can be used in an email
      header, this attribute will be set to ``charset.QP`` (for
      quoted-printable), ``charset.BASE64`` (for base64 encoding), or
      ``charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
      it will be ``None``.


   .. attribute:: body_encoding

      Same as *header_encoding*, but describes the encoding for the mail
      message's body, which may differ from the header encoding.
      ``charset.SHORTEST`` is not allowed for *body_encoding*.


   .. attribute:: output_charset

      Some character sets must be converted before they can be used in email
      headers or bodies. If the *input_charset* is one of them, this attribute
      will contain the name of the character set that output will be converted
      to. Otherwise, it will be ``None``.


   .. attribute:: input_codec

      The name of the Python codec used to convert the *input_charset* to
      Unicode. If no conversion codec is necessary, this attribute will be
      ``None``.


   .. attribute:: output_codec

      The name of the Python codec used to convert Unicode to the
      *output_charset*. If no conversion codec is necessary, this attribute
      will have the same value as the *input_codec*.


   :class:`Charset` instances also have the following methods:

   .. method:: get_body_encoding()

      Return the content transfer encoding used for body encoding.

      This is either the string ``quoted-printable`` or ``base64`` depending on
      the encoding used, or it is a function, in which case you should call the
      function with a single argument, the Message object being encoded. The
      function should then set the :mailheader:`Content-Transfer-Encoding`
      header itself to whatever is appropriate.

      Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
      returns the string ``base64`` if *body_encoding* is ``BASE64``, and
      returns the string ``7bit`` otherwise.


   .. XXX to_splittable and from_splittable are not there anymore!

   .. to_splittable(s)

      Convert a possibly multibyte string to a safely splittable format. *s* is
      the string to split.

      Uses the *input_codec* to try and convert the string to Unicode, so it can
      be safely split on character boundaries (even for multibyte characters).

      Returns the string as-is if it isn't known how to convert *s* to Unicode
      with the *input_charset*.

      Characters that could not be converted to Unicode will be replaced with
      the Unicode replacement character ``'U+FFFD'``.


   .. from_splittable(ustr[, to_output])

      Convert a splittable string back into an encoded string. *ustr* is a
      Unicode string to "unsplit".

      This method uses the proper codec to try and convert the string from
      Unicode back into an encoded format. Return the string as-is if it is not
      Unicode, or if it could not be converted from Unicode.

      Characters that could not be converted from Unicode will be replaced with
      an appropriate character (usually ``'?'``).

      If *to_output* is ``True`` (the default), uses *output_codec* to convert
      to an encoded format. If *to_output* is ``False``, it uses *input_codec*.


   .. method:: get_output_charset()

      Return the output character set.

      This is the *output_charset* attribute if that is not ``None``, otherwise
      it is *input_charset*.


   .. method:: header_encode(string)

      Header-encode the string *string*.

      The type of encoding (base64 or quoted-printable) will be based on the
      *header_encoding* attribute.


   .. method:: header_encode_lines(string, maxlengths)

      Header-encode a *string* by converting it first to bytes.

      This is similar to :meth:`header_encode` except that the string is fit
      into maximum line lengths as given by the argument *maxlengths*, which
      must be an iterator: each element returned from this iterator will provide
      the next maximum line length.


   .. method:: body_encode(string)

      Body-encode the string *string*.

      The type of encoding (base64 or quoted-printable) will be based on the
      *body_encoding* attribute.

   The :class:`Charset` class also provides a number of methods to support
   standard operations and built-in functions.


   .. method:: __str__()

      Returns *input_charset* as a string coerced to lower
      case. :meth:`__repr__` is an alias for :meth:`__str__`.


   .. method:: __eq__(other)

      This method allows you to compare two :class:`Charset` instances for
      equality.


   .. method:: __ne__(other)

      This method allows you to compare two :class:`Charset` instances for
      inequality.
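
The :class:`Charset` methods described above can be exercised with a short
sketch such as the following. The sample text ``"Café"`` and all variable names
are illustrative assumptions. Because the description of
``get_body_encoding()`` mentions both a function and the string ``7bit`` for
character sets without a body encoding, the sketch accepts either.

```python
import base64
import quopri
from email.charset import Charset

latin1 = Charset("iso-8859-1")
utf8 = Charset("utf-8")

# Content transfer encodings reported for message bodies.
latin1_cte = latin1.get_body_encoding()   # 'quoted-printable'
utf8_cte = utf8.get_body_encoding()       # 'base64'

# No body encoding is registered for us-ascii; accept a callable or '7bit'.
ascii_cte = Charset("us-ascii").get_body_encoding()
ascii_ok = callable(ascii_cte) or ascii_cte == "7bit"

# header_encode() wraps the text in an RFC 2047 encoded word.
encoded_header = latin1.header_encode("Café")
print(encoded_header)                     # =?iso-8859-1?q?Caf=E9?=

# body_encode() applies the body encoding; decoding the result with the
# standard library recovers the bytes in the output character set.
qp_body = latin1.body_encode("Café")
b64_body = utf8.body_encode("Café")
print(quopri.decodestring(qp_body.encode("ascii")))   # b'Caf\xe9'
print(base64.b64decode(b64_body))                     # b'Caf\xc3\xa9'

# Comparison and str() operate on the normalized input_charset.
same = Charset("latin_1") == Charset("ISO-8859-1")    # True
different = latin1 != utf8                            # True
name = str(Charset("LATIN_1"))                        # 'iso-8859-1'
```

Decoding the encoded bodies again, rather than comparing against literal
strings, keeps the sketch independent of line-wrapping details.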
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	194
				195	The :mod:`email.charset` module also provides the following functions for adding
				196	new entries to the global character set, alias, and codec registries:
				197
				198
Georg Brandl	3f076d8	2009-05-17 11:28:33 +0000	[diff] [blame]	199	.. function:: add_charset(charset, header_enc=None, body_enc=None, output_charset=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	200
				201	Add character properties to the global registry.
				202
				203	charset is the input character set, and must be the canonical name of a
				204	character set.
				205
				206	Optional header_enc and body_enc is either ``Charset.QP`` for
				207	quoted-printable, ``Charset.BASE64`` for base64 encoding,
				208	``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
				209	or ``None`` for no encoding. ``SHORTEST`` is only valid for
				210	header_enc. The default is ``None`` for no encoding.
				211
				212	Optional output_charset is the character set that the output should be in.
				213	Conversions will proceed from input charset, to Unicode, to the output charset
				214	when the method :meth:`Charset.convert` is called. The default is to output in
				215	the same character set as the input.
				216
				217	Both input_charset and output_charset must have Unicode codec entries in the
				218	module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
				219	module does not know about. See the :mod:`codecs` module's documentation for
				220	more information.
				221
				222	The global character set registry is kept in the module global dictionary
				223	``CHARSETS``.
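
A minimal sketch of registering a character set follows. The name
``x-example-latin`` is made up for this illustration and is not a real
character set. The entry is removed again at the end so that the global
registry is left as it was.

```python
from email import charset
from email.charset import Charset

# Hypothetical character set name, used only for this sketch.
NAME = "x-example-latin"

# Register quoted-printable for headers and base64 for bodies.
charset.add_charset(NAME, header_enc=charset.QP, body_enc=charset.BASE64)

# The new entry is visible in the module-level CHARSETS dictionary ...
registered = charset.CHARSETS[NAME]

# ... and is picked up by Charset instances created afterwards.
cs = Charset(NAME)
print(cs.header_encoding == charset.QP)   # True
print(cs.get_body_encoding())             # base64

# Clean up so the illustrative entry does not leak into other code.
del charset.CHARSETS[NAME]
```

Since the registry is global to the process, a registration like this affects
every :class:`Charset` created later, which is why the sketch undoes it.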


.. function:: add_alias(alias, canonical)

   Add a character set alias. *alias* is the alias name, e.g. ``latin-1``.
   *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

   The global charset alias registry is kept in the module global dictionary
   ``ALIASES``.
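
The following sketch adds an alias and checks that it resolves to the canonical
name. The alias ``x-example-l1`` is a hypothetical name chosen for this
illustration. It is written in lower case because the character set name is
coerced to lower case before the alias lookup.

```python
from email import charset
from email.charset import Charset

# Hypothetical alias, used only for this sketch.
ALIAS = "x-example-l1"

charset.add_alias(ALIAS, "iso-8859-1")

# The alias is normalized to the canonical name on construction.
resolved = Charset(ALIAS).input_charset
print(resolved)                           # iso-8859-1

# Clean up so the global alias registry is left unchanged.
del charset.ALIASES[ALIAS]
```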


.. function:: add_codec(charset, codecname)

   Add a codec that maps characters in the given character set to and from Unicode.

   *charset* is the canonical name of a character set. *codecname* is the name of a
   Python codec, as appropriate for the *encoding* argument to the :class:`str`'s
   :meth:`~str.encode` method.
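
As a sketch of how a codec mapping is picked up, the following maps a
hypothetical character set name, ``x-example-cyrillic``, to Python's existing
``koi8_r`` codec. Both the name and the choice of codec are assumptions made
for illustration.

```python
from email import charset
from email.charset import Charset

# Hypothetical character set name, used only for this sketch.
NAME = "x-example-cyrillic"

# Map the name to a codec that Python already provides.
charset.add_codec(NAME, "koi8_r")

cs = Charset(NAME)
print(cs.input_codec)                     # koi8_r
print(cs.output_codec)                    # koi8_r

# The codec name is suitable for str.encode().
encoded = "привет".encode(cs.output_codec)

# Clean up so the global codec mapping is left unchanged.
del charset.CODEC_MAP[NAME]
```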