Blame - Doc/library/email.charset.rst - platform/external/python/cpython3

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`email`: Representing character sets
				2	-----------------------------------------
				3
				4	.. module:: email.charset
				5	:synopsis: Character Sets
				6
				7
				8	This module provides a class :class:`Charset` for representing character sets
				9	and character set conversions in email messages, as well as a character set
				10	registry and several convenience functions for manipulating this registry.
				11	Instances of :class:`Charset` are used in several other modules within the
				12	:mod:`email` package.
				13
				14	Import this class from the :mod:`email.charset` module.
				15
				16
				17	.. class:: Charset([input_charset])
				18
				19	Map character sets to their email properties.
				20
				21	This class provides information about the requirements imposed on email for a
				22	specific character set. It also provides convenience routines for converting
				23	between character sets, given the availability of the applicable codecs. Given
				24	a character set, it will do its best to provide information on how to use that
				25	character set in an email message in an RFC-compliant way.
				26
				27	Certain character sets must be encoded with quoted-printable or base64 when used
				28	in email headers or bodies. Certain character sets must be converted outright,
				29	and are not allowed in email.
				30
				31	Optional input_charset is as described below; it is always coerced to lower
				32	case. After being alias normalized it is also used as a lookup into the
				33	registry of character sets to find out the header encoding, body encoding, and
				34	output conversion codec to be used for the character set. For example, if
				35	input_charset is ``iso-8859-1``, then headers and bodies will be encoded using
				36	quoted-printable and no output conversion codec is necessary. If
				37	input_charset is ``euc-jp``, then headers will be encoded with base64, bodies
				38	will not be encoded, but output text will be converted from the ``euc-jp``
				39	character set to the ``iso-2022-jp`` character set.
				40
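The lookup described above can be checked interactively. This is a minimal sketch, assuming a Python 3 interpreter in which the encoding constants ``QP`` and ``BASE64`` are importable as module-level names from :mod:`email.charset`:

```python
from email.charset import Charset, QP, BASE64

# The name is lower-cased before the registry lookup.
latin1 = Charset('ISO-8859-1')
print(latin1.input_charset)             # iso-8859-1

# iso-8859-1: quoted-printable for both headers and bodies.
print(latin1.header_encoding == QP)     # True
print(latin1.body_encoding == QP)       # True

# euc-jp: base64 headers, unencoded bodies.
eucjp = Charset('euc-jp')
print(eucjp.header_encoding == BASE64)  # True
print(eucjp.body_encoding is None)      # True
```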
				41	:class:`Charset` instances have the following data attributes:
				42
				43
				44	.. data:: input_charset
				45
				46	The initial character set specified. Common aliases are converted to their
				47	official email names (e.g. ``latin_1`` is converted to ``iso-8859-1``).
				48	Defaults to 7-bit ``us-ascii``.
				49
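A short sketch of the alias normalization and the default value, assuming the stock alias table shipped with the module:

```python
from email.charset import Charset

# A common alias is replaced by the official email name.
aliased = Charset('latin_1')
print(aliased.input_charset)    # iso-8859-1

# With no argument the charset defaults to us-ascii.
default = Charset()
print(default.input_charset)    # us-ascii
```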
				50
				51	.. data:: header_encoding
				52
				53	If the character set must be encoded before it can be used in an email header,
				54	this attribute will be set to ``charset.QP`` (for quoted-printable),
				55	``charset.BASE64`` (for base64 encoding), or ``charset.SHORTEST`` for the
				56	shorter of QP or BASE64 encoding. Otherwise, it will be ``None``.
				57
				58
				59	.. data:: body_encoding
				60
				61	Same as header_encoding, but describes the encoding for the mail message's
				62	body, which may differ from the header encoding.
				63	``charset.SHORTEST`` is not allowed for body_encoding.
				64
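To illustrate that the header and body encodings can differ, the following sketch inspects ``utf-8`` and ``us-ascii``. It assumes the default registry entries and that ``SHORTEST`` and ``BASE64`` are module-level constants:

```python
from email.charset import Charset, SHORTEST, BASE64

utf8 = Charset('utf-8')
print(utf8.header_encoding == SHORTEST)   # True
print(utf8.body_encoding == BASE64)       # True

# us-ascii needs no encoding in either place.
ascii_cs = Charset('us-ascii')
print(ascii_cs.header_encoding, ascii_cs.body_encoding)   # None None
```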
				65
				66	.. data:: output_charset
				67
				68	Some character sets must be converted before they can be used in email headers
				69	or bodies. If the input_charset is one of them, this attribute will contain
				70	the name of the character set output will be converted to. Otherwise, it will
				71	be ``None``.
				72
				73
				74	.. data:: input_codec
				75
				76	The name of the Python codec used to convert the input_charset to Unicode. If
				77	no conversion codec is necessary, this attribute will be ``None``.
				78
				79
				80	.. data:: output_codec
				81
				82	The name of the Python codec used to convert Unicode to the output_charset.
				83	If no conversion codec is necessary, this attribute will have the same value as
				84	the input_codec.
				85
				86	:class:`Charset` instances also have the following methods:
				87
				88
				89	.. method:: Charset.get_body_encoding()
				90
				91	Return the content transfer encoding used for body encoding.
				92
				93	This is either the string ``quoted-printable`` or ``base64`` depending on the
				94	encoding used, or it is a function, in which case you should call the function
				95	with a single argument, the Message object being encoded. The function should
				96	then set the :mailheader:`Content-Transfer-Encoding` header itself to whatever
				97	is appropriate.
				98
				99	Returns the string ``quoted-printable`` if body_encoding is ``QP``, returns
				100	the string ``base64`` if body_encoding is ``BASE64``, and returns the string
				101	``7bit`` otherwise.
				102
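A hedged sketch of how a caller might use the return value. Because the description above allows either a string or a function for an unencoded body, the example handles both cases rather than assuming one:

```python
from email.charset import Charset

qp_name = Charset('iso-8859-1').get_body_encoding()
b64_name = Charset('utf-8').get_body_encoding()
print(qp_name)     # quoted-printable
print(b64_name)    # base64

# For an unencoded body the result may be a string or a callable.
enc = Charset('us-ascii').get_body_encoding()
if callable(enc):
    print('callable encoder:', enc.__name__)
else:
    print('transfer encoding:', enc)
```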
				103
				104	.. method:: Charset.convert(s)
				105
				106	Convert the string s from the input_codec to the output_codec.
				107
				108
				109	.. method:: Charset.to_splittable(s)
				110
				111	Convert a possibly multibyte string to a safely splittable format. s is the
				112	string to split.
				113
				114	Uses the input_codec to try to convert the string to Unicode, so it can be
				115	safely split on character boundaries (even for multibyte characters).
				116
				117	Returns the string as-is if it isn't known how to convert s to Unicode with
				118	the input_charset.
				119
				120	Characters that could not be converted to Unicode will be replaced with the
				121	Unicode replacement character ``'U+FFFD'``.
				122
				123
				124	.. method:: Charset.from_splittable(ustr[, to_output])
				125
				126	Convert a splittable string back into an encoded string. ustr is a Unicode
				127	string to "unsplit".
				128
				129	This method uses the proper codec to try to convert the string from Unicode
				130	back into an encoded format. Return the string as-is if it is not Unicode, or
				131	if it could not be converted from Unicode.
				132
				133	Characters that could not be converted from Unicode will be replaced with an
				134	appropriate character (usually ``'?'``).
				135
				136	If to_output is ``True`` (the default), uses output_codec to convert to an
				137	encoded format. If to_output is ``False``, it uses input_codec.
				138
				139
				140	.. method:: Charset.get_output_charset()
				141
				142	Return the output character set.
				143
				144	This is the output_charset attribute if that is not ``None``, otherwise it is
				145	input_charset.
				146
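For example, under the default registry (a sketch, not an exhaustive check):

```python
from email.charset import Charset

# euc-jp is converted on output, so the output charset differs.
converted = Charset('euc-jp').get_output_charset()
print(converted)    # iso-2022-jp

# iso-8859-1 needs no conversion, so the input charset is returned.
unchanged = Charset('iso-8859-1').get_output_charset()
print(unchanged)    # iso-8859-1
```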
				147
				148	.. method:: Charset.encoded_header_len(s)
				149
				150	Return the length of the string s after header encoding, accounting for
				151	the quoted-printable or base64 encoding in use.
				152
				153
				154	.. method:: Charset.header_encode(s[, convert])
				155
				156	Header-encode the string s.
				157
				158	If convert is ``True``, the string will be converted from the input charset to
				159	the output charset automatically. This is not useful for multibyte character
				160	sets, which have line length issues (multibyte characters must be split on a
				161	character, not a byte boundary); use the higher-level :class:`Header` class to
				162	deal with these issues (see :mod:`email.header`). convert defaults to
				163	``False``.
				164
				165	The type of encoding (base64 or quoted-printable) will be based on the
				166	header_encoding attribute.
				167
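A minimal sketch of header encoding, assuming Python 3 (where s is a ``str``). The method is called with s alone, so no charset conversion is requested. The round trip through :func:`email.header.decode_header` is only there to verify the result:

```python
from email.charset import Charset
from email.header import decode_header

cs = Charset('iso-8859-1')      # header_encoding is quoted-printable
encoded = cs.header_encode('café')
print(encoded)

# Decode the encoded word again to check the round trip.
raw, name = decode_header(encoded)[0]
print(raw.decode(name))
```

For headers that may need folding or that mix character sets, the :class:`Header` class remains the better tool.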
				168
				169	.. method:: Charset.body_encode(s[, convert])
				170
				171	Body-encode the string s.
				172
				173	If convert is ``True`` (the default), the string will be converted from the
				174	input charset to output charset automatically. Unlike :meth:`header_encode`,
				175	there are no issues with byte boundaries and multibyte charsets in email bodies,
				176	so this is usually pretty safe.
				177
				178	The type of encoding (base64 or quoted-printable) will be based on the
				179	body_encoding attribute.
				180
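A minimal sketch of body encoding, assuming Python 3 and the default ``utf-8`` registry entry (base64 bodies). The base64 decode step is only there to confirm that the text survives the round trip:

```python
import base64
from email.charset import Charset

cs = Charset('utf-8')           # body_encoding is base64 for utf-8
text = 'héllo wörld'
encoded = cs.body_encode(text)
print(encoded)

# Undo the transfer encoding to recover the original text.
decoded = base64.b64decode(encoded).decode('utf-8')
print(decoded == text)
```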
				181	The :class:`Charset` class also provides a number of methods to support standard
				182	operations and built-in functions.
				183
				184
				185	.. method:: Charset.__str__()
				186
				187	Returns input_charset as a string coerced to lower case. :meth:`__repr__` is
				188	an alias for :meth:`__str__`.
				189
				190
				191	.. method:: Charset.__eq__(other)
				192
				193	This method allows you to compare two :class:`Charset` instances for equality.
				194
				195
				196	.. method:: Charset.__ne__(other)
				197
				198	This method allows you to compare two :class:`Charset` instances for inequality.
				199
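A short sketch of these operations, assuming the default alias table:

```python
from email.charset import Charset

a = Charset('latin_1')       # alias of iso-8859-1
b = Charset('ISO-8859-1')    # same set, different spelling
c = Charset('utf-8')

print(str(a))    # iso-8859-1
print(a == b)    # True
print(a != c)    # True
```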
				200	The :mod:`email.charset` module also provides the following functions for adding
				201	new entries to the global character set, alias, and codec registries:
				202
				203
				204	.. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])
				205
				206	Add character properties to the global registry.
				207
				208	charset is the input character set, and must be the canonical name of a
				209	character set.
				210
				211	Optional header_enc and body_enc are each either ``charset.QP`` for
				212	quoted-printable, ``charset.BASE64`` for base64 encoding,
				213	``charset.SHORTEST`` for the shorter of quoted-printable or base64 encoding,
				214	or ``None`` for no encoding. ``SHORTEST`` is only valid for
				215	header_enc. The default is ``None`` for no encoding.
				216
				217	Optional output_charset is the character set that the output should be in.
				218	Conversions will proceed from input charset, to Unicode, to the output charset
				219	when the method :meth:`Charset.convert` is called. The default is to output in
				220	the same character set as the input.
				221
				222	Both charset (the input character set) and output_charset must have Unicode codec entries in the
				223	module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
				224	module does not know about. See the :mod:`codecs` module's documentation for
				225	more information.
				226
				227	The global character set registry is kept in the module global dictionary
				228	``CHARSETS``.
				229
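A hedged sketch of registering a character set. The name ``x-demo`` is made up for illustration, and the entry is removed again at the end so the global registry is left as it was:

```python
from email.charset import (
    Charset, add_charset, CHARSETS, QP, BASE64, SHORTEST,
)

# 'x-demo' is a hypothetical charset name used only for this example.
add_charset('x-demo', header_enc=QP, body_enc=BASE64, output_charset='utf-8')

demo = Charset('x-demo')
print(demo.header_encoding == QP)     # True
print(demo.body_encoding == BASE64)   # True
print(demo.get_output_charset())      # utf-8

# SHORTEST is rejected for the body encoding.
rejected = False
try:
    add_charset('x-demo', body_enc=SHORTEST)
except ValueError as exc:
    rejected = True
    print('rejected:', exc)

# Remove the demo entry so the global registry is left unchanged.
CHARSETS.pop('x-demo', None)
```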
				230
				231	.. function:: add_alias(alias, canonical)
				232
				233	Add a character set alias. alias is the alias name, e.g. ``latin-1``.
				234	canonical is the character set's canonical name, e.g. ``iso-8859-1``.
				235
				236	The global charset alias registry is kept in the module global dictionary
				237	``ALIASES``.
				238
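A small sketch of adding an alias. ``x-demo-latin`` is a made-up alias name. It is registered in lower case because the charset name is coerced to lower case before the alias lookup:

```python
from email.charset import Charset, add_alias, ALIASES

# 'x-demo-latin' is a hypothetical alias used only for this example.
add_alias('x-demo-latin', 'iso-8859-1')

resolved = Charset('X-Demo-Latin').input_charset
print(resolved)    # iso-8859-1

# Clean up the global alias registry.
ALIASES.pop('x-demo-latin', None)
```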
				239
				240	.. function:: add_codec(charset, codecname)
				241
				242	Add a codec that maps characters in the given character set to and from Unicode.
				243
				244	charset is the canonical name of a character set. codecname is the name of a
				245	Python codec, as appropriate for the second argument to the :func:`unicode`
				246	built-in, or to the :meth:`encode` method of a Unicode string.
				247
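A hedged sketch of registering a codec. ``x-demo-cyrillic`` is a made-up charset name, while ``koi8_r`` is a standard Python codec. The cleanup step assumes the mapping is stored in the module-level dictionary ``CODEC_MAP``, which the text above does not name:

```python
from email.charset import Charset, add_codec, CODEC_MAP

# 'x-demo-cyrillic' is a hypothetical charset name for this example.
add_codec('x-demo-cyrillic', 'koi8_r')

cs = Charset('x-demo-cyrillic')
print(cs.input_codec)     # koi8_r
print(cs.output_codec)    # koi8_r

# Assumption: CODEC_MAP is the dictionary behind add_codec().
CODEC_MAP.pop('x-demo-cyrillic', None)
```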