:mod:`email`: Representing character sets
-----------------------------------------

.. module:: email.charset
   :synopsis: Character Sets


This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience functions for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.


.. class:: Charset([input_charset])

   Map character sets to their email properties.

   This class provides information about the requirements imposed on email for a
   specific character set. It also provides convenience routines for converting
   between character sets, given the availability of the applicable codecs. Given
   a character set, it will do its best to provide information on how to use that
   character set in an email message in an RFC-compliant way.

   Certain character sets must be encoded with quoted-printable or base64 when used
   in email headers or bodies. Other character sets are not allowed in email as-is
   and must be converted outright.

   Optional *input_charset* is as described below; it is always coerced to lower
   case. After being alias normalized it is also used as a lookup into the
   registry of character sets to find out the header encoding, body encoding, and
   output conversion codec to be used for the character set. For example, if
   *input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
   quoted-printable and no output conversion codec is necessary. If
   *input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
   will not be encoded, but output text will be converted from the ``euc-jp``
   character set to the ``iso-2022-jp`` character set.

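The two cases described for :class:`Charset` can be checked with a short
sketch. It assumes a Python 3 interpreter and that the encoding constants
``QP`` and ``BASE64`` are read from the :mod:`email.charset` module namespace;
the variable names are illustrative only.

```python
from email import charset
from email.charset import Charset

latin1 = Charset('iso-8859-1')
# Headers and bodies both use quoted-printable.
print(latin1.header_encoding == charset.QP)
print(latin1.body_encoding == charset.QP)

eucjp = Charset('euc-jp')
# Headers use base64, bodies are left unencoded, output is converted.
print(eucjp.header_encoding == charset.BASE64)
print(eucjp.body_encoding is None)
print(eucjp.output_charset)
```
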
:class:`Charset` instances have the following data attributes:


.. data:: input_charset

   The initial character set specified. Common aliases are converted to their
   *official* email names (e.g. ``latin_1`` is converted to ``iso-8859-1``).
   Defaults to 7-bit ``us-ascii``.

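A minimal sketch of how *input_charset* is normalized, assuming the built-in
alias registry has not been modified:

```python
from email.charset import Charset

# Aliases are resolved to the canonical email name.
aliased = Charset('latin_1')
print(aliased.input_charset)

# Mixed-case names are coerced to lower case first.
upper = Charset('ISO-8859-1')
print(upper.input_charset)

# With no argument, the character set defaults to us-ascii.
default = Charset()
print(default.input_charset)
```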

.. data:: header_encoding

   If the character set must be encoded before it can be used in an email header,
   this attribute will be set to ``charset.QP`` (for quoted-printable),
   ``charset.BASE64`` (for base64 encoding), or ``charset.SHORTEST`` for the
   shortest of QP or BASE64 encoding. Otherwise, it will be ``None``.


.. data:: body_encoding

   Same as *header_encoding*, but describes the encoding for the mail message's
   body, which may differ from the header encoding.
   ``charset.SHORTEST`` is not allowed for *body_encoding*.

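To see how header and body encodings can differ, the following sketch prints
both for a few character sets. ``NAMES`` and ``describe`` are hypothetical
helpers written for this illustration, and the values shown depend on the
built-in registry being unmodified.

```python
from email import charset
from email.charset import Charset

# Hypothetical lookup table for printing the constants by name.
NAMES = {charset.QP: 'QP', charset.BASE64: 'BASE64',
         charset.SHORTEST: 'SHORTEST', None: 'None'}


def describe(name):
    """Return (header encoding, body encoding) names for a charset."""
    cs = Charset(name)
    return NAMES[cs.header_encoding], NAMES[cs.body_encoding]


for name in ('us-ascii', 'iso-8859-1', 'utf-8', 'euc-jp'):
    print(name, describe(name))
```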

.. data:: output_charset

   Some character sets must be converted before they can be used in email headers
   or bodies. If the *input_charset* is one of them, this attribute will contain
   the name of the character set that output will be converted to. Otherwise, it
   will be ``None``.


.. data:: input_codec

   The name of the Python codec used to convert the *input_charset* to Unicode. If
   no conversion codec is necessary, this attribute will be ``None``.


.. data:: output_codec

   The name of the Python codec used to convert Unicode to the *output_charset*.
   If no conversion codec is necessary, this attribute will have the same value as
   the *input_codec*.

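The codec attributes can be inspected directly. This sketch assumes the
built-in codec mapping and uses :func:`codecs.lookup` only to show that the
registered name resolves to a real Python codec.

```python
import codecs
from email.charset import Charset

gb = Charset('gb2312')
# The registered codec name resolves to Python's gb2312 codec.
gb_codec = codecs.lookup(gb.input_codec).name
print(gb_codec)

jp = Charset('euc-jp')
# Input is read as euc-jp, output is written as iso-2022-jp.
print(jp.input_codec, jp.output_codec)

ascii_cs = Charset('us-ascii')
# No conversion codec is used for us-ascii.
print(ascii_cs.input_codec)
```
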
:class:`Charset` instances also have the following methods:


.. method:: Charset.get_body_encoding()

   Return the content transfer encoding used for body encoding.

   This is either the string ``quoted-printable`` or ``base64`` depending on the
   encoding used, or it is a function, in which case you should call the function
   with a single argument, the Message object being encoded. The function should
   then set the :mailheader:`Content-Transfer-Encoding` header itself to whatever
   is appropriate.

   Returns the string ``quoted-printable`` if *body_encoding* is ``QP``, returns
   the string ``base64`` if *body_encoding* is ``BASE64``, and returns the string
   ``7bit`` otherwise.

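Because the return value of :meth:`get_body_encoding` may be a string or a
function, a caller can handle both cases as sketched below.
``apply_body_encoding`` is a hypothetical helper; it only sets the
:mailheader:`Content-Transfer-Encoding` header and does not encode the payload.

```python
from email.charset import Charset
from email.message import Message


def apply_body_encoding(msg, cs):
    """Set Content-Transfer-Encoding on msg according to cs."""
    cte = cs.get_body_encoding()
    if callable(cte):
        # A function: let it set the header on the message itself.
        cte(msg)
    else:
        # A string such as 'quoted-printable' or 'base64'.
        msg['Content-Transfer-Encoding'] = cte


msg = Message()
msg.set_payload('plain ASCII text')
apply_body_encoding(msg, Charset('us-ascii'))
print(msg['Content-Transfer-Encoding'])

qp_name = Charset('iso-8859-1').get_body_encoding()
b64_name = Charset('utf-8').get_body_encoding()
print(qp_name, b64_name)
```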

.. method:: Charset.convert(s)

   Convert the string *s* from the *input_codec* to the *output_codec*.

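The conversion amounts to decoding with one codec and encoding with the other.
The sketch below does not call :meth:`convert`; ``convert_sketch`` is a
hypothetical stand-in that applies the same idea to a bytes object, assuming
Python 3's bytes and str types.

```python
from email.charset import Charset


def convert_sketch(cs, data):
    """Re-encode bytes from cs.input_codec to cs.output_codec."""
    if cs.input_codec == cs.output_codec:
        return data  # nothing to convert
    return data.decode(cs.input_codec).encode(cs.output_codec)


cs = Charset('euc-jp')
text = '\u3053\u3093\u306b\u3061\u306f'  # "konnichiwa" in hiragana
original = text.encode('euc-jp')
converted = convert_sketch(cs, original)

# The converted bytes are iso-2022-jp and decode to the same text.
print(converted.decode('iso-2022-jp') == text)
```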

.. method:: Charset.to_splittable(s)

   Convert a possibly multibyte string to a safely splittable format. *s* is the
   string to split.

   Uses the *input_codec* to try to convert the string to Unicode, so it can be
   safely split on character boundaries (even for multibyte characters).

   Returns the string as-is if it isn't known how to convert *s* to Unicode with
   the *input_charset*.

   Characters that could not be converted to Unicode will be replaced with the
   Unicode replacement character ``'U+FFFD'``.


.. method:: Charset.from_splittable(ustr[, to_output])

   Convert a splittable string back into an encoded string. *ustr* is a Unicode
   string to "unsplit".

   This method uses the proper codec to try to convert the string from Unicode
   back into an encoded format. Returns the string as-is if it is not Unicode, or
   if it could not be converted from Unicode.

   Characters that could not be converted from Unicode will be replaced with an
   appropriate character (usually ``'?'``).

   If *to_output* is ``True`` (the default), uses *output_codec* to convert to an
   encoded format. If *to_output* is ``False``, it uses *input_codec*.

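The idea behind the splittable round trip can be illustrated with the standard
codec machinery alone. The two functions below are hypothetical stand-ins, not
the methods themselves, and assume Python 3's bytes and str types: decode so
that splitting happens between characters, then encode each piece again.

```python
def to_splittable_sketch(data, codec):
    """Decode bytes so they can be split on character boundaries."""
    try:
        return data.decode(codec, 'replace')
    except LookupError:
        return data  # unknown codec: return the input as-is


def from_splittable_sketch(text, codec):
    """Encode text back to bytes, substituting unencodable characters."""
    if not isinstance(text, str):
        return text
    try:
        return text.encode(codec, 'replace')
    except LookupError:
        return text


raw = 'na\u00efve caf\u00e9'.encode('utf-8')
text = to_splittable_sketch(raw, 'utf-8')

# Split after the third character; slicing raw[:3] would cut a character in half.
head_bytes = from_splittable_sketch(text[:3], 'utf-8')
tail_bytes = from_splittable_sketch(text[3:], 'utf-8')
print(head_bytes, tail_bytes)

replaced = to_splittable_sketch(b'\xff', 'utf-8')      # undecodable byte
fallback = from_splittable_sketch('\u00e9', 'ascii')   # unencodable character
print(replaced == '\ufffd', fallback)
```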

.. method:: Charset.get_output_charset()

   Return the output character set.

   This is the *output_charset* attribute if that is not ``None``, otherwise it is
   *input_charset*.

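A short sketch of both outcomes, assuming the built-in registry:

```python
from email.charset import Charset

# euc-jp is registered with an output conversion.
converted_name = Charset('euc-jp').get_output_charset()
print(converted_name)

# iso-8859-1 has no output conversion, so the input name is reported.
same_name = Charset('iso-8859-1').get_output_charset()
print(same_name)
```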

.. method:: Charset.encoded_header_len(s)

   Return the length of the encoded form of the header string *s*, properly
   calculating for quoted-printable or base64 encoding.

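To give a feel for why the encoded length differs from the raw length, the
sketch below measures an encoded header directly. It does not call
:meth:`encoded_header_len`; it performs the encoding with
:meth:`header_encode` and counts characters, which is an assumption of this
example rather than how the method computes its result.

```python
from email.charset import Charset

cs = Charset('iso-8859-1')
text = 'Caf\u00e9'
encoded = cs.header_encode(text)

# Strip the "=?charset?q?" prefix and "?=" suffix to isolate the payload.
payload = encoded[len('=?iso-8859-1?q?'):-len('?=')]
print(len(text), len(payload), len(encoded))
```

The difference between the last two numbers is the length of the character set
name plus the seven delimiter characters of an encoded word.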

.. method:: Charset.header_encode(s[, convert])

   Header-encode the string *s*.

   If *convert* is ``True``, the string will be converted from the input charset to
   the output charset automatically. This is not useful for multibyte character
   sets, which have line length issues (multibyte characters must be split on a
   character, not a byte boundary); use the higher-level :class:`Header` class to
   deal with these issues (see :mod:`email.header`). *convert* defaults to
   ``False``.

   The type of encoding (base64 or quoted-printable) will be based on the
   *header_encoding* attribute.

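The following sketch shows header encoding for three character sets. It passes
only the string argument and assumes Python 3, where that string is Unicode
text; :func:`email.header.decode_header` is used to check the round trip.

```python
from email.charset import Charset
from email.header import decode_header

# iso-8859-1 headers are encoded with quoted-printable.
qp_word = Charset('iso-8859-1').header_encode('Caf\u00e9')
print(qp_word)

# iso-2022-jp headers are encoded with base64.
japanese = '\u3053\u3093\u306b\u3061\u306f'
b64_word = Charset('iso-2022-jp').header_encode(japanese)
print(b64_word)

# us-ascii needs no header encoding, so the text comes back unchanged.
plain = Charset('us-ascii').header_encode('Hello')
print(plain)

# Decoding the base64 encoded word recovers the original text.
raw, name = decode_header(b64_word)[0]
print(raw.decode(name) == japanese)
```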

.. method:: Charset.body_encode(s[, convert])

   Body-encode the string *s*.

   If *convert* is ``True`` (the default), the string will be converted from the
   input charset to the output charset automatically. Unlike :meth:`header_encode`,
   there are no issues with byte boundaries and multibyte charsets in email bodies,
   so this is usually safe.

   The type of encoding (base64 or quoted-printable) will be based on the
   *body_encoding* attribute.

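A minimal sketch of body encoding for a quoted-printable and a base64
character set. It passes only the string argument and assumes Python 3; the
:mod:`quopri` and :mod:`base64` modules are used here only to verify that the
result decodes back to the original text.

```python
import base64
import quopri
from email.charset import Charset

text = 'Caf\u00e9 au lait'

qp_body = Charset('iso-8859-1').body_encode(text)   # quoted-printable
b64_body = Charset('utf-8').body_encode(text)       # base64
print(qp_body)
print(b64_body)

# Decoding with the standard library recovers the original text.
qp_back = quopri.decodestring(qp_body.encode('ascii')).decode('iso-8859-1')
b64_back = base64.b64decode(b64_body).decode('utf-8')
print(qp_back == text, b64_back == text)
```
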
The :class:`Charset` class also provides a number of methods to support standard
operations and built-in functions.


.. method:: Charset.__str__()

   Returns *input_charset* as a string coerced to lower case. :meth:`__repr__` is
   an alias for :meth:`__str__`.


.. method:: Charset.__eq__(other)

   This method allows you to compare two :class:`Charset` instances for equality.


.. method:: Charset.__ne__(other)

   This method allows you to compare two :class:`Charset` instances for inequality.

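These methods can be exercised as follows; the sketch assumes the built-in
alias registry, in which ``latin_1`` maps to ``iso-8859-1``.

```python
from email.charset import Charset

a = Charset('latin_1')
b = Charset('ISO-8859-1')

name = str(a)                   # canonical lower-case name
same = (a == b)                 # both resolve to the same character set
differ = (a != Charset('utf-8'))
print(name, same, differ)
```
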
The :mod:`email.charset` module also provides the following functions for adding
new entries to the global character set, alias, and codec registries:


.. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])

   Add character set properties to the global registry.

   *charset* is the input character set, and must be the canonical name of a
   character set.

   Optional *header_enc* and *body_enc* are each either ``charset.QP`` for
   quoted-printable, ``charset.BASE64`` for base64 encoding,
   ``charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
   or ``None`` for no encoding. ``SHORTEST`` is only valid for
   *header_enc*. The default is ``None`` for no encoding.

   Optional *output_charset* is the character set that the output should be in.
   Conversions will proceed from the input charset, to Unicode, to the output
   charset when the method :meth:`Charset.convert` is called. The default is to
   output in the same character set as the input.

   Both *charset* and *output_charset* must have Unicode codec entries in the
   module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
   module does not know about. See the :mod:`codecs` module's documentation for
   more information.

   The global character set registry is kept in the module global dictionary
   ``CHARSETS``.

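A sketch of registering a character set and reading its properties back.
``x-demo-charset`` is a made-up name used only for this illustration, and the
``except`` clause assumes that an invalid *body_enc* is reported with
:exc:`ValueError`. Note that the call modifies the global registry for the
rest of the process.

```python
from email import charset
from email.charset import Charset, add_charset

# 'x-demo-charset' is a made-up name used only for this illustration.
add_charset('x-demo-charset', charset.QP, charset.BASE64, 'utf-8')

cs = Charset('x-demo-charset')
print(cs.header_encoding == charset.QP)
print(cs.body_encoding == charset.BASE64)
print(cs.get_output_charset())

# SHORTEST is only valid for headers, so it is rejected for bodies.
try:
    add_charset('x-demo-charset', charset.SHORTEST, charset.SHORTEST)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```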

.. function:: add_alias(alias, canonical)

   Add a character set alias. *alias* is the alias name, e.g. ``latin-1``.
   *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

   The global charset alias registry is kept in the module global dictionary
   ``ALIASES``.

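A minimal sketch of registering an alias. ``x-demo-latin`` is a made-up alias
used only for this illustration; it is written in lower case because
:class:`Charset` lower-cases its argument before the alias lookup.

```python
from email.charset import Charset, add_alias

# 'x-demo-latin' is a made-up alias used only for this illustration.
add_alias('x-demo-latin', 'iso-8859-1')

cs = Charset('X-Demo-Latin')  # lower-cased before the alias lookup
print(cs.input_charset)
```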

.. function:: add_codec(charset, codecname)

   Add a codec that maps characters in the given character set to and from Unicode.

   *charset* is the canonical name of a character set. *codecname* is the name of a
   Python codec, as appropriate for the second argument to the :func:`unicode`
   built-in, or to the :meth:`encode` method of a Unicode string.

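A sketch of mapping a character set name to a Python codec.
``x-demo-codec-charset`` is a made-up character set name used only for this
illustration, while ``latin-1`` is an existing Python codec. The example
assumes Python 3, where the codec name is passed to :meth:`str.encode`.

```python
from email.charset import Charset, add_codec

# 'x-demo-codec-charset' is a made-up name; 'latin-1' is a real codec.
add_codec('x-demo-codec-charset', 'latin-1')

cs = Charset('x-demo-codec-charset')
print(cs.input_codec)

# The registered codec name can be handed straight to str.encode().
data = 'caf\u00e9'.encode(cs.input_codec)
print(data)
```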