.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^

Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is
   also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a 32-bit
   type for :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On
   platforms where :ctype:`wchar_t` is available and compatible with the chosen
   Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.


.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL* which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode Object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL* which causes the contents to be
   undefined. It is the user's responsibility to fill in the needed data. The
   buffer is copied into the new object. If the buffer is not *NULL*, the return
   value might be a shared object. Therefore, modification of the resulting
   Unicode object is only allowed when *u* is *NULL*.


.. cfunction:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and
   .. % PY_FORMAT_LONG_LONG.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lld`      | long long           | Exactly equivalent to          |
   |                   |                     | ``printf("%lld")``.            |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%llu`      | unsigned long long  | Exactly equivalent to          |
   |                   |                     | ``printf("%llu")``.            |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`ascii`.                 |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Str`.         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::

      The ``"%lld"`` and ``"%llu"`` format specifiers are only available
      when :const:`HAVE_LONG_LONG` is defined.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

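The pairing of :cfunc:`PyUnicode_FromFormat` with :cfunc:`PyUnicode_FromFormatV`
follows the usual C pattern of a variadic front end that forwards to a
``va_list`` worker, which first sizes the result and then formats into it. The
following is a minimal standalone sketch of that pattern using only the C
standard library. It is an illustration under stated assumptions, not how
CPython implements these functions: ``format_v`` and ``format_var`` are
hypothetical names, the result is a plain ``char`` buffer rather than a Unicode
object, and only standard :cfunc:`printf` conversions are handled (none of the
``%U``, ``%S`` or ``%R`` extensions).

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* va_list flavour: does the actual work. The caller frees the result. */
char *format_v(const char *format, va_list vargs)
{
    va_list copy;
    int len;
    char *buf;

    /* First pass: compute the size of the result on a copy of the
       argument list, so that vargs itself stays usable. */
    va_copy(copy, vargs);
    len = vsnprintf(NULL, 0, format, copy);
    va_end(copy);
    if (len < 0)
        return NULL;

    /* Second pass: format into a buffer of exactly the right size. */
    buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return NULL;
    vsnprintf(buf, (size_t)len + 1, format, vargs);
    return buf;
}

/* Variadic flavour: packages its arguments and delegates. */
char *format_var(const char *format, ...)
{
    va_list vargs;
    char *result;

    va_start(vargs, format);
    result = format_v(format, vargs);
    va_end(vargs);
    return result;
}
```

A caller would write ``format_var("%s has %d items", "list", 3)`` and release
the returned buffer with ``free`` once it is no longer needed.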

.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   :class:`bytes`, :class:`bytearray` and other char buffer compatible objects
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be *NULL* to have the interface use the default
   values (see the next section for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.


File System Encoding
""""""""""""""""""""

To encode and decode file names and other environment strings,
:cdata:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
``"surrogateescape"`` should be used as the error handler (:pep:`383`). To
encode file names during argument parsing, the ``"O&"`` converter should be
used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:

.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   ParseTuple converter: encode :class:`str` objects to :class:`bytes` using
   :cfunc:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be a :ctype:`PyBytesObject*` which must be released when it is
   no longer used.

   .. versionadded:: 3.1


To decode file names during argument parsing, the ``"O&"`` converter should be
used, passing :cfunc:`PyUnicode_FSDecoder` as the conversion function:

.. cfunction:: int PyUnicode_FSDecoder(PyObject* obj, void* result)

   ParseTuple converter: decode :class:`bytes` objects to :class:`str` using
   :cfunc:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str` objects are output
   as-is. *result* must be a :ctype:`PyUnicodeObject*` which must be released
   when it is no longer used.

   .. versionadded:: 3.2


.. cfunction:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

   Decode a string of *size* bytes using :cdata:`Py_FileSystemDefaultEncoding`
   and the ``"surrogateescape"`` error handler.

   If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.


.. cfunction:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

   Decode a null-terminated string using :cdata:`Py_FileSystemDefaultEncoding`
   and the ``"surrogateescape"`` error handler.

   If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.

   Use :cfunc:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.


.. cfunction:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

   Encode a Unicode object to :cdata:`Py_FileSystemDefaultEncoding` with the
   ``"surrogateescape"`` error handler, and return :class:`bytes`.

   If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.

   .. versionadded:: 3.2


wchar_t Support
"""""""""""""""

wchar_t support for platforms which support it:

.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Passing -1 as the size indicates that the function must itself compute the length,
   using wcslen.
   Return *NULL* on failure.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

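One way for a caller to guarantee 0-termination is to hold back one slot of
the buffer and write the terminator explicitly. The sketch below shows that
caller-side pattern in plain C. Because it is meant to compile without the
Python headers, it uses ``copy_wide``, a hypothetical stand-in that merely
mimics the contract described above (copy at most *size* characters, terminate
only if room is left, return the count); ``copy_wide_terminated`` is likewise
a hypothetical name.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a copy routine with the contract described
   above: copy at most `size` characters, add a terminator only if room
   is left, and return the number of characters copied. */
static ptrdiff_t copy_wide(const wchar_t *src, wchar_t *w, ptrdiff_t size)
{
    ptrdiff_t n = (ptrdiff_t)wcslen(src);

    if (n > size)
        n = size;
    wmemcpy(w, src, (size_t)n);
    if (n < size)
        w[n] = L'\0';
    return n;
}

/* Caller-side pattern: reserve the last slot and terminate explicitly.
   Returns the number of characters copied, or -1 on error. */
ptrdiff_t copy_wide_terminated(const wchar_t *src, wchar_t *buf,
                               ptrdiff_t buflen)
{
    ptrdiff_t n;

    if (buflen < 1)
        return -1;
    n = copy_wide(src, buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = L'\0';   /* always in bounds because n <= buflen - 1 */
    return n;
}
```

With a four-element buffer, copying ``L"hello"`` this way stores ``L"hel"``
followed by the terminator and reports three characters copied.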

.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
have the same semantics as the parameters of the same name of the built-in
:func:`str` Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used,
which is UTF-8. The file system calls should use
:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string; on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   bytes object. *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`encode` method. The codec
   to be used is looked up using the Python codec registry. Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`encode` method. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

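The effect of a non-*NULL* *consumed* can be pictured with a small standalone
sketch. The hypothetical helper ``utf8_consumable`` below is not part of the
Python API and is not CPython's decoder. Assuming otherwise well-formed input,
it only computes how many leading bytes a stateful decoder could accept,
leaving a truncated trailing sequence for the next call.

```c
#include <stddef.h>

/* Total length of a UTF-8 sequence given its lead byte, or 0 if the
   byte cannot start a sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/* Return how many leading bytes of s[0..size) can be decoded now.
   A sequence that is cut off at the very end is excluded. */
size_t utf8_consumable(const unsigned char *s, size_t size)
{
    size_t back;

    /* A lead byte of an incomplete sequence can sit at most three
       bytes before the end of the buffer. */
    for (back = 1; back <= 3 && back <= size; back++) {
        unsigned char c = s[size - back];

        if ((c & 0xC0) == 0x80)
            continue;             /* continuation byte, keep looking */
        if (utf8_seq_len(c) > back)
            return size - back;   /* lead byte whose tail is missing */
        break;                    /* last sequence is complete */
    }
    return size;
}
```

For the three bytes ``61 E2 82`` (``a`` followed by the first two bytes of a
three-byte sequence) the helper reports one consumable byte; the remaining two
bytes would be kept and prepended to the next chunk of input.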

.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

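The surrogate pairs produced on a narrow build follow the standard UTF-16
arithmetic. The sketch below shows that arithmetic in plain C, independent of
the Python API; ``split_surrogates`` and ``join_surrogates`` are hypothetical
helper names, and the code assumes its inputs are in range (a code point
between U+10000 and U+10FFFF, or a valid high/low pair).

```c
#include <stdint.h>

/* Split a code point in the range U+10000..U+10FFFF into a UTF-16
   surrogate pair, as stored in two 16-bit units. */
void split_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                  /* 20 significant bits */

    *high = (uint16_t)(0xD800 + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));   /* bottom 10 bits */
}

/* Inverse operation: combine a high and a low surrogate into the
   single code point they represent. */
uint32_t join_surrogates(uint16_t high, uint16_t low)
{
    uint32_t top = (uint32_t)(high - 0xD800) << 10;
    uint32_t bottom = (uint32_t)(low - 0xDC00);

    return 0x10000 + (top | bottom);
}
```

For example, U+1F600 splits into the pair ``0xD83D``, ``0xDE00``, and joining
that pair yields U+1F600 again.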

.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

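The three *byteorder* modes of the encoder can be illustrated with a small
standalone sketch. It is written against plain C types rather than the Python
API and is not CPython's encoder: ``utf32_encode`` and its helpers are
hypothetical names, input is taken as an array of ``uint32_t`` code points,
and the caller is assumed to provide an output buffer of at least
``4 * (n + 1)`` bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Detect the byte order of the machine running the code. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;

    return *(const unsigned char *)&probe == 1;
}

/* Store one 32-bit value as four bytes in the requested order. */
static void put_u32(unsigned char *out, uint32_t v, int little)
{
    int i;

    for (i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)(v >> shift);
    }
}

/* Encode n code points into out and return the number of bytes written.
   byteorder: -1 little endian, 1 big endian, 0 native order with BOM. */
size_t utf32_encode(const uint32_t *cps, size_t n, int byteorder,
                    unsigned char *out)
{
    int little = byteorder < 0 ||
                 (byteorder == 0 && host_is_little_endian());
    size_t pos = 0, i;

    if (byteorder == 0) {             /* native order: prepend U+FEFF */
        put_u32(out + pos, 0xFEFF, little);
        pos += 4;
    }
    for (i = 0; i < n; i++) {
        put_u32(out + pos, cps[i], little);
        pos += 4;
    }
    return pos;
}
```

With ``byteorder == -1`` the code point U+0041 is written as the bytes
``41 00 00 00``; with ``byteorder == 1`` it is written as ``00 00 00 41``. With
``byteorder == 0`` four BOM bytes precede the data, in whichever order the
host uses.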

.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

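The BOM rule for ``*byteorder`` can be summarised in a few lines of plain C.
The sketch below models only the documented rule, not CPython's decoder;
``utf16_resolve_byteorder`` is a hypothetical helper that inspects the start
of the input, updates ``*byteorder`` if a BOM selects an order, and reports
how many bytes the decoder would skip.

```c
#include <stddef.h>

/* Apply the BOM rule to the first bytes of a UTF-16 input buffer.
   Returns the number of BOM bytes to skip (0 or 2). */
size_t utf16_resolve_byteorder(const unsigned char *s, size_t size,
                               int *byteorder)
{
    /* Only the "native order" mode looks for a BOM. */
    if (*byteorder == 0 && size >= 2) {
        if (s[0] == 0xFF && s[1] == 0xFE) {   /* U+FEFF, little endian */
            *byteorder = -1;
            return 2;
        }
        if (s[0] == 0xFE && s[1] == 0xFF) {   /* U+FEFF, big endian */
            *byteorder = 1;
            return 2;
        }
    }
    /* Explicit order requested, or no BOM present: nothing is skipped,
       so a BOM would be decoded like any other character. */
    return 0;
}
```

Starting with ``*byteorder == 0`` on input that begins ``FF FE``, the helper
sets ``*byteorder`` to ``-1`` and skips two bytes. Starting with
``*byteorder == 1`` on the same input, nothing is skipped and the order is
left unchanged.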
650
651.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
652
653 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
654 *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
655 trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
656 split surrogate pair) as an error. Those bytes will not be decoded and the
657 number of bytes that have been decoded will be stored in *consumed*.
658
659
.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python bytes object holding the UTF-16 encoding of *unicode* in
   native byte order. The output always starts with a BOM. Error handling is
   "strict". Return *NULL* if an exception was raised by the codec.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF7`. If
   *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error. Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python bytes object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   bytes object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python bytes object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


Character Map Codecs
""""""""""""""""""""

These are the mapping codec APIs.

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode ordinal
(when decoding) or as a Latin-1 ordinal (when encoding). Because of this,
mappings only need to contain those entries which map characters to different
code points.


.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary mapping byte values, or a Unicode string, which is treated as a
   lookup table indexed by byte value. Byte values beyond the end of the string
   and U+FFFE "characters" are treated as "undefined mapping".


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python bytes object. Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python bytes object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
   a Python bytes object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
   use the default error handling.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)

   Compare a unicode object, *uni*, with *string* and return -1, 0, 1 for less
   than, equal, and greater than, respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one element Unicode string. ``-1`` is returned if
   there was an error.


.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place. The argument must be the address of a
   pointer variable pointing to a Python unicode string object. If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :cfunc:`PyUnicode_FromString` and
   :cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
   that has been interned, or a new ("owned") reference to an earlier interned
   string object with the same value.
