.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit type
   for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
   possible to build a UCS4 version of Python (most recent Linux distributions come
   with UCS4 builds of Python). These builds then use a 32-bit type for
   :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
   where :ctype:`wchar_t` is available and compatible with the chosen Python
   Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.
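
To make the difference between the two build variants concrete, the following
standalone sketch counts how many storage units the same text needs with 16-bit
and with 32-bit units. It does not use the Python headers; the function names
``narrow_units`` and ``wide_units`` are made up for this illustration, and the
sketch assumes that a 16-bit build holds a code point above U+FFFF as a
surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 16-bit units a narrow (UCS2-style) buffer needs for n code
   points: a code point above U+FFFF takes two units (a surrogate pair). */
size_t narrow_units(const uint32_t *cps, size_t n)
{
    size_t units = 0;
    for (size_t i = 0; i < n; i++)
        units += (cps[i] > 0xFFFF) ? 2 : 1;
    return units;
}

/* A wide (UCS4) buffer uses exactly one 32-bit unit per code point. */
size_t wide_units(const uint32_t *cps, size_t n)
{
    (void)cps;
    return n;
}
```

Both the unit count and the unit width differ between the two layouts, which is
one way to see why an extension compiled against one variant cannot read
buffers produced by the other.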


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.
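
Because the conversion macros signal failure through a ``-1`` return value
rather than an exception, the caller has to test every result. The standalone
sketch below shows that pattern. ``to_decimal`` is a hypothetical stand-in that
only knows ASCII digits and the ARABIC-INDIC digits U+0660 to U+0669; the real
macro consults the full Unicode database. ``parse_decimal`` is likewise an
invented helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a "to decimal" conversion: handles ASCII digits
   and ARABIC-INDIC digits only, and returns -1 for everything else. */
int to_decimal(uint32_t ch)
{
    if (ch >= '0' && ch <= '9')
        return (int)(ch - '0');
    if (ch >= 0x0660 && ch <= 0x0669)
        return (int)(ch - 0x0660);
    return -1;
}

/* Parse a run of decimal characters. Returns -1 if the buffer is empty or
   any character is not a decimal digit. No overflow check, for brevity. */
long parse_decimal(const uint32_t *s, size_t n)
{
    long value = 0;
    if (n == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        int d = to_decimal(s[i]);
        if (d < 0)
            return -1;          /* no exception is set; just report failure */
        value = value * 10 + d;
    }
    return value;
}
```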


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL* which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL* which
   causes the contents to be undefined. It is the user's responsibility to fill in
   the needed data. The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. cfunction:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Unicode`.     |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6
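
The relationship between a variadic formatting function and its ``va_list``
counterpart, and the "calculate the size, then format" approach described
above, can be sketched with the standard C library alone. This is only an
analogy built on ``vsnprintf``; it is not how the Unicode functions are
implemented, and ``format_alloc`` and ``format_alloc_v`` are names invented for
this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* va_list flavour: measure the result first, then format into a buffer of
   exactly that size. The caller owns (and frees) the returned string. */
char *format_alloc_v(const char *format, va_list vargs)
{
    va_list copy;
    va_copy(copy, vargs);               /* each pass consumes its va_list */
    int needed = vsnprintf(NULL, 0, format, copy);
    va_end(copy);
    if (needed < 0)
        return NULL;

    char *buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;
    vsnprintf(buf, (size_t)needed + 1, format, vargs);
    return buf;
}

/* Variadic flavour: a thin wrapper around the va_list version. */
char *format_alloc(const char *format, ...)
{
    va_list vargs;
    va_start(vargs, format);
    char *result = format_alloc_v(format, vargs);
    va_end(vargs);
    return result;
}
```

A wrapper function that itself receives ``...`` arguments would call the
``va_list`` flavour in the same way, which is the usual reason for providing
both forms.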


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :ctype:`Py_UNICODE` buffer, or *NULL* if *unicode* is not a Unicode object.
   Note that the resulting :ctype:`Py_UNICODE*` string may contain embedded
   null characters, which would cause the string to be truncated when used in
   most C functions.
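
The truncation caused by embedded null characters can be demonstrated without
the Python headers. The sketch below uses a plain ``wchar_t`` buffer as a
stand-in for the returned pointer and reports how much of it a null-terminated
string function would see; ``c_visible_length`` is a helper name made up for
this illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Return how many units a NUL-terminated-string function would see in a
   buffer whose true length is known. */
size_t c_visible_length(const wchar_t *buf, size_t length)
{
    const wchar_t *nul = wmemchr(buf, L'\0', length);
    return nul ? (size_t)(nul - buf) : length;
}
```

For the five-character buffer ``L"ab\0cd"`` this returns 2, so code that needs
the whole string should carry the length obtained from the object alongside
the pointer instead of relying on a terminator.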


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given encoding and using the error handling defined by errors. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.


wchar_t Support
"""""""""""""""

:ctype:`wchar_t` support for platforms which support it:

.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given *size*.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application. Also, note that the :ctype:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.
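
One common way for a caller to guarantee termination is to reserve one slot of
the destination buffer and write the terminator explicitly. The sketch below
shows the pattern with a stand-in copy routine, ``copy_units``, which is
written here to mimic the "may or may not be 0-terminated" behaviour described
above; it is not the Python function, and both helper names are assumptions of
this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Stand-in copy routine: copies at most size units and only adds a
   terminator when there is room left for one. */
size_t copy_units(const wchar_t *src, size_t src_len, wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    wmemcpy(dst, src, n);
    if (n < size)
        dst[n] = L'\0';
    return n;
}

/* Caller-side pattern: pass capacity - 1 and always terminate explicitly. */
size_t copy_terminated(const wchar_t *src, size_t src_len,
                       wchar_t *dst, size_t capacity)
{
    if (capacity == 0)
        return 0;
    size_t n = copy_units(src, src_len, dst, capacity - 1);
    dst[n] = L'\0';
    return n;
}
```

With the real function the same idea would amount to passing the buffer
capacity minus one as *size* and, when the return value is not -1, storing a
zero at the returned index.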


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as those of the built-in :func:`unicode` Unicode
object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).
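
The effect of the *errors* argument can be sketched with a toy ASCII encoder
that needs no Python headers. Besides "strict" it handles the "ignore" and
"replace" handler names, which Python's codecs also understand but which are
not described in this section. ``encode_ascii`` is a name invented for the
example, and its ``-1`` return value stands in for the exception a real codec
would raise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode n code points to ASCII. errors may be NULL (treated as "strict"),
   "strict", "ignore" or "replace". Returns the number of bytes written, or
   -1 where a real codec would raise. out must have room for n bytes. */
long encode_ascii(const uint32_t *s, size_t n, const char *errors, char *out)
{
    size_t written = 0;
    if (errors == NULL)
        errors = "strict";              /* NULL selects the default handling */
    for (size_t i = 0; i < n; i++) {
        if (s[i] < 0x80)
            out[written++] = (char)s[i];
        else if (strcmp(errors, "replace") == 0)
            out[written++] = '?';       /* substitute a marker */
        else if (strcmp(errors, "ignore") == 0)
            continue;                   /* drop the character */
        else
            return -1;                  /* "strict", or an unknown handler */
    }
    return (long)written;
}
```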

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   In a narrow build, codepoints outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
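
The BOM check that applies when ``*byteorder`` is zero can be sketched as a
standalone function. It returns the same ``-1`` / ``1`` convention as the
*byteorder* values listed above, and ``0`` when no BOM is present. The function
name ``utf32_bom_order`` is made up for this example, and the code is an
illustration rather than the decoder's actual logic.

```c
#include <stddef.h>

/* Return -1 for a little-endian BOM (FF FE 00 00), 1 for a big-endian BOM
   (00 00 FE FF), and 0 if the buffer does not start with a UTF-32 BOM. */
int utf32_bom_order(const unsigned char *s, size_t size)
{
    if (size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00)
        return -1;
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
        return 1;
    return 0;
}
```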


.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
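
The three *byteorder* modes of the encoder can be mimicked by a small
standalone routine. The sketch below writes already-decoded code points (it
does not combine surrogate pairs) and is not the codec's implementation; all
three function names are invented here. Unlike the real function it treats
only ``-1`` as little endian and any other nonzero value as big endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Write v as four bytes in the requested order. */
void put_u32(unsigned char *out, uint32_t v, int little)
{
    for (int i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)(v >> shift);
    }
}

/* Encode n code points as UTF-32. byteorder: -1 little endian, 1 big endian,
   0 native order preceded by a BOM (U+FEFF). out must hold 4 * (n + 1)
   bytes. Returns the number of bytes written. */
size_t encode_utf32(const uint32_t *s, size_t n, int byteorder,
                    unsigned char *out)
{
    int little = byteorder == -1 ||
                 (byteorder == 0 && host_is_little_endian());
    size_t pos = 0;
    if (byteorder == 0) {
        put_u32(out, 0xFEFF, little);   /* BOM only in native-order mode */
        pos = 4;
    }
    for (size_t i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, s[i], little);
    return pos;
}
```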


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
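
Why a BOM that is copied to the output can appear as either ``\ufeff`` or
``\ufffe`` becomes clear when the same two bytes are read under both byte
orders. The helper below is a standalone illustration; ``read_u16`` is a
made-up name and follows the ``-1`` / ``1`` convention listed above.

```c
#include <stdint.h>

/* Read one 16-bit unit from p. byteorder: -1 little endian, 1 big endian. */
uint16_t read_u16(const unsigned char *p, int byteorder)
{
    if (byteorder < 0)
        return (uint16_t)(p[0] | (p[1] << 8));
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

The bytes ``FF FE`` read as little endian give U+FEFF, the BOM itself. The same
bytes read as big endian give U+FFFE, which indicates that the forced byte
order does not match the data.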


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size* and an :ctype:`int *`
      type for *consumed*. This might require changes in your code for
      properly supporting 64-bit systems.
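
The two kinds of incomplete trailing data named above, an odd byte and a split
surrogate pair, can be detected as in the following standalone sketch. It
assumes little-endian input that is otherwise well formed, and
``utf16le_complete_prefix`` is a hypothetical helper rather than part of the
codec.

```c
#include <stddef.h>

/* For little-endian UTF-16 input, return how many leading bytes form
   complete code points. A trailing odd byte and a trailing lone high
   surrogate (0xD800..0xDBFF) are left for the next call. */
size_t utf16le_complete_prefix(const unsigned char *s, size_t size)
{
    size_t usable = size - (size % 2);      /* drop an odd trailing byte */
    if (usable >= 2) {
        unsigned unit = s[usable - 2] | ((unsigned)s[usable - 1] << 8);
        if (unit >= 0xD800 && unit <= 0xDBFF)
            usable -= 2;                    /* first half of a split pair */
    }
    return usable;
}
```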


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
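
The surrogate-pair representation mentioned for wide builds follows the
standard UTF-16 arithmetic, shown here as a standalone sketch. The function
name ``to_utf16_units`` is invented for the example, and the code assumes a
valid code point (at most U+10FFFF and not itself a surrogate).

```c
#include <stdint.h>

/* Split a code point into UTF-16 code units. Returns the number of units
   written to out: 1 for the BMP, 2 (a surrogate pair) above U+FFFF. */
int to_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));       /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));     /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the pair ``0xD83D 0xDE00``, while U+00E9 stays a
single unit.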
671
Georg Brandlf6842722008-01-19 22:08:21 +0000672
673.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
674
675 Return a Python string using the UTF-16 encoding in native byte order. The
676 string always starts with a BOM mark. Error handling is "strict". Return
677 *NULL* if an exception was raised by the codec.
678
Georg Brandlf6842722008-01-19 22:08:21 +0000679
UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF7`. If
   *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error. Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using UTF-7 and
   return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that otherwise has no
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.
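
The restriction to the first 256 ordinals can be pictured with a small
standalone sketch. ``sketch_latin1_encode`` is a hypothetical helper and not
part of the Python API; where it stops early, the real codec would consult the
*errors* argument instead:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical helper: copy code points to out as Latin-1 bytes and
      return how many were encoded.  A result smaller than n is the index
      of the first code point outside the range 0..255. */
   static size_t
   sketch_latin1_encode(const uint32_t *s, size_t n, unsigned char *out)
   {
       size_t i;

       for (i = 0; i < n && s[i] <= 0xFF; i++)
           out[i] = (unsigned char)s[i];   /* ordinal and byte value agree */
       return i;
   }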


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
entries which map characters to different code points.

These are the mapping codec APIs:

.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary mapping bytes, or a Unicode string which is treated as a lookup
   table. Byte values greater than or equal to the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
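
   The lookup-table case can be sketched in plain C. ``sketch_charmap_decode``
   and ``SKETCH_UNDEFINED`` are hypothetical names, the table is an array of
   16-bit values standing in for the Unicode string, and the sketch simply stops
   at the first undefined mapping instead of applying an error handler:

   .. code-block:: c

      #include <stddef.h>
      #include <stdint.h>

      #define SKETCH_UNDEFINED 0xFFFE

      /* Hypothetical helper, not part of the Python API: decode size bytes
         through a lookup table with tablen entries.  Returns the number of
         code units written, or -1 at the first undefined mapping. */
      static ptrdiff_t
      sketch_charmap_decode(const unsigned char *s, size_t size,
                            const uint16_t *table, size_t tablen,
                            uint16_t *out)
      {
          size_t i;

          for (i = 0; i < size; i++) {
              /* A byte without a table entry, or one that maps to U+FFFE,
                 counts as an undefined mapping. */
              if (s[i] >= tablen || table[s[i]] == SKETCH_UNDEFINED)
                  return -1;
              out[i] = table[s[i]];
          }
          return (ptrdiff_t)size;
      }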

   .. versionchanged:: 2.4
      A Unicode string is allowed as the *mapping* argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given *size* by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The *table* must map Unicode ordinal integers to Unicode ordinal integers or
   ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.
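
   As an illustration only, the translation rules can be modelled on plain
   arrays. ``sketch_translate``, ``sketch_map_entry`` and ``SKETCH_DELETE`` are
   hypothetical names; ``SKETCH_DELETE`` stands in for ``None`` and a missing
   entry plays the role of a :exc:`LookupError`:

   .. code-block:: c

      #include <stddef.h>
      #include <stdint.h>

      #define SKETCH_DELETE (-1L)     /* stands in for None */

      typedef struct {
          uint32_t from;              /* ordinal to look up */
          long to;                    /* replacement ordinal or SKETCH_DELETE */
      } sketch_map_entry;

      /* Hypothetical helper, not part of the Python API: translate size
         ordinals from s into out and return the number written. */
      static size_t
      sketch_translate(const uint32_t *s, size_t size,
                       const sketch_map_entry *table, size_t ntable,
                       uint32_t *out)
      {
          size_t i, j, n = 0;

          for (i = 0; i < size; i++) {
              long target = (long)s[i];   /* unmapped: copied as-is */
              for (j = 0; j < ntable; j++) {
                  if (table[j].from == s[i]) {
                      target = table[j].to;
                      break;
                  }
              }
              if (target != SKETCH_DELETE)    /* deleted ordinals are skipped */
                  out[n++] = (uint32_t)target;
          }
          return n;
      }

   With a table that maps ``a`` to ``b`` and deletes ``x``, the input ``axc``
   becomes ``bc``.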

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Methods & Slots
"""""""""""""""

.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL*, which indicates
   that the default error handling should be used.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.
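
   The meaning of *direction* can be sketched on ordinary ``char`` buffers.
   ``sketch_tailmatch`` is a hypothetical helper, not a Python API; it assumes
   ``0 <= start <= end <= strlen(str)`` and treats any positive *direction* as a
   suffix match:

   .. code-block:: c

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical helper: return 1 if substr sits at the chosen end of
         str[start:end], 0 otherwise.  It has no error path. */
      static int
      sketch_tailmatch(const char *str, const char *substr,
                       size_t start, size_t end, int direction)
      {
          size_t sublen = strlen(substr);

          if (sublen > end - start)
              return 0;               /* substr is longer than the slice */
          if (direction > 0)
              start = end - sublen;   /* suffix: compare at the end of the slice */
          return memcmp(str + start, substr, sublen) == 0;
      }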

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.
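
   A rough model of the two search directions, again on ``char`` buffers, is
   given below. ``sketch_find`` is a hypothetical helper; it assumes
   ``0 <= start <= end <= strlen(str)`` and has no error path, so it never
   returns ``-2``. Matches are reported as indexes into the whole buffer, as the
   Python-level ``find`` method does:

   .. code-block:: c

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical helper: search str[start:end] for substr, forwards if
         direction is positive and backwards otherwise.  Returns the index
         of the match, or -1 if there is none. */
      static ptrdiff_t
      sketch_find(const char *str, const char *substr,
                  size_t start, size_t end, int direction)
      {
          size_t sublen = strlen(substr);
          size_t i, last;

          if (sublen > end - start)
              return -1;
          last = end - sublen;        /* last index where a match can begin */
          if (direction > 0) {
              for (i = start; i <= last; i++)
                  if (memcmp(str + i, substr, sublen) == 0)
                      return (ptrdiff_t)i;
          }
          else {
              /* Walk from last down to start without underflowing i. */
              for (i = last + 1; i-- > start; )
                  if (memcmp(str + i, substr, sublen) == 0)
                      return (ptrdiff_t)i;
          }
          return -1;
      }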

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.
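
   Non-overlapping counting means that the scan resumes after the end of each
   match. ``sketch_count`` below is a hypothetical helper on ``char`` buffers
   that assumes ``0 <= start <= end <= strlen(str)`` and does not model the
   empty-substring case:

   .. code-block:: c

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical helper: count non-overlapping occurrences of substr in
         str[start:end].  An empty substr is outside the scope of this
         sketch and yields 0. */
      static size_t
      sketch_count(const char *str, const char *substr,
                   size_t start, size_t end)
      {
          size_t sublen = strlen(substr);
          size_t i = start, count = 0;

          if (sublen == 0)
              return 0;
          while (i + sublen <= end) {
              if (memcmp(str + i, substr, sublen) == 0) {
                  count++;
                  i += sublen;    /* skip the match so that none overlap */
              }
              else {
                  i++;
              }
          }
          return count;
      }

   With this rule ``"aa"`` occurs twice in ``"aaaa"``, not three times.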

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.