.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals.  Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is
   also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a 32-bit
   type for :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On
   platforms where :ctype:`wchar_t` is available and compatible with the chosen
   Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type.  It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object.  *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes.  *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object.  *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible.  This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible.  This macro does not raise exceptions.


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL* which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data.  The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*.  The bytes will be interpreted
   as being UTF-8 encoded.  *u* may also be *NULL* which
   causes the contents to be undefined. It is the user's responsibility to fill in
   the needed data.  The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. cfunction:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it.  The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string.  The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Unicode`.     |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*.  Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error.  The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.


wchar_t Support
"""""""""""""""

:ctype:`wchar_t` support for platforms which support it:

.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given *size*.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*.  At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possible trailing
   0-termination character).  Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error.  Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated.  It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.

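Because :cfunc:`PyUnicode_AsWideChar` does not guarantee 0-termination, one
defensive calling pattern is to keep one slot of the buffer free, pass one less
than the buffer length as *size*, and write the terminator explicitly. The
following is only a sketch of that pattern in plain C. ``copy_wide`` is a
hypothetical stand-in that mimics the documented contract (copy at most *size*
characters, write no terminator); it is not part of the Python API, and
``copy_terminated`` is likewise an illustrative name.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <wchar.h>

   /* Hypothetical stand-in for the documented contract: copy at most `size`
      characters from `src` (length `len`) into `w` and return the number
      copied.  No terminator is written. */
   static ptrdiff_t copy_wide(const wchar_t *src, size_t len,
                              wchar_t *w, size_t size)
   {
       size_t n = len < size ? len : size;
       size_t i;
       assert(src != NULL && w != NULL);
       for (i = 0; i < n; i++)
           w[i] = src[i];
       return (ptrdiff_t)n;
   }

   /* Caller-side pattern: reserve the last slot and terminate explicitly. */
   static ptrdiff_t copy_terminated(const wchar_t *src, size_t len,
                                    wchar_t *buf, size_t buflen)
   {
       ptrdiff_t n;
       assert(buflen > 0);
       n = copy_wide(src, len, buf, buflen - 1);
       if (n >= 0)
           buf[n] = L'\0';
       return n;
   }

With a real Unicode object, the call to ``copy_wide`` would be replaced by
:cfunc:`PyUnicode_AsWideChar` with the same ``buflen - 1`` argument.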

.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the parameters of the same name in the built-in
:func:`unicode` Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding, ASCII, to be used.
The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be *NULL*, meaning that the
default handling defined for the codec is used.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function.  The codec to be used is looked up
   using the Python codec registry.  Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   string object.  *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method.  The codec to be used is
   looked up using the Python codec registry.  Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

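The value that :cfunc:`PyUnicode_DecodeUTF8Stateful` reports in *consumed*
depends on where the last complete UTF-8 sequence in the buffer ends. The
sketch below illustrates that idea in plain C; it is not Python's decoder.
``utf8_complete_prefix`` is a hypothetical helper, and it assumes that the
input is well-formed apart from a possibly truncated final sequence.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Return the number of leading bytes of s[0..n) that do not end in the
      middle of a multi-byte UTF-8 sequence. */
   static size_t utf8_complete_prefix(const unsigned char *s, size_t n)
   {
       size_t i = n;
       size_t back = 0;
       size_t need;
       unsigned char lead;
       assert(s != NULL || n == 0);
       /* Walk back over at most three continuation bytes (10xxxxxx). */
       while (i > 0 && back < 3 && (s[i - 1] & 0xC0) == 0x80) {
           i--;
           back++;
       }
       if (i == 0)
           return n;              /* no lead byte found; leave it to the decoder */
       lead = s[i - 1];
       if (lead < 0x80)
           need = 1;
       else if ((lead & 0xE0) == 0xC0)
           need = 2;
       else if ((lead & 0xF0) == 0xE0)
           need = 3;
       else if ((lead & 0xF8) == 0xF0)
           need = 4;
       else
           return n;              /* invalid lead byte; leave it to the decoder */
       /* back + 1 bytes of the final sequence are present. */
       return (back + 1 < need) ? i - 1 : n;
   }

A caller that reads data in chunks would keep the bytes after the returned
offset and prepend them to the next chunk before decoding again.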

.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

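The byte order rule described for :cfunc:`PyUnicode_DecodeUTF32` can be
sketched in plain C as follows. This is an illustration of the documented
behaviour, not the codec's implementation; ``utf32_check_bom`` is a
hypothetical name. It only handles the BOM itself: when ``*byteorder`` is zero
and no BOM is found, it leaves ``*byteorder`` at zero to stand for native
order.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Inspect the first four bytes of UTF-32 input.  Return the number of
      bytes to skip (0 or 4) and update *byteorder if a BOM selects one. */
   static size_t utf32_check_bom(const unsigned char *s, size_t size,
                                 int *byteorder)
   {
       assert(s != NULL && byteorder != NULL);
       if (*byteorder != 0 || size < 4)
           return 0;               /* explicit order: a BOM is ordinary data */
       if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
           *byteorder = -1;        /* little endian */
           return 4;
       }
       if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
           *byteorder = 1;         /* big endian */
           return 4;
       }
       return 0;                   /* no BOM: stay in native order */
   }

The same shape applies to UTF-16, where the BOM is two bytes long
(``FF FE`` for little endian, ``FE FF`` for big endian).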

.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

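The note that :cfunc:`PyUnicode_EncodeUTF32` outputs a surrogate pair as a
single code point on narrow builds rests on the standard UTF-16 arithmetic.
The sketch below shows that arithmetic in plain C; the helper names are
hypothetical and the code is not taken from the codec.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
   static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

   /* Combine a high and a low surrogate into one code point above U+FFFF. */
   static uint32_t join_surrogates(uint16_t high, uint16_t low)
   {
       assert(is_high_surrogate(high) && is_low_surrogate(low));
       return 0x10000u + (((uint32_t)high - 0xD800u) << 10)
                       + ((uint32_t)low - 0xDC00u);
   }

For example, the pair ``0xD83D``, ``0xDE00`` combines to U+1F600, which then
occupies a single four-byte unit in the UTF-32 output.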

.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM.  Error handling is "strict".  Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size* and an :ctype:`int *`
      type for *consumed*. This might require changes in your code for
      properly supporting 64-bit systems.

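The two kinds of incomplete input that :cfunc:`PyUnicode_DecodeUTF16Stateful`
tolerates, an odd trailing byte and a split surrogate pair, can be detected as
in the following plain C sketch. ``utf16_complete_prefix`` is a hypothetical
helper, the byte order is passed in explicitly as an assumption of the
example, and the code is not the codec's implementation.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Return how many leading bytes of s[0..n) form whole 16-bit units and do
      not end with the first half of a surrogate pair. */
   static size_t utf16_complete_prefix(const unsigned char *s, size_t n,
                                       int little_endian)
   {
       size_t end = n - (n % 2);    /* drop a trailing odd byte */
       unsigned int unit;
       assert(s != NULL || n == 0);
       if (end < 2)
           return end;
       unit = little_endian
           ? (unsigned int)s[end - 2] | ((unsigned int)s[end - 1] << 8)
           : ((unsigned int)s[end - 2] << 8) | (unsigned int)s[end - 1];
       if (unit >= 0xD800 && unit <= 0xDBFF)
           end -= 2;                /* high surrogate still waiting for its pair */
       return end;
   }

The returned offset plays the role of *consumed*: the remaining bytes would be
kept and decoded together with the next chunk of input.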

.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

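On a wide build, :cfunc:`PyUnicode_EncodeUTF16` may have to represent one
:ctype:`Py_UNICODE` value as a surrogate pair. The plain C sketch below shows
the standard UTF-16 split for a code point above U+FFFF. ``utf16_units`` is a
hypothetical helper written for illustration, not the codec's own code.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Write the UTF-16 code units for `cp` into `out` and return how many
      units were written (1 or 2). */
   static int utf16_units(uint32_t cp, uint16_t out[2])
   {
       assert(out != NULL && cp <= 0x10FFFF);
       if (cp < 0x10000) {
           out[0] = (uint16_t)cp;
           return 1;
       }
       cp -= 0x10000;
       out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
       out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
       return 2;
   }

For U+1F600 this yields the pair ``0xD83D``, ``0xDE00``; values up to U+FFFF
are passed through as a single unit.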

.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM.  Error handling is "strict".  Return
   *NULL* if an exception was raised by the codec.


Georg Brandl7d4bfb32010-08-02 21:44:25 +0000675UTF-7 Codecs
676""""""""""""
677
678These are the UTF-7 codec APIs:
679
680
681.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
682
683 Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
684 *s*. Return *NULL* if an exception was raised by the codec.
685
686
Georg Brandl21946af2010-10-06 09:28:45 +0000687.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
Georg Brandl7d4bfb32010-08-02 21:44:25 +0000688
689 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF7`. If
690 *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
691 be treated as an error. Those bytes will not be decoded and the number of
692 bytes that have been decoded will be stored in *consumed*.
693

.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using UTF-7 and
   return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that otherwise has no
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

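To make the output format concrete, here is a small sketch that uses the
Python-level ``unicode_escape`` codec, which is assumed to produce the same
bytes as the C functions above. The sample text is made up for illustration.

```python
text = u"caf\xe9\n\u20ac"

# Non-ASCII characters and control characters become backslash escapes,
# so the result is plain ASCII.
escaped = text.encode("unicode_escape")
restored = escaped.decode("unicode_escape")
```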

Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

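The "raw" variant escapes far less than Unicode-Escape does. The sketch
below, which assumes the Python-level ``raw_unicode_escape`` codec mirrors
these functions, shows that only code points above U+00FF are written as
``\uXXXX``; everything else is emitted as a single byte.

```python
text = u"caf\xe9\n\u20ac"

# U+00E9 and the newline pass through as single bytes; only U+20AC,
# which does not fit in one byte, is written as an escape.
raw = text.encode("raw_unicode_escape")
restored = raw.decode("raw_unicode_escape")
```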

Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals, and only these are accepted by the codecs during encoding.


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

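The 256-ordinal limit shows up as soon as a character above U+00FF is
encoded. This is a minimal sketch using the Python-level ``latin-1`` codec,
on the assumption that it has the same error behaviour as the C functions;
the ``"replace"`` handler stands in for a non-strict *errors* argument.

```python
# U+00E9 is below 256, so it maps to the single byte 0xE9.
ok = u"caf\xe9".encode("latin-1")

# U+20AC is outside the Latin-1 range: "strict" raises an exception.
try:
    u"\u20ac".encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With "replace", the unencodable character becomes "?".
replaced = u"\u20ac".encode("latin-1", "replace")
```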

ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

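On the decoding side, a byte with the high bit set is rejected by the ASCII
codec. The sketch below uses the Python-level ``ascii`` codec as a stand-in
for the C decoder; that correspondence is an assumption.

```python
clean = b"abc".decode("ascii")

# 0xE9 is not 7-bit ASCII, so strict decoding raises an exception.
try:
    b"caf\xe9".decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for each undecodable byte.
lenient = b"caf\xe9".decode("ascii", "replace")
```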

Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.

These are the mapping codec APIs:

.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise, it can
   be either a dictionary that maps byte values, or a Unicode string that is
   treated as a lookup table. Byte values greater than the length of the string
   and U+FFFE "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed a Unicode string as the *mapping* argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

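A tiny three-character code page illustrates both directions. This is only a
sketch: it goes through ``codecs.charmap_decode`` and
``codecs.charmap_encode``, which are assumed to be the Python-level entry
points to these C functions, and the table ``u"abc"`` is invented for the
example.

```python
import codecs

# Decoding with a Unicode string as lookup table: byte value n -> table[n].
table = u"abc"
decoded, consumed = codecs.charmap_decode(b"\x02\x00\x01", "strict", table)

# Encoding with a dict that maps Unicode ordinals to byte ordinals.
enc_map = {ord(ch): i for i, ch in enumerate(table)}
encoded, length = codecs.charmap_encode(u"cab", "strict", enc_map)

# A byte beyond the end of the table is an undefined mapping; the
# "replace" handler substitutes U+FFFD for it.
fallback, _ = codecs.charmap_decode(b"\x07", "replace", table)
```

Both helpers return a pair consisting of the converted data and the number of
input items consumed.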

.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given *size* by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given *size* using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Methods & Slots
"""""""""""""""

.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.

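The two splitting modes are easiest to see side by side. The sketch below
uses the Python-level ``split`` method, which is assumed to follow the same
rules; ``None`` plays the role of a *NULL* separator.

```python
# No separator: split on runs of whitespace, with no limit (-1).
words = u"a  b\tc".split(None, -1)

# Explicit separator with maxsplit=1: only the first comma is used.
limited = u"a,b,c".split(u",", 1)
```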

.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.

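As an illustration of the *keepend* flag and of CRLF handling, here is a
short sketch based on the Python-level ``splitlines`` method, which is assumed
to behave the same way.

```python
text = u"one\r\ntwo\nthree"

# keepend false: line break characters are dropped; CRLF counts once.
stripped = text.splitlines(False)

# keepend true: each line keeps its own terminator.
kept = text.splitlines(True)
```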

.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL*, which indicates
   that the default error handling should be used.

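The three possible outcomes for a character (mapped, deleted, left alone) can
be shown with a small dictionary. This sketch uses the Python-level
``translate`` method on a Unicode string, assumed to apply the same table
rules; the table contents are arbitrary.

```python
table = {
    ord(u"a"): ord(u"A"),   # mapped to another ordinal
    ord(u"b"): None,        # None deletes the character
}                           # "c" is unmapped and copied as-is

result = u"abc".translate(table)
```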

.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

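In Python terms, the two directions correspond to ``startswith`` and
``endswith`` applied to the slice ``str[start:end]``. The sketch below relies
on that assumed correspondence; the file name is just sample data.

```python
s = u"unicodeobject.c"

is_prefix = s.startswith(u"unicode", 0, len(s))   # like direction == -1
is_suffix = s.endswith(u".c", 0, len(s))          # like direction == 1

# The match is made against s[0:13], that is u"unicodeobject".
inner_suffix = s.endswith(u"object", 0, 13)
```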

.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

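Forward and backward searches map naturally onto ``find`` and ``rfind``. The
sketch below assumes that correspondence. Note that the Python methods raise
an exception on error, so the ``-2`` return value has no Python-level
counterpart.

```python
s = u"unicodeobject"

forward = s.find(u"o", 0, len(s))     # like direction == 1
backward = s.rfind(u"o", 0, len(s))   # like direction == -1
missing = s.find(u"x", 0, len(s))     # no match
```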

.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.

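"Non-overlapping" matters when the substring can match at adjacent
positions. The following sketch uses the Python-level ``count`` method, which
is assumed to count in the same way.

```python
# u"aa" fits at offsets 0 and 2 without overlap, so the count is 2, not 3.
whole = u"aaaa".count(u"aa")

# Restricting the search to s[1:4] == u"aaa" leaves room for one match.
sliced = u"aaaa".count(u"aa", 1, 4)
```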

.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.

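The effect of *maxcount* can be sketched with the Python-level ``replace``
method, assuming it takes the same count argument with the same meaning for
``-1``.

```python
s = u"a-b-c"

first_only = s.replace(u"-", u"+", 1)    # at most one replacement
everything = s.replace(u"-", u"+", -1)   # -1 replaces all occurrences
```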

.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.