.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals.  Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2.  It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python).  These builds then use a
   32-bit type for :ctype:`Py_UNICODE` and store Unicode data internally as
   UCS4.  On platforms where :ctype:`wchar_t` is available and compatible with
   the chosen Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef
   alias for :ctype:`wchar_t` to enhance native platform compatibility.  On all
   other platforms, :ctype:`Py_UNICODE` is a typedef alias for either
   :ctype:`unsigned short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible.  Please keep
this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type.  It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object.  *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type.  This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes.  *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type.  This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object.  *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object.  *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list.  Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties.  The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer.  Return ``-1`` if
   this is not possible.  This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double.  Return ``-1.0`` if this is not
   possible.  This macro does not raise exceptions.


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size.  *u* may be *NULL*, which causes the contents to be undefined.  It is
   the user's responsibility to fill in the needed data.  The buffer is copied
   into the new object.  If the buffer is not *NULL*, the return value might be
   a shared object.  Therefore, modification of the resulting Unicode object is
   only allowed when *u* is *NULL*.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*.  The bytes will be
   interpreted as being UTF-8 encoded.  *u* may also be *NULL*, which causes the
   contents to be undefined.  It is the user's responsibility to fill in the
   needed data.  The buffer is copied into the new object.  If the buffer is not
   *NULL*, the return value might be a shared object.  Therefore, modification
   of the resulting Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. cfunction:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it.  The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string.  The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Unicode`.     |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type.  This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*.  Both can
   be *NULL* to have the interface use the default values (see the next section
   for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error.  The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.


wchar_t Support
"""""""""""""""

wchar_t support for platforms which support it:

.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*.  At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character).  Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error.  Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated.  It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *size*.  This might require changes in your code for properly
      supporting 64-bit systems.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed.  All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*.  These
parameters have the same semantics as the ones of the built-in :func:`unicode`
Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII.  The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names.  This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function.  The codec to be used is looked up
   using the Python codec registry.  Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   string object.  *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method.  The codec to be used is
   looked up using the Python codec registry.  Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method.  The codec to be used is looked up using
   the Python codec registry.  Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`.  If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error.  Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.

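The idea of a "trailing incomplete UTF-8 byte sequence" can be illustrated with
a small standalone C sketch.  It does not call the Python API and is not the
codec's implementation; the helper names ``utf8_sequence_length`` and
``utf8_complete_prefix`` are hypothetical.  Assuming otherwise well-formed
input, it computes how many leading bytes form complete sequences, which is the
kind of value a stateful decoder would report through *consumed*:

```c
#include <assert.h>
#include <stddef.h>

/* Total length of a UTF-8 sequence announced by its lead byte,
   or 0 if the byte cannot start a sequence. */
static size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper: number of leading bytes of s[0..size) that can be
   decoded now, leaving an incomplete sequence at the end untouched. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t back;

    assert(s != NULL || size == 0);
    /* An incomplete sequence has at most 3 bytes, so its lead byte
       must be among the last 3 bytes of the buffer. */
    for (back = 1; back <= 3 && back <= size; back++) {
        unsigned char c = s[size - back];
        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte, look further back */
        if (utf8_sequence_length(c) > back)
            return size - back;         /* sequence is cut off: stop before it */
        return size;                    /* last sequence is complete (or invalid) */
    }
    return size;
}
```

A caller feeding data in chunks would decode the first
``utf8_complete_prefix(s, size)`` bytes and prepend the remaining bytes to the
next chunk.  Malformed input is deliberately left for the decoder's error
handling.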

.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
   handling.  It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

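The start-up rule described above (a BOM is only honoured and stripped when the
requested byte order is ``0``) can be sketched in standalone C.  This is an
illustration under stated assumptions, not the codec's source: the function
names are hypothetical, nothing here calls the Python API, and a return value
of ``0`` simply means "no BOM seen, fall back to native order":

```c
#include <assert.h>
#include <stddef.h>

/* Byte order announced by a leading UTF-32 BOM:
   -1 little endian, 1 big endian, 0 if there is no BOM. */
static int utf32_bom_order(const unsigned char *s, size_t size)
{
    if (size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00)
        return -1;
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
        return 1;
    return 0;
}

/* Hypothetical helper: byte order a decoder would start with. */
int utf32_initial_byteorder(const unsigned char *s, size_t size, int byteorder)
{
    assert(s != NULL || size == 0);
    if (byteorder != 0)
        return byteorder;      /* explicit order wins; a BOM stays in the data */
    return utf32_bom_order(s, size);
}

/* Hypothetical helper: number of BOM bytes to drop from the input. */
size_t utf32_bom_bytes_to_skip(const unsigned char *s, size_t size, int byteorder)
{
    assert(s != NULL || size == 0);
    if (byteorder != 0)
        return 0;              /* BOM, if any, is decoded like ordinary data */
    return utf32_bom_order(s, size) != 0 ? 4 : 0;
}
```

With an explicit byte order of ``-1`` or ``1`` both helpers leave the input
alone, which matches the statement that any byte order mark is then copied to
the output.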

.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`.  If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error.  Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF).  In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

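To make the remark about surrogate pairs concrete, the following standalone C
sketch shows how two 16-bit units from a narrow-build style buffer collapse
into one code point, and how that affects the size of UTF-32 output including
the optional BOM.  It is only an illustration: ``uint16_t`` stands in for a
16-bit :ctype:`Py_UNICODE`, and all function names are hypothetical rather than
part of the Python API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a high/low surrogate pair into the code point it represents. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000 + (((uint32_t)high - 0xD800) << 10) + ((uint32_t)low - 0xDC00);
}

/* Hypothetical helper: bytes of UTF-32 output for n 16-bit units.
   Each code point takes 4 bytes; byteorder == 0 adds a 4-byte BOM. */
size_t utf32_encoded_size(const uint16_t *units, size_t n, int byteorder)
{
    size_t i = 0;
    size_t code_points = 0;

    assert(units != NULL || n == 0);
    while (i < n) {
        if (is_high_surrogate(units[i]) && i + 1 < n && is_low_surrogate(units[i + 1]))
            i += 2;            /* a surrogate pair becomes one code point */
        else
            i += 1;
        code_points++;
    }
    return 4 * (code_points + (byteorder == 0 ? 1 : 0));
}
```

For example, the units ``0xD83D 0xDE00`` combine to U+1F600, so a buffer
holding ``A`` followed by that pair needs 8 bytes with an explicit byte order
and 12 bytes when a BOM is written.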

.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order.  The
   string always starts with a BOM mark.  Error handling is "strict".  Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
   handling.  It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`.  If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error.  Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size* and an :ctype:`int *`
      type for *consumed*.  This might require changes in your code for
      properly supporting 64-bit systems.

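The two kinds of incomplete trailing data named above, an odd byte count and a
split surrogate pair, can be illustrated with a standalone C sketch.  It does
not use the Python API; ``utf16_complete_prefix`` is a hypothetical helper, and
it assumes the byte order has already been resolved to ``-1`` (little endian)
or ``1`` (big endian):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of leading bytes of s[0..size) that form
   complete UTF-16 data.  byteorder must be -1 (little) or 1 (big). */
size_t utf16_complete_prefix(const unsigned char *s, size_t size, int byteorder)
{
    size_t usable = size & ~(size_t)1;   /* drop a trailing odd byte */
    unsigned int unit;

    assert(s != NULL || size == 0);
    assert(byteorder == -1 || byteorder == 1);
    if (usable < 2)
        return usable;

    /* Read the last complete 16-bit unit in the requested byte order. */
    if (byteorder > 0)
        unit = ((unsigned int)s[usable - 2] << 8) | s[usable - 1];
    else
        unit = s[usable - 2] | ((unsigned int)s[usable - 1] << 8);

    /* A high surrogate at the very end is waiting for its low half. */
    if (unit >= 0xD800 && unit <= 0xDBFF)
        usable -= 2;
    return usable;
}
```

The bytes beyond the returned count would be kept by the caller and prepended
to the next chunk of input before decoding continues.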

.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF).  In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair.  If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.

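How a single wide value "may get represented as a surrogate pair" follows from
the UTF-16 encoding form.  The standalone C sketch below shows the arithmetic;
it is not taken from the codec, ``uint32_t`` stands in for a 32-bit
:ctype:`Py_UNICODE`, and the function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit units needed to encode one code point in UTF-16. */
size_t utf16_units_needed(uint32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;
}

/* High (first) surrogate for a code point outside the BMP. */
uint16_t utf16_high_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Low (second) surrogate for a code point outside the BMP. */
uint16_t utf16_low_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For instance, U+1F600 is written as the two units ``0xD83D 0xDE00``, each of
which is then serialized in the selected byte order.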

.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order.  The
   string always starts with a BOM mark.  Error handling is "strict".  Return
   *NULL* if an exception was raised by the codec.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF7`.  If
   *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error.  Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object.  Return *NULL* if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
   Python "utf-7" codec.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object.  Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*.  This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as Python
   string object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals, and only these are accepted by the codecs during encoding.


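The rule that only the first 256 ordinals can be encoded may be easier to see
in a standalone sketch. The code below does not call the Python API and is not
how CPython implements the codec; the function names and the use of
``unsigned long`` for code points are assumptions made for illustration. It
models "strict" error handling by reporting the index of the first code point
that does not fit.

```c
#include <stddef.h>

/* Illustrative only: a model of strict Latin-1 encoding.
   Copies `size` code points from `src` into `dst` as single bytes.
   Returns 0 on success. Returns -1 and stores the offending index in
   *bad when a code point lies outside the range 0..255. */
int latin1_encode_strict(const unsigned long *src, size_t size,
                         unsigned char *dst, size_t *bad)
{
    size_t i;
    for (i = 0; i < size; i++) {
        if (src[i] > 0xFF) {
            *bad = i;
            return -1;
        }
        dst[i] = (unsigned char)src[i];
    }
    return 0;
}

/* Decoding cannot fail: every byte value is itself a valid ordinal. */
void latin1_decode(const unsigned char *src, size_t size,
                   unsigned long *dst)
{
    size_t i;
    for (i = 0; i < size; i++)
        dst[i] = src[i];
}
```

In this model, encoding U+00E9 succeeds and yields the byte ``0xE9``, while
encoding U+20AC fails because it lies outside the Latin-1 range.
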
.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Character Map Codecs
""""""""""""""""""""

These are the mapping codec APIs.

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or
Latin-1 ordinal, respectively. Because of this, mappings only need to contain
those entries which map characters to different code points.


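The lookup rules above can be modelled in a few lines of plain C. This is a
sketch of the described behaviour for the decoding direction only, not
CPython's implementation, and it does not use the Python API: the
``map_entry`` structure, the ``MAP_UNDEFINED`` sentinel (standing in for
None) and both function names are hypothetical. A key that is absent from the
table plays the role of a failed lookup, so the byte is copied as-is.

```c
#include <stddef.h>

#define MAP_UNDEFINED (-1L)   /* stands in for a None entry in the mapping */

struct map_entry {
    unsigned char key;        /* encoded byte */
    long value;               /* Unicode ordinal, or MAP_UNDEFINED */
};

/* Look up `byte` in a sparse table of `n` entries. A missing key models a
   failed lookup: the byte's own value is used as the ordinal. */
long charmap_lookup(const struct map_entry *map, size_t n, unsigned char byte)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (map[i].key == byte)
            return map[i].value;
    return (long)byte;
}

/* Decode `size` bytes into `dst`. Returns the number of ordinals written,
   or -1 when a byte maps to MAP_UNDEFINED ("undefined mapping"). */
long charmap_decode(const unsigned char *src, size_t size,
                    const struct map_entry *map, size_t n, long *dst)
{
    size_t i;
    for (i = 0; i < size; i++) {
        long cp = charmap_lookup(map, n, src[i]);
        if (cp == MAP_UNDEFINED)
            return -1;
        dst[i] = cp;
    }
    return (long)size;
}
```

With a table containing only ``{0x80, 0x20AC}`` and ``{0x81, MAP_UNDEFINED}``,
the byte ``'A'`` passes through unchanged, ``0x80`` decodes to U+20AC, and
``0x81`` is reported as an error. This is why a mapping only needs entries for
characters that change.
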
.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary that maps bytes, or a Unicode string that is treated as a
   lookup table. Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Methods & Slots
"""""""""""""""

.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If *maxsplit* is negative,
   no limit is set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


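The CRLF and *keepend* rules of :cfunc:`PyUnicode_Splitlines` can be
illustrated with a small scanner over a plain ``char`` buffer. This is a sketch
under simplifying assumptions, not the actual implementation: it recognises
only ``\n``, ``\r`` and ``\r\n`` as line breaks (the real function works on
Unicode data and may treat further characters as line breaks), and the name
``next_line`` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Report the next line of `s` (length `len`), scanning from *pos.
   On success, stores the line's offset in *start and its length in
   *length, advances *pos past the line break, and returns 1.
   Returns 0 once the input is exhausted.
   "\r\n" is consumed as a single line break. When `keepend` is nonzero,
   the reported length includes the line break characters. */
int next_line(const char *s, size_t len, size_t *pos, int keepend,
              size_t *start, size_t *length)
{
    size_t i = *pos;
    size_t eol;

    if (i >= len)
        return 0;
    *start = i;
    while (i < len && s[i] != '\n' && s[i] != '\r')
        i++;
    eol = i;                          /* end of the line's text */
    if (i < len) {
        if (s[i] == '\r' && i + 1 < len && s[i + 1] == '\n')
            i += 2;                   /* CRLF counts as one break */
        else
            i += 1;
    }
    *length = (keepend ? i : eol) - *start;
    *pos = i;
    return 1;
}
```

For the seven-byte input ``"a\r\nbc\nd"``, the scanner reports three lines.
Their lengths are 1, 2 and 1 when *keepend* is 0, and 3, 3 and 1 otherwise.
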
.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL*, which indicates
   that the default error handling should be used.


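The three outcomes of a table lookup during translation (replace, delete, copy
as-is) can be modelled as follows. This is a standalone sketch that does not
use the Python API; the ``tr_entry`` structure, the ``TR_DELETE`` sentinel
(standing in for None) and the ``translate`` function are hypothetical names
chosen for illustration.

```c
#include <stddef.h>

#define TR_DELETE (-1L)       /* stands in for a None entry in the table */

struct tr_entry {
    long from;                /* source ordinal */
    long to;                  /* target ordinal, or TR_DELETE */
};

/* Translate `size` ordinals from `src` into `dst` (which must have room
   for `size` items) and return the number of ordinals written.
   Ordinals without a table entry are copied unchanged. */
size_t translate(const long *src, size_t size,
                 const struct tr_entry *table, size_t n, long *dst)
{
    size_t i, j, out = 0;
    for (i = 0; i < size; i++) {
        long cp = src[i];             /* default: unmapped, copy as-is */
        for (j = 0; j < n; j++) {
            if (table[j].from == src[i]) {
                cp = table[j].to;
                break;
            }
        }
        if (cp != TR_DELETE)          /* a deleted character emits nothing */
            dst[out++] = cp;
    }
    return out;
}
```

Given a table that maps ``'a'`` to ``'A'`` and ``'-'`` to ``TR_DELETE``, the
input ``a-b`` becomes ``Ab``: one character replaced, one deleted, one copied.
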
.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


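The prefix and suffix rule of :cfunc:`PyUnicode_Tailmatch` can be sketched on
ordinary C strings. The code below is an illustration only: it does not call
the Python API, it assumes that *start* and *end* are clipped the way Python
slice indices are, and the names ``clip_slice`` and ``tailmatch`` are
hypothetical.

```c
#include <string.h>

/* Clip slice bounds to a string of length `len`, treating negative
   values as offsets from the end (an assumption modelled on Python
   slicing). */
void clip_slice(long len, long *start, long *end)
{
    if (*end > len) {
        *end = len;
    } else if (*end < 0) {
        *end += len;
        if (*end < 0)
            *end = 0;
    }
    if (*start < 0) {
        *start += len;
        if (*start < 0)
            *start = 0;
    }
}

/* Return 1 if `substr` matches the head (direction < 0) or the tail
   (direction > 0) of str[start:end], and 0 otherwise. */
int tailmatch(const char *str, const char *substr,
              long start, long end, int direction)
{
    long len = (long)strlen(str);
    long sublen = (long)strlen(substr);

    clip_slice(len, &start, &end);
    if (end - start < sublen)
        return 0;                     /* slice too short to match */
    if (direction > 0)
        start = end - sublen;         /* compare against the tail */
    return memcmp(str + start, substr, (size_t)sublen) == 0;
}
```

For example, ``tailmatch("hello world", "lo", 0, 5, 1)`` returns 1 because the
slice ``"hello"`` ends with ``"lo"``, while a prefix match of ``"world"``
against the whole string returns 0.
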
.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


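The effect of *direction* on the result of :cfunc:`PyUnicode_Find` can be
shown with a plain C model. This sketch does not use the Python API. It assumes
the caller has already clipped the bounds so that
``0 <= start <= end <= strlen(str)``, and it assumes the returned index is
relative to the start of the whole string rather than the slice. The name
``find_in_slice`` is hypothetical. The model only returns an index or ``-1``;
the separate ``-2`` error value of the real function has no counterpart here.

```c
#include <string.h>

/* Search str[start:end] for `substr`.
   direction > 0: return the lowest matching index (forward search).
   direction <= 0: return the highest matching index (backward search).
   Returns -1 when there is no match. */
long find_in_slice(const char *str, const char *substr,
                   long start, long end, int direction)
{
    long sublen = (long)strlen(substr);
    long i;

    if (end - start < sublen)
        return -1;
    if (direction > 0) {
        for (i = start; i + sublen <= end; i++)
            if (memcmp(str + i, substr, (size_t)sublen) == 0)
                return i;
    } else {
        for (i = end - sublen; i >= start; i--)
            if (memcmp(str + i, substr, (size_t)sublen) == 0)
                return i;
    }
    return -1;
}
```

Searching ``"abcabc"`` for ``"bc"`` over the full range gives 1 going forward
and 4 going backward. Narrowing *end* to 5 makes the backward search return 1,
since the match at index 4 no longer fits inside the slice.
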
.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.


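"Non-overlapping" means that once a match is counted, the scan resumes after
it rather than at the next character. The plain C model below illustrates this
for :cfunc:`PyUnicode_Count`. It is a sketch that does not call the Python API.
It assumes ``0 <= start <= end <= strlen(str)``, and it does not model counting
an empty *substr*; in this sketch only, that case returns ``-1``. The name
``count_in_slice`` is hypothetical.

```c
#include <string.h>

/* Count non-overlapping occurrences of `substr` in str[start:end].
   Returns -1 for an empty `substr`, which this sketch does not model. */
long count_in_slice(const char *str, const char *substr,
                    long start, long end)
{
    long sublen = (long)strlen(substr);
    long count = 0;
    long i = start;

    if (sublen == 0)
        return -1;
    while (i + sublen <= end) {
        if (memcmp(str + i, substr, (size_t)sublen) == 0) {
            count++;
            i += sublen;              /* skip the match: no overlaps */
        } else {
            i++;
        }
    }
    return count;
}
```

Counting ``"aa"`` in ``"aaaa"`` therefore yields 2, not 3, because the match
starting at index 1 overlaps the one already counted at index 0.
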
.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


1101
1102.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
1103
1104 Return a new string object from *format* and *args*; this is analogous to
1105 ``format % args``. The *args* argument must be a tuple.
1106
1107
1108.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
1109
1110 Check whether *element* is contained in *container* and return true or false
1111 accordingly.
1112
1113 *element* has to coerce to a one element Unicode string. ``-1`` is returned if
1114 there was an error.