.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


These are the basic Unicode object types used for the Unicode implementation in
Python:

.. % --- Unicode Type -------------------------------------------------------


.. ctype:: Py_UNICODE

   This type represents the storage type which Python uses internally as the
   basis for holding Unicode ordinals. Python's default builds use a 16-bit type
   for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
   possible to build a UCS4 version of Python (most recent Linux distributions come
   with UCS4 builds of Python). These builds then use a 32-bit type for
   :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
   where :ctype:`wchar_t` is available and compatible with the chosen Python
   Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.

.. % --- Unicode character properties ---------------------------------------


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.
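
Because the conversion macros above signal failure with a sentinel instead of an
exception, callers have to test every result. The following standalone sketch
shows that pattern without using the Python headers. ``to_decimal`` is a
hypothetical stand-in for ``Py_UNICODE_TODECIMAL`` that only knows ASCII digits
and the Arabic-Indic digits U+0660 to U+0669, and ``parse_decimal`` is an
illustrative helper, not part of the API. Overflow is not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Py_UNICODE_TODECIMAL: knows only ASCII digits
   and ARABIC-INDIC DIGIT ZERO..NINE (U+0660..U+0669). */
static int to_decimal(unsigned int ch)
{
    if (ch >= '0' && ch <= '9')
        return (int)(ch - '0');
    if (ch >= 0x0660 && ch <= 0x0669)
        return (int)(ch - 0x0660);
    return -1;  /* not a decimal character; no exception is raised */
}

/* Parse a buffer of code points as a non-negative decimal number.
   Return -1 if the buffer is empty or holds a non-decimal character. */
long parse_decimal(const unsigned int *s, size_t size)
{
    long value = 0;
    size_t i;

    assert(s != NULL || size == 0);
    if (size == 0)
        return -1;
    for (i = 0; i < size; i++) {
        int d = to_decimal(s[i]);
        if (d < 0)
            return -1;          /* propagate the sentinel to the caller */
        value = value * 10 + d;
    }
    return value;
}
```

Code using the real macros would follow the same shape: check for ``-1`` (or
``-1.0`` for ``Py_UNICODE_TONUMERIC``) before using the result.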

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. % --- Plain Py_UNICODE ---------------------------------------------------


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data. The buffer is copied into the
   new object. If the buffer is not *NULL*, the return value might be a shared
   object. Therefore, modification of the resulting Unicode object is only allowed
   when *u* is *NULL*.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file
:file:`wchar.h`, Python can interface directly to this type using the following
functions. Support is optimized if Python's own :ctype:`Py_UNICODE` type is
identical to the system's :ctype:`wchar_t`.

.. % --- wchar_t support for platforms which support it ---------------------


.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.
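
One way to meet the termination requirement described for
:cfunc:`PyUnicode_AsWideChar` is to reserve the last slot of the buffer and
write the terminator yourself. The sketch below is standalone C and does not
call the Python API: ``copy_wide`` is a hypothetical stand-in that assumes the
contract described above (copy at most *size* characters, terminate only if
there is room), and ``copy_wide_terminated`` is an illustrative wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract described above: copy at most
   size characters and add a terminator only if there is room for it. */
long copy_wide(const wchar_t *src, size_t src_len, wchar_t *w, size_t size)
{
    size_t n = src_len < size ? src_len : size;

    assert(src != NULL && w != NULL);
    wmemcpy(w, src, n);
    if (n < size)
        w[n] = L'\0';
    return (long)n;
}

/* Wrapper that guarantees termination by reserving the last slot. */
long copy_wide_terminated(const wchar_t *src, size_t src_len,
                          wchar_t *w, size_t size)
{
    long n;

    assert(size > 0);
    n = copy_wide(src, src_len, w, size - 1);
    w[n] = L'\0';               /* always within bounds: n <= size - 1 */
    return n;
}
```

With a four-element buffer and a five-character source, the plain copy fills
all four slots without a terminator, while the wrapper copies three characters
and terminates the string.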


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the built-in :func:`unicode`
Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string; on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.

These are the generic codec APIs:

.. % --- Generic Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.

These are the UTF-8 codec APIs:

.. % --- UTF-8 Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
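
To make the meaning of *consumed* concrete, the following standalone sketch
computes how many leading bytes of a buffer form complete UTF-8 sequences, so
that a truncated sequence at the end is left for the next chunk. It does not
use the Python API and does not validate continuation bytes as a real decoder
would. The names ``utf8_seq_len`` and ``utf8_complete_prefix`` are illustrative
assumptions, not part of the API.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the sequence announced by a lead byte, or 0 if the byte is a
   continuation byte or not a valid lead byte. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Return the number of bytes at the start of s that consist of complete
   sequences; a truncated sequence at the very end is not counted. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;

    assert(s != NULL || size == 0);
    while (i < size) {
        size_t n = utf8_seq_len(s[i]);
        if (n == 0)
            n = 1;              /* invalid byte: step over it, a real decoder
                                   would hand it to the error handler */
        if (n > size - i)
            break;              /* sequence continues past the buffer end */
        i += n;
    }
    return i;
}
```

A caller feeding data in chunks would keep the bytes after the returned offset
and prepend them to the next chunk, which is the use case the stateful decoder
functions are meant for.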


.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the UTF-32 codec APIs:

.. % --- UTF-32 Codecs ------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate
   pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
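
The surrogate pairs mentioned for narrow builds follow the standard UTF-16
arithmetic. The standalone sketch below shows the split for a single code
point above U+FFFF. ``to_surrogate_pair`` is an illustrative name, not a
function of the Python API.

```c
#include <assert.h>
#include <stddef.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
void to_surrogate_pair(unsigned long cp,
                       unsigned short *high, unsigned short *low)
{
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    assert(high != NULL && low != NULL);
    cp -= 0x10000UL;                                /* 20 bits remain */
    *high = (unsigned short)(0xD800 + (cp >> 10));  /* upper 10 bits */
    *low = (unsigned short)(0xDC00 + (cp & 0x3FF)); /* lower 10 bits */
}
```

For example, U+1F600 becomes the pair ``0xD83D``, ``0xDE00``, so on a narrow
build such a character occupies two :ctype:`Py_UNICODE` units.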


.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
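
The effect of the *byteorder* argument on the output can be illustrated with a
standalone encoder that does not depend on the Python headers. This is a
sketch of the rules listed above, not the interpreter's implementation, and
the names ``host_is_little_endian``, ``put32`` and ``encode_utf32`` are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void)
{
    unsigned int one = 1;
    return *(unsigned char *)&one == 1;
}

/* Write v as four bytes in the requested order. */
static void put32(unsigned char *p, unsigned long v, int little)
{
    int i;
    for (i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        p[i] = (unsigned char)((v >> shift) & 0xFF);
    }
}

/* Encode size code points into out, which must hold 4 * (size + 1) bytes.
   byteorder: -1 little endian, 0 native order with BOM, 1 big endian.
   Return the number of bytes written. */
size_t encode_utf32(const unsigned long *s, size_t size,
                    unsigned char *out, int byteorder)
{
    int little = byteorder == 0 ? host_is_little_endian() : byteorder < 0;
    size_t n = 0, i;

    assert(out != NULL && (s != NULL || size == 0));
    if (byteorder == 0) {
        put32(out, 0xFEFFUL, little);   /* BOM is written in native mode only */
        n += 4;
    }
    for (i = 0; i < size; i++, n += 4)
        put32(out + n, s[i], little);
    return n;
}
```

Encoding a single character therefore yields four bytes with an explicit byte
order and eight bytes (BOM plus character) in native mode.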


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


These are the UTF-16 codec APIs:

.. % --- UTF-16 Codecs ------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
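
The BOM rule described for the UTF-16 decoder can be sketched in a few lines of
standalone C. The function below only inspects the first two bytes and updates
the byte order value the way the text describes; it is not the interpreter's
decoder, and ``utf16_check_bom`` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the start of a UTF-16 buffer.  If *byteorder is 0 and a BOM is
   present, store -1 (little endian) or 1 (big endian) in *byteorder and
   return 2, the number of BOM bytes to skip.  Otherwise return 0. */
size_t utf16_check_bom(const unsigned char *s, size_t size, int *byteorder)
{
    assert(byteorder != NULL && (s != NULL || size == 0));
    if (*byteorder != 0 || size < 2)
        return 0;               /* explicit order: a BOM is ordinary data */
    if (s[0] == 0xFF && s[1] == 0xFE) {
        *byteorder = -1;
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {
        *byteorder = 1;
        return 2;
    }
    return 0;                   /* no BOM: stay in native order */
}
```

When the caller forces a byte order (``-1`` or ``1``), the function skips
nothing, which matches the statement above that the BOM is then copied to the
output as a regular character.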


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size* and an :ctype:`int *`
      type for *consumed*. This might require changes in your code for
      properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

These are the "Unicode Escape" codec APIs:

.. % --- Unicode-Escape Codecs ----------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the "Raw Unicode Escape" codec APIs:

.. % --- Raw-Unicode-Escape Codecs ------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.
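
Since Latin-1 covers exactly the ordinals 0 to 255, a strict encoder amounts to
a range check followed by a narrowing copy. The standalone sketch below
illustrates that idea without the Python API. ``encode_latin1_strict`` and its
*bad_index* out-parameter are assumptions made for the example; the real codec
reports the failure through an exception instead.

```c
#include <assert.h>
#include <stddef.h>

/* Encode size code points as Latin-1 into out (at least size bytes).
   Return 0 on success.  Return -1 and store the offending position in
   *bad_index if an ordinal above 255 is found. */
int encode_latin1_strict(const unsigned int *s, size_t size,
                         unsigned char *out, size_t *bad_index)
{
    size_t i;

    assert(out != NULL && bad_index != NULL && (s != NULL || size == 0));
    for (i = 0; i < size; i++) {
        if (s[i] > 0xFF) {
            *bad_index = i;     /* outside the first 256 ordinals */
            return -1;
        }
        out[i] = (unsigned char)s[i];
    }
    return 0;
}
```

The ASCII codec described next differs only in the limit: anything above 127
would be rejected.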

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the mapping codec APIs:

.. % --- Character Map Codecs -----------------------------------------------

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the ``__getitem__`` mapping
interface.

If a character lookup fails with a LookupError, the character is copied as-is,
meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.
707

.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary mapping bytes, or a Unicode string that is treated as a lookup
   table. Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

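
   As a minimal sketch of the lookup-table form, the hypothetical helper
   ``decode_with_table`` below decodes a byte buffer through a three-character
   Unicode string, so that byte value *i* becomes the character at index *i*.
   It assumes :cfunc:`PyUnicode_FromString` is available (Python 2.6 or later)
   and passes *NULL* for *errors* to get the default error handling; it is an
   illustration, not a recommended codec implementation::

      #include <Python.h>

      /* Hypothetical helper: decode `size` bytes of `data` through a Unicode
         lookup table.  Returns a new reference, or NULL with an exception set. */
      static PyObject *
      decode_with_table(const char *data, Py_ssize_t size)
      {
          PyObject *table, *result;

          /* Lookup table: byte 0 -> 'a', byte 1 -> 'b', byte 2 -> 'c'. */
          table = PyUnicode_FromString("abc");
          if (table == NULL)
              return NULL;

          /* Bytes with no entry in the table count as "undefined mapping"
             and are reported through the error handler. */
          result = PyUnicode_DecodeCharmap(data, size, table, NULL);
          Py_DECREF(table);
          return result;
      }

   Calling ``decode_with_table("\x00\x02\x01", 3)`` would be expected to yield
   the Unicode string ``acb``.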

.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
   use the default error handling.

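
   The table rules for :cfunc:`PyUnicode_Translate` can be sketched with a small
   dictionary. The helper name ``translate_demo`` is hypothetical, and the
   table is built with :cfunc:`Py_BuildValue` purely for brevity: it maps the
   ordinal of ``a`` to the ordinal of ``A`` and maps ``b`` to None, which
   deletes it. Every other character has no entry and is copied unchanged::

      #include <Python.h>

      /* Hypothetical helper: upper-case 'a' and delete 'b' in a Unicode object.
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      translate_demo(PyObject *text)
      {
          PyObject *table, *result;

          /* Keys are Unicode ordinals; values are ordinals, or None to delete. */
          table = Py_BuildValue("{i:i,i:O}", 'a', 'A', 'b', Py_None);
          if (table == NULL)
              return NULL;

          /* errors == NULL selects the default error handling. */
          result = PyUnicode_Translate(text, table, NULL);
          Py_DECREF(table);
          return result;
      }

   Applied to the Unicode string ``abcab``, this sketch would be expected to
   return ``AcA``.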

.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.

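
   :cfunc:`PyUnicode_Split` and :cfunc:`PyUnicode_Join` are often used as a
   pair. The following sketch, with the hypothetical helper name ``hyphenate``,
   splits a string at whitespace and re-joins the pieces with a hyphen. It
   assumes :cfunc:`PyUnicode_FromString` is available (Python 2.6 or later) to
   create the separator::

      #include <Python.h>

      /* Hypothetical helper: replace each run of whitespace between words
         with a single hyphen.  Returns a new reference, or NULL on error. */
      static PyObject *
      hyphenate(PyObject *text)
      {
          PyObject *parts, *sep, *result;

          /* sep == NULL splits at whitespace; maxsplit == -1 sets no limit. */
          parts = PyUnicode_Split(text, NULL, -1);
          if (parts == NULL)
              return NULL;

          sep = PyUnicode_FromString("-");
          if (sep == NULL) {
              Py_DECREF(parts);
              return NULL;
          }

          result = PyUnicode_Join(sep, parts);
          Py_DECREF(sep);
          Py_DECREF(parts);
          return result;
      }

   Both intermediate objects are new references owned by the helper, so they
   are released before returning.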

.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

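
   The two *direction* values can be wrapped in small helpers. The names
   ``starts_with`` and ``ends_with`` are hypothetical, and passing
   ``PY_SSIZE_T_MAX`` as *end* assumes that an oversized end index is clipped
   to the string length, as it would be in a Python slice::

      #include <Python.h>

      /* Hypothetical helper: 1 if text starts with prefix, 0 if not,
         -1 if an error occurred. */
      static int
      starts_with(PyObject *text, PyObject *prefix)
      {
          /* direction == -1 requests a prefix match. */
          return (int)PyUnicode_Tailmatch(text, prefix, 0, PY_SSIZE_T_MAX, -1);
      }

      /* Hypothetical helper: 1 if text ends with suffix, 0 if not,
         -1 if an error occurred. */
      static int
      ends_with(PyObject *text, PyObject *suffix)
      {
          /* direction == 1 requests a suffix match. */
          return (int)PyUnicode_Tailmatch(text, suffix, 0, PY_SSIZE_T_MAX, 1);
      }

   Callers still need to check for ``-1`` before treating the result as a
   boolean.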

.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

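
   Because :cfunc:`PyUnicode_Find` uses two distinct negative return values,
   callers have to separate "not found" from "error". A minimal sketch with the
   hypothetical helper ``find_forward``, again assuming that an oversized *end*
   is clipped to the string length::

      #include <Python.h>

      /* Hypothetical helper: forward search for needle in text.
         Returns 1 and stores the match position in *index if found,
         0 if there is no match, and -1 if an exception has been set. */
      static int
      find_forward(PyObject *text, PyObject *needle, Py_ssize_t *index)
      {
          Py_ssize_t pos = PyUnicode_Find(text, needle, 0, PY_SSIZE_T_MAX, 1);

          if (pos == -2)
              return -1;   /* error: an exception has been set */
          if (pos == -1)
              return 0;    /* no match */
          *index = pos;
          return 1;
      }

   Testing only ``pos < 0`` would silently treat an error as a missing match
   and leave the exception pending.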

.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.

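
   :cfunc:`PyUnicode_Count` and :cfunc:`PyUnicode_Replace` can be combined when
   the caller also wants to know how many replacements are made. The helper
   name ``replace_all`` and its out-parameter are hypothetical, and the
   oversized *end* argument assumes slice-style clipping::

      #include <Python.h>

      /* Hypothetical helper: replace every occurrence of old_sub in text with
         new_sub and report the number of occurrences through *count.
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      replace_all(PyObject *text, PyObject *old_sub, PyObject *new_sub,
                  Py_ssize_t *count)
      {
          Py_ssize_t n = PyUnicode_Count(text, old_sub, 0, PY_SSIZE_T_MAX);

          if (n == -1)
              return NULL;   /* error: an exception has been set */
          *count = n;

          /* maxcount == -1 replaces all occurrences. */
          return PyUnicode_Replace(text, old_sub, new_sub, -1);
      }

   Counting first costs an extra pass over the string; it is shown here only to
   illustrate both calls together.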

.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.

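
   Since the result is an object rather than an integer, C callers usually
   reduce it to a flag. A minimal sketch, using the hypothetical helper
   ``unicode_equal`` and assuming the returned object is a new reference that
   the caller must release::

      #include <Python.h>

      /* Hypothetical helper: 1 if left == right, 0 if not (including the
         Py_NotImplemented case), -1 if an exception was raised. */
      static int
      unicode_equal(PyObject *left, PyObject *right)
      {
          PyObject *res = PyUnicode_RichCompare(left, right, Py_EQ);
          int equal;

          if (res == NULL)
              return -1;

          /* Py_True is a singleton, so an identity test is sufficient. */
          equal = (res == Py_True);
          Py_DECREF(res);
          return equal;
      }

   Folding :const:`Py_NotImplemented` into ``0`` is a simplification made for
   this sketch; code that needs to fall back to another comparison should test
   for it explicitly.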

.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.

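
   As a short sketch of the tuple requirement, the hypothetical helper
   ``format_count`` below builds a one-element tuple with
   :cfunc:`Py_BuildValue` and applies it to a ``%d`` format. It assumes
   :cfunc:`PyUnicode_FromString` is available (Python 2.6 or later) to create
   the format object::

      #include <Python.h>

      /* Hypothetical helper: the equivalent of  u"%d items" % (count,).
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      format_count(long count)
      {
          PyObject *format, *args, *result;

          format = PyUnicode_FromString("%d items");
          if (format == NULL)
              return NULL;

          /* The parentheses in the format string make Py_BuildValue
             return a tuple, here with a single element. */
          args = Py_BuildValue("(l)", count);
          if (args == NULL) {
              Py_DECREF(format);
              return NULL;
          }

          result = PyUnicode_Format(format, args);
          Py_DECREF(args);
          Py_DECREF(format);
          return result;
      }

   With ``count`` equal to 3, the result would be expected to be the string
   ``3 items``.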

.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.