.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. % --- Unicode Type -------------------------------------------------------


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a
   32-bit type for :ctype:`Py_UNICODE` and store Unicode data internally as
   UCS4. On platforms where :ctype:`wchar_t` is available and compatible with
   the chosen Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef
   alias for :ctype:`wchar_t` to enhance native platform compatibility. On all
   other platforms, :ctype:`Py_UNICODE` is a typedef alias for either
   :ctype:`unsigned short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList(void)

   Clear the free list. Return the total number of freed items.

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.

.. % --- Unicode character properties ---------------------------------------


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.


.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. % --- Plain Py_UNICODE ---------------------------------------------------


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data. The buffer is copied into
   the new object. If the buffer is not *NULL*, the return value might be a
   shared object. Therefore, modification of the resulting Unicode object is
   only allowed when *u* is *NULL*.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. *u* may also be *NULL*, which causes the
   contents to be undefined. It is the user's responsibility to fill in the
   needed data. The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.


.. cfunction:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`ascii`.                 |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Str`.         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file ``wchar.h``,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.

.. % --- wchar_t support for platforms which support it ---------------------


.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Passing ``-1`` as the *size* indicates that the function must itself compute
   the length, using ``wcslen``.
   Return *NULL* on failure.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.
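
Because the copied string may be left unterminated, callers commonly reserve
one extra slot and write the terminator themselves. The following standalone
sketch illustrates that pattern without the Python headers. ``copy_wide`` and
``copy_wide_terminated`` are hypothetical helpers that only mimic the copy
contract described above; they are not part of the C API.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the copy contract: copies at most `size`
   characters and adds a terminator only if there is room left for it. */
static ptrdiff_t copy_wide(const wchar_t *src, size_t srclen,
                           wchar_t *dst, size_t size)
{
    size_t n = srclen < size ? srclen : size;
    wmemcpy(dst, src, n);
    if (n < size)
        dst[n] = L'\0';   /* terminator fits only when the buffer is larger */
    return (ptrdiff_t)n;
}

/* Caller-side pattern: keep one slot in reserve and terminate explicitly. */
static ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t srclen,
                                      wchar_t *dst, size_t capacity)
{
    ptrdiff_t n;
    assert(capacity > 0);
    n = copy_wide(src, srclen, dst, capacity - 1);
    dst[n] = L'\0';       /* always safe: n is at most capacity - 1 */
    return n;
}
```

With the real function the same idea would read
``n = PyUnicode_AsWideChar(u, buf, capacity - 1)`` followed by ``buf[n] = 0``,
once ``n`` has been checked against ``-1``.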


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the built-in :func:`str`
string object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
UTF-8. File system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.

These are the generic codec APIs:

.. % --- Generic Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   bytes object. *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`encode` method. The codec
   to be used is looked up using the Python codec registry. Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`encode` method. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

These are the UTF-8 codec APIs:

.. % --- UTF-8 Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.
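
To make the meaning of *consumed* concrete, the following standalone sketch
computes how many leading bytes of a buffer form complete UTF-8 sequences, so
that a trailing partial sequence can be kept for the next chunk of input. It is
an illustration under simplifying assumptions, not the decoder's actual code:
``utf8_seq_len`` and ``utf8_complete_prefix`` are hypothetical names, and
continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a UTF-8 lead byte. Invalid lead bytes count as 1 here
   because error handling is outside the scope of this sketch. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Return how many leading bytes of s form complete sequences. */
static size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;
    assert(s != NULL || size == 0);
    while (i < size) {
        size_t n = utf8_seq_len(s[i]);
        if (n > size - i)
            break;        /* trailing sequence is incomplete: stop before it */
        i += n;
    }
    return i;
}
```

A streaming reader would decode only the returned number of bytes and prepend
the remaining ones to the next block of input before decoding again.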


.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the UTF-32 codec APIs:

.. % --- UTF-32 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from the UTF-32 encoded buffer string *s* and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first four bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate
   pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.
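
The surrogate pairs mentioned for narrow builds follow the standard UTF-16
arithmetic. The standalone sketch below shows how one code point becomes one or
two 16-bit units; ``to_utf16_units`` is a hypothetical helper written for this
illustration, not a function of the C API.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point into 16-bit units; returns the number of units written. */
static int to_utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* BMP: stored as a single unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

On a UCS4 build the same code point occupies a single :ctype:`Py_UNICODE`, which
is one reason buffer lengths differ between the two build variants.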


.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. If *byteorder* is not ``0``, output is written according to the
   following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.
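
The effect of the three *byteorder* values on the output can be sketched in
plain C. The code below is a simplified model of the rule described above, not
the codec's implementation: ``encode_utf32``, ``put_u32`` and
``native_is_little_endian`` are hypothetical helpers, the input is assumed to
hold whole code points, and the caller is assumed to provide a large enough
output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int native_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Store v as four bytes in the requested order. */
static void put_u32(unsigned char *p, uint32_t v, int little)
{
    int i;
    for (i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        p[i] = (unsigned char)(v >> shift);
    }
}

/* Write n code points (plus a BOM when byteorder == 0) into out, which must
   hold at least 4 * (n + 1) bytes; returns the number of bytes written. */
static size_t encode_utf32(const uint32_t *s, size_t n, int byteorder,
                           unsigned char *out)
{
    int little;
    size_t pos = 0, i;
    assert(byteorder >= -1 && byteorder <= 1);
    little = byteorder == 0 ? native_is_little_endian() : byteorder < 0;
    if (byteorder == 0) {
        put_u32(out, 0xFEFF, little);     /* BOM only in native-order mode */
        pos = 4;
    }
    for (i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, s[i], little);
    return pos;
}
```

Passing ``-1`` or ``1`` therefore yields output that the receiver must already
know the byte order of, while ``0`` produces self-describing output at the cost
of four extra bytes.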


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python bytes object using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.


These are the UTF-16 codec APIs:

.. % --- UTF-16 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from the UTF-16 encoded buffer string *s* and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first two bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.
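
The BOM rule for decoding can be modelled in a few lines of standalone C. This
is a sketch of the behaviour described above, assuming that a BOM is only
interpreted when the caller asked for native order; ``utf16_check_bom`` is a
hypothetical helper, not part of the C API.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the start of a UTF-16 buffer. Returns the number of BOM bytes to
   skip (0 or 2) and updates *byteorder when a BOM selects the order. */
static size_t utf16_check_bom(const unsigned char *s, size_t size,
                              int *byteorder)
{
    assert(s != NULL && byteorder != NULL);
    if (*byteorder != 0 || size < 2)
        return 0;           /* explicit order requested, or no room for a BOM */
    if (s[0] == 0xFF && s[1] == 0xFE) {
        *byteorder = -1;    /* U+FEFF stored low byte first: little endian */
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {
        *byteorder = 1;     /* U+FEFF stored high byte first: big endian */
        return 2;
    }
    return 0;               /* no BOM: stay in native order */
}
```

A caller would initialise ``int byteorder = 0``, pass its address to the
decoder, and read the variable afterwards to learn which order was detected.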


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*. If *byteorder* is not ``0``, output is written according to the
   following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python bytes object using the UTF-16 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.

These are the "Unicode Escape" codec APIs:

.. % --- Unicode-Escape Codecs ----------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python bytes object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   bytes object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the "Raw Unicode Escape" codec APIs:

.. % --- Raw-Unicode-Escape Codecs ------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python bytes object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the mapping codec APIs:

.. % --- Character Map Codecs -----------------------------------------------

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.


.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary mapping bytes, or a Unicode string, which is treated as a
   lookup table. Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
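
The lookup-table form of decoding can be pictured with a small standalone
sketch. It models the two "undefined mapping" cases named above (an index past
the end of the table and the value U+FFFE) and simply reports failure instead
of consulting an error handler. ``charmap_decode`` and ``UNDEFINED`` are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDEFINED 0xFFFE   /* table value meaning "undefined mapping" */

/* Decode size bytes through table; returns the number of characters written
   to out, or -1 at the first undefined mapping. */
static ptrdiff_t charmap_decode(const unsigned char *s, size_t size,
                                const uint16_t *table, size_t table_len,
                                uint16_t *out)
{
    size_t i;
    assert(table != NULL && out != NULL);
    for (i = 0; i < size; i++) {
        /* A byte that indexes past the table, or a U+FFFE entry, has no
           mapping; a real codec would apply the *errors* handling here. */
        if (s[i] >= table_len || table[s[i]] == UNDEFINED)
            return -1;
        out[i] = table[s[i]];
    }
    return (ptrdiff_t)i;
}
```

A table with 256 entries covers every possible byte, which is how a complete
8-bit code page would typically be expressed in this scheme.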
718
719
.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python bytes object. Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python bytes object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
   a Python bytes object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
   use the default error handling.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)

   Compare a Unicode object, *uni*, with *string* and return -1, 0, 1 for less
   than, equal, and greater than, respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.


.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place. The argument must be the address of a
   pointer variable pointing to a Python unicode string object. If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :cfunc:`PyUnicode_FromString` and
   :cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
   that has been interned, or a new ("owned") reference to an earlier interned
   string object with the same value.
