.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. % --- Unicode Type -------------------------------------------------------


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a
   32-bit type for :ctype:`Py_UNICODE` and store Unicode data internally as
   UCS4. On platforms where :ctype:`wchar_t` is available and compatible with
   the chosen Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef
   alias for :ctype:`wchar_t` to enhance native platform compatibility. On all
   other platforms, :ctype:`Py_UNICODE` is a typedef alias for either
   :ctype:`unsigned short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object, counted in :ctype:`Py_UNICODE` elements. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList(void)

   Clear the free list. Return the total number of freed items.

Unicode provides many different character properties. The most often needed ones
are available through these macros, which are mapped to C functions depending on
the Python configuration.

.. % --- Unicode character properties ---------------------------------------


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.


.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", except for the ASCII space (0x20), which
   is considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a non-negative decimal integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. % --- Plain Py_UNICODE ---------------------------------------------------


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data. The buffer is copied into
   the new object. If the buffer is not *NULL*, the return value might be a
   shared object. Therefore, modification of the resulting Unicode object is
   only allowed when *u* is *NULL*.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. *u* may also be *NULL*, which causes the
   contents to be undefined. It is the user's responsibility to fill in the
   needed data. The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.


.. cfunction:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`ascii`.                 |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Str`.         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.

.. % --- wchar_t support for platforms which support it ---------------------


.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Passing -1 as the size indicates that the function must itself compute the
   length, using ``wcslen``.
   Return *NULL* on failure.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of builtin codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the builtin :func:`str`
string object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used,
which is UTF-8. File system calls should use
:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string; on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec. Default error handling for all
builtin codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.

These are the generic codec APIs:

.. % --- Generic Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` builtin function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   bytes object. *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`encode` method. The codec
   to be used is looked up using the Python codec registry. Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`encode` method. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

These are the UTF-8 codec APIs:

.. % --- UTF-8 Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.
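
The effect of a non-*NULL* *consumed* can be pictured with a small standalone
model. The sketch below is not CPython's code and does not use the Python API;
``utf8_complete_prefix`` is a hypothetical helper that only reports how many
leading bytes of a buffer form complete UTF-8 sequences, which is the kind of
count a stateful decoder would store in *consumed*. It assumes the bytes before
the final sequence are well formed and does not validate them.

```c
#include <stddef.h>

/* Return how many leading bytes of s[0..size) form complete UTF-8
   sequences. A truncated sequence at the very end is excluded.
   Hypothetical helper: bytes before the last sequence are not validated. */
static size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t back = 0;

    /* Walk back over at most three trailing continuation bytes (10xxxxxx). */
    while (back < 3 && back < size && (s[size - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == size)
        return size;            /* no lead byte in reach: leave to the decoder */

    unsigned char lead = s[size - 1 - back];
    size_t need;                /* total length announced by the lead byte */
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return size;            /* invalid lead byte: leave to the decoder */

    /* The last sequence is incomplete if it announces more bytes than exist. */
    if (need > back + 1)
        return size - (back + 1);
    return size;
}
```

A caller reading from a stream would typically keep the bytes past the returned
count and prepend them to the next chunk before decoding again.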


.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the UTF-32 codec APIs:

.. % --- UTF-32 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first four bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate
   pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.
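
To make the remark about narrow builds concrete, the following standalone
sketch shows the standard UTF-16 arithmetic that turns one code point above
U+FFFF into a surrogate pair. It is an illustration only, not the decoder's
implementation; ``to_utf16_units`` is a hypothetical name, and ``uint16_t``
merely stands in for a 16-bit :ctype:`Py_UNICODE`.

```c
#include <stdint.h>

/* Split a code point into 16-bit units, the form in which a narrow build
   would hold it. Returns the number of units written to out (1 or 2), or 0
   for values with no UTF-16 form (above U+10FFFF, or a surrogate itself). */
static int to_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;              /* BMP: stored unchanged */
        return 1;
    }
    cp -= 0x10000;                          /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the two units 0xD83D and 0xDE00, so the resulting
string has length 2 on a narrow build.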


.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. If *byteorder* is not ``0``, output is written according to the
   following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python bytes object using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.


These are the UTF-16 codec APIs:

.. % --- UTF-16 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first two bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.
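
The BOM rule described above can be sketched in isolation. The helper below is
a hypothetical, standalone model (``utf16_check_bom`` is not part of the Python
API): it looks at the first two bytes only when the caller asked for native
order, updates the byte order to ``-1`` or ``1`` if a BOM is found, and reports
how many bytes the decoder would skip.

```c
#include <stddef.h>

/* Inspect the first two bytes of a UTF-16 buffer. If *byteorder is 0 and
   the bytes are a BOM, set *byteorder to -1 (little endian) or 1 (big
   endian) and return 2, the number of bytes to skip. Otherwise leave
   *byteorder untouched and return 0. */
static size_t utf16_check_bom(const unsigned char *s, size_t size, int *byteorder)
{
    if (*byteorder != 0 || size < 2)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE) {     /* U+FEFF stored low byte first */
        *byteorder = -1;
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {     /* U+FEFF stored high byte first */
        *byteorder = 1;
        return 2;
    }
    return 0;
}
```

With an explicit byte order of ``-1`` or ``1`` the model never consumes a
leading BOM, which mirrors the condition "the specified byte order is native
order" in the description.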


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.
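
As a rough illustration of the two cases named above (an odd trailing byte and
a split surrogate pair), the standalone sketch below computes how many bytes of
a UTF-16 buffer are complete. ``utf16_complete_prefix`` is a hypothetical
helper, not an API function, and it assumes the byte order is already known and
passed in as the flag *little*.

```c
#include <stddef.h>

/* Return how many leading bytes of a UTF-16 buffer can be decoded now.
   little is nonzero for little-endian data. An odd trailing byte and a
   trailing high surrogate (0xD800..0xDBFF) are both held back. */
static size_t utf16_complete_prefix(const unsigned char *s, size_t size, int little)
{
    size_t usable = size - (size % 2);      /* drop an odd trailing byte */

    if (usable >= 2) {
        /* High-order byte of the last complete 16-bit unit. */
        unsigned char hi = little ? s[usable - 1] : s[usable - 2];
        if ((hi & 0xFC) == 0xD8)            /* high surrogate awaiting its pair */
            usable -= 2;
    }
    return usable;
}
```

The bytes beyond the returned count would be kept and fed to the decoder again
together with the next chunk of input.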


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*. If *byteorder* is not ``0``, output is written according to the
   following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.
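
The *byteorder* convention on the encoding side can be modelled without the
Python API. The sketch below is a hypothetical helper (``utf16_units_to_bytes``
is an invented name) that serialises 16-bit code units: ``-1`` and ``1`` force
an explicit order without a BOM, while ``0`` selects the host's order and
writes U+FEFF first. Host byte order is detected at run time, which is an
assumption of this sketch rather than something the text specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Write n 16-bit code units to out using the byteorder convention:
   -1 little endian, 1 big endian, 0 native order preceded by a BOM.
   out must have room for 2 * (n + 1) bytes. Returns the bytes written. */
static size_t utf16_units_to_bytes(const uint16_t *units, size_t n,
                                   int byteorder, unsigned char *out)
{
    const uint16_t probe = 1;
    int little = *(const unsigned char *)&probe == 1;  /* host byte order */
    size_t pos = 0;

    if (byteorder == -1)
        little = 1;
    else if (byteorder == 1)
        little = 0;

    if (byteorder == 0) {                   /* BOM only in native mode */
        out[pos++] = little ? 0xFF : 0xFE;
        out[pos++] = little ? 0xFE : 0xFF;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        unsigned char hi = (unsigned char)(units[i] >> 8);
        out[pos++] = little ? lo : hi;
        out[pos++] = little ? hi : lo;
    }
    return pos;
}
```

Encoding the units 0x0041 and 0x20AC this way gives ``41 00 AC 20`` for
``-1``, ``00 41 20 AC`` for ``1``, and six bytes starting with a BOM for ``0``.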


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python bytes object using the UTF-16 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.

These are the "Unicode Escape" codec APIs:

.. % --- Unicode-Escape Codecs ----------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python bytes object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   bytes object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the "Raw Unicode Escape" codec APIs:

.. % --- Raw-Unicode-Escape Codecs ------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python bytes object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
   return a Python bytes object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the mapping codec APIs:

.. % --- Character Map Codecs -----------------------------------------------

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a LookupError, the character is copied as-is,
meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.


712.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
713
714 Create a Unicode object by decoding *size* bytes of the encoded string *s* using
715 the given *mapping* object. Return *NULL* if an exception was raised by the
716 codec. If *mapping* is *NULL* latin-1 decoding will be done. Else it can be a
717 dictionary mapping byte or a unicode string, which is treated as a lookup table.
718 Byte values greater that the length of the string and U+FFFE "characters" are
719 treated as "undefined mapping".
720

.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python bytes object.  Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python bytes object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.


The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object.  Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.


These are the MBCS codec APIs.  They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`.  If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
   a Python bytes object.  Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.

For decoding file names and other environment strings, :cdata:`Py_FileSystemEncoding`
should be used as the encoding, and ``"surrogateescape"`` should be used as the error
handler.  For encoding file names during argument parsing, the ``O&`` converter should
be used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:

.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   Convert *obj* into *result*, using the file system encoding and the
   ``surrogateescape`` error handler.  *result* must be the address of a
   ``PyObject*``; on success it receives a bytes or bytearray object, which must
   be released when it is no longer used.

   .. versionadded:: 3.1


.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
   set.  Separators are not included in the resulting list.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs.  It may be *NULL*, which indicates
   that the default error handling should be used.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise.  Return ``-1`` if an error occurred.


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object.  *maxcount* == -1 means replace all
   occurrences.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)

   Compare a Unicode object, *uni*, with *string* and return -1, 0, 1 for less
   than, equal, and greater than, respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.  The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one element Unicode string.  ``-1`` is returned if
   there was an error.


.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object); otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :cfunc:`PyUnicode_FromString` and
   :cfunc:`PyUnicode_InternInPlace`, returning either a new Unicode string object
   that has been interned, or a new ("owned") reference to an earlier interned
   string object with the same value.
