.. highlightlang:: c
2
3.. _unicodeobjects:
4
5Unicode Objects and Codecs
6--------------------------
7
8.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
9
10Unicode Objects
11^^^^^^^^^^^^^^^
12
13These are the basic Unicode object types used for the Unicode implementation in
14Python:
15
16.. % --- Unicode Type -------------------------------------------------------
17
18
19.. ctype:: Py_UNICODE
20
   This type represents the storage type that Python uses internally as the
   basis for holding Unicode ordinals. Python's default builds use a 16-bit type
   for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
   possible to build a UCS4 version of Python (most recent Linux distributions come
   with UCS4 builds of Python). These builds use a 32-bit type for
   :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
   where :ctype:`wchar_t` is available and compatible with the chosen Python
   Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).
32
33Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
34this in mind when writing extensions or interfaces.
35
36
37.. ctype:: PyUnicodeObject
38
39 This subtype of :ctype:`PyObject` represents a Python Unicode object.
40
41
42.. cvar:: PyTypeObject PyUnicode_Type
43
44 This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
45 is exposed to Python code as ``str``.
46
47The following APIs are really C macros and can be used to do fast checks and to
48access internal read-only data of Unicode objects:
49
50
51.. cfunction:: int PyUnicode_Check(PyObject *o)
52
53 Return true if the object *o* is a Unicode object or an instance of a Unicode
54 subtype.
55
56
57.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
58
59 Return true if the object *o* is a Unicode object, but not an instance of a
60 subtype.
61
62
63.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
64
   Return the size of the object, measured in :ctype:`Py_UNICODE` units. *o* has
   to be a :ctype:`PyUnicodeObject` (not checked).
67
68
69.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
70
71 Return the size of the object's internal buffer in bytes. *o* has to be a
72 :ctype:`PyUnicodeObject` (not checked).
73
74
75.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
76
77 Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
78 has to be a :ctype:`PyUnicodeObject` (not checked).
79
80
81.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
82
83 Return a pointer to the internal buffer of the object. *o* has to be a
84 :ctype:`PyUnicodeObject` (not checked).
85
87.. cfunction:: int PyUnicode_ClearFreeList(void)
88
89 Clear the free list. Return the total number of freed items.
90
Unicode provides many different character properties. The most often needed ones
are available through these macros, which are mapped to C functions depending on
the Python configuration.
94
95.. % --- Unicode character properties ---------------------------------------
96
97
98.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
99
100 Return 1 or 0 depending on whether *ch* is a whitespace character.
101
102
103.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
104
105 Return 1 or 0 depending on whether *ch* is a lowercase character.
106
107
108.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
109
110 Return 1 or 0 depending on whether *ch* is an uppercase character.
111
112
113.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
114
115 Return 1 or 0 depending on whether *ch* is a titlecase character.
116
117
118.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
119
120 Return 1 or 0 depending on whether *ch* is a linebreak character.
121
122
123.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
124
125 Return 1 or 0 depending on whether *ch* is a decimal character.
126
127
128.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
129
130 Return 1 or 0 depending on whether *ch* is a digit character.
131
132
133.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
134
135 Return 1 or 0 depending on whether *ch* is a numeric character.
136
137
138.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
139
140 Return 1 or 0 depending on whether *ch* is an alphabetic character.
141
142
143.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
144
145 Return 1 or 0 depending on whether *ch* is an alphanumeric character.
146
147These APIs can be used for fast direct character conversions:
148
149
150.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
151
152 Return the character *ch* converted to lower case.
153
154
155.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
156
157 Return the character *ch* converted to upper case.
158
159
160.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
161
162 Return the character *ch* converted to title case.
163
164
165.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
166
167 Return the character *ch* converted to a decimal positive integer. Return
168 ``-1`` if this is not possible. This macro does not raise exceptions.
169
170
171.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
172
173 Return the character *ch* converted to a single digit integer. Return ``-1`` if
174 this is not possible. This macro does not raise exceptions.
175
176
177.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
178
179 Return the character *ch* converted to a double. Return ``-1.0`` if this is not
180 possible. This macro does not raise exceptions.
181
182To create Unicode objects and access their basic sequence properties, use these
183APIs:
184
185.. % --- Plain Py_UNICODE ---------------------------------------------------
186
187
188.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
189
190 Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u*
191 may be *NULL* which causes the contents to be undefined. It is the user's
192 responsibility to fill in the needed data. The buffer is copied into the new
193 object. If the buffer is not *NULL*, the return value might be a shared object.
194 Therefore, modification of the resulting Unicode object is only allowed when *u*
195 is *NULL*.
196
197
198.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
199
200 Create a Unicode Object from the char buffer *u*. The bytes will be interpreted
201 as being UTF-8 encoded. *u* may also be *NULL* which
202 causes the contents to be undefined. It is the user's responsibility to fill in
203 the needed data. The buffer is copied into the new object. If the buffer is not
204 *NULL*, the return value might be a shared object. Therefore, modification of
205 the resulting Unicode object is only allowed when *u* is *NULL*.
206
207
208.. cfunction:: PyObject *PyUnicode_FromString(const char *u)
209
   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.
212
213
214.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
215
216 Take a C :cfunc:`printf`\ -style *format* string and a variable number of
217 arguments, calculate the size of the resulting Python unicode string and return
218 a string with the values formatted into it. The variable arguments must be C
219 types and must correspond exactly to the format characters in the *format*
220 string. The following format characters are allowed:
221
222 .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
223 .. % because not all compilers support the %z width modifier -- we fake it
224 .. % when necessary via interpolating PY_FORMAT_SIZE_T.
225
   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+
284
285 An unrecognized format character causes all the rest of the format string to be
286 copied as-is to the result string, and any extra arguments discarded.
287
288
289.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
290
   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments: the *format* string and a :ctype:`va_list` holding the values.
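
The "calculate the size, then format" approach and the variadic versus
:ctype:`va_list` split can be illustrated in plain C. The following is a minimal
sketch using only the C standard library; it is not how Python implements these
functions, and ``format_alloc`` and ``format_alloc_v`` are hypothetical names
chosen for the illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated buffer.
   Pass 1 measures the output, pass 2 writes it. */
char *format_alloc_v(const char *format, va_list vargs)
{
    va_list measure;
    int needed;
    char *buf;

    assert(format != NULL);

    /* A va_list can only be walked once, so measure on a copy. */
    va_copy(measure, vargs);
    needed = vsnprintf(NULL, 0, format, measure);
    va_end(measure);
    if (needed < 0)
        return NULL;

    buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;
    vsnprintf(buf, (size_t)needed + 1, format, vargs);
    return buf;
}

/* Variadic front end that forwards to the va_list version. */
char *format_alloc(const char *format, ...)
{
    va_list vargs;
    char *result;

    va_start(vargs, format);
    result = format_alloc_v(format, vargs);
    va_end(vargs);
    return result;
}
```

A caller would write ``char *s = format_alloc("%s has %d items", "list", 3);``
and release the result with ``free(s)``. Wrappers that already hold a
:ctype:`va_list` call the ``_v`` variant directly.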
293
294
295.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
296
297 Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
298 buffer, *NULL* if *unicode* is not a Unicode object.
299
300
301.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
302
303 Return the length of the Unicode object.
304
305
306.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
307
   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.
310
311 String and other char buffer compatible objects are decoded according to the
312 given encoding and using the error handling defined by errors. Both can be
313 *NULL* to have the interface use the default values (see the next section for
314 details).
315
316 All other objects, including Unicode objects, cause a :exc:`TypeError` to be
317 set.
318
319 The API returns *NULL* if there was an error. The caller is responsible for
320 decref'ing the returned objects.
321
322
323.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
324
325 Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
326 throughout the interpreter whenever coercion to Unicode is needed.
327
If the platform supports :ctype:`wchar_t` and provides a header file ``wchar.h``,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.
332
333.. % --- wchar_t support for platforms which support it ---------------------
334
335
336.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
337
338 Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
339 Return *NULL* on failure.
340
341
342.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
343
344 Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
345 *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
346 0-termination character). Return the number of :ctype:`wchar_t` characters
347 copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
348 string may or may not be 0-terminated. It is the responsibility of the caller
349 to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
350 required by the application.
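
Because the copied string is not guaranteed to be 0-terminated, callers usually
reserve one extra slot and terminate the buffer themselves. The sketch below
shows that pattern in plain C. ``copy_wide`` is a hypothetical stand-in for a
routine with the same "at most *size* characters, no terminator promised"
contract; it is not the Python function itself.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in: copy at most `size` wide characters and
   return the number copied. No terminator is written. */
size_t copy_wide(const wchar_t *src, size_t srclen, wchar_t *dst, size_t size)
{
    size_t n = srclen < size ? srclen : size;

    assert(src != NULL && dst != NULL);
    wmemcpy(dst, src, n);
    return n;
}

/* Caller-side pattern: keep one slot free and terminate explicitly. */
size_t copy_wide_terminated(const wchar_t *src, size_t srclen,
                            wchar_t *dst, size_t capacity)
{
    size_t n;

    assert(capacity > 0);
    n = copy_wide(src, srclen, dst, capacity - 1);
    dst[n] = L'\0';
    return n;
}
```

With a four-element buffer and a six-character source, three characters are
copied and the fourth slot holds the terminator.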
351
352
353.. _builtincodecs:
354
355Built-in Codecs
356^^^^^^^^^^^^^^^
357
358Python provides a set of builtin codecs which are written in C for speed. All of
359these codecs are directly usable via the following functions.
360
Many of the following APIs take two arguments, *encoding* and *errors*. These
have the same semantics as the corresponding parameters of the builtin
:func:`unicode` Unicode object constructor.
364
365Setting encoding to *NULL* causes the default encoding to be used which is
366ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
367as the encoding for file names. This variable should be treated as read-only: On
368some systems, it will be a pointer to a static string, on others, it will change
369at run-time (such as when the application invokes setlocale).
370
371Error handling is set by errors which may also be set to *NULL* meaning to use
372the default handling defined for the codec. Default error handling for all
373builtin codecs is "strict" (:exc:`ValueError` is raised).
374
The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.
377
378These are the generic codec APIs:
379
380.. % --- Generic Codecs -----------------------------------------------------
381
382
383.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
384
385 Create a Unicode object by decoding *size* bytes of the encoded string *s*.
386 *encoding* and *errors* have the same meaning as the parameters of the same name
387 in the :func:`unicode` builtin function. The codec to be used is looked up
388 using the Python codec registry. Return *NULL* if an exception was raised by
389 the codec.
390
391
392.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
393
394 Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
395 string object. *encoding* and *errors* have the same meaning as the parameters
396 of the same name in the Unicode :meth:`encode` method. The codec to be used is
397 looked up using the Python codec registry. Return *NULL* if an exception was
398 raised by the codec.
399
400
401.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
402
403 Encode a Unicode object and return the result as Python string object.
404 *encoding* and *errors* have the same meaning as the parameters of the same name
405 in the Unicode :meth:`encode` method. The codec to be used is looked up using
406 the Python codec registry. Return *NULL* if an exception was raised by the
407 codec.
408
409These are the UTF-8 codec APIs:
410
411.. % --- UTF-8 Codecs -------------------------------------------------------
412
413
414.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
415
416 Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
417 *s*. Return *NULL* if an exception was raised by the codec.
418
419
420.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
421
422 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
423 *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
424 treated as an error. Those bytes will not be decoded and the number of bytes
425 that have been decoded will be stored in *consumed*.
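
The idea behind *consumed* can be shown without the Python API: find how many
leading bytes form complete UTF-8 sequences and hold back a truncated tail for
the next chunk. This is a simplified sketch in plain C. The function name
``utf8_complete_prefix`` is hypothetical, and the code deliberately does not
validate continuation bytes, which a real decoder would do.

```c
#include <assert.h>
#include <stddef.h>

/* Return how many leading bytes of s[0..size) form complete UTF-8
   sequences. Only a sequence cut off at the end is held back. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;

    assert(s != NULL || size == 0);
    while (i < size) {
        unsigned char b = s[i];
        size_t need;

        if (b < 0x80)
            need = 1;                 /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            need = 2;                 /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0)
            need = 3;                 /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0)
            need = 4;                 /* 11110xxx */
        else
            need = 1;                 /* invalid lead byte: left to the decoder */

        if (size - i < need)
            break;                    /* truncated tail: stop before it */
        i += need;
    }
    return i;
}
```

A streaming reader would decode the returned number of bytes and prepend the
remainder to the next chunk of input.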
426
427
428.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
429
430 Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
431 Python string object. Return *NULL* if an exception was raised by the codec.
432
433
434.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
435
436 Encode a Unicode object using UTF-8 and return the result as Python string
437 object. Error handling is "strict". Return *NULL* if an exception was raised
438 by the codec.
439
440These are the UTF-32 codec APIs:
441
442.. % --- UTF-32 Codecs ------------------------------------------------------ */
443
444
445.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
446
   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".
450
451 If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
452 order::
453
454 *byteorder == -1: little endian
455 *byteorder == 0: native order
456 *byteorder == 1: big endian
457
458 and then switches if the first four bytes of the input data are a byte order mark
459 (BOM) and the specified byte order is native order. This BOM is not copied into
460 the resulting Unicode string. After completion, *\*byteorder* is set to the
461 current byte order at the end of input data.
462
   In a narrow build, code points outside the BMP will be decoded as surrogate pairs.
464
465 If *byteorder* is *NULL*, the codec starts in native order mode.
466
467 Return *NULL* if an exception was raised by the codec.
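
The byte order rule described above, where a BOM is only honoured when the
caller starts in native order, can be sketched in plain C as follows. The
function ``utf32_check_bom`` is a hypothetical helper written for this
illustration and is not part of the Python API.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the first four bytes for a UTF-32 byte order mark.
   On entry *byteorder is -1 (little), 0 (native) or 1 (big).
   A BOM is only honoured when *byteorder is 0.
   Returns the number of bytes to skip (0 or 4). */
size_t utf32_check_bom(const unsigned char *s, size_t size, int *byteorder)
{
    assert(byteorder != NULL);
    if (*byteorder != 0 || size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;    /* FF FE 00 00: little endian */
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;     /* 00 00 FE FF: big endian */
        return 4;
    }
    return 0;               /* no BOM: stay in native order */
}
```

The returned count tells the caller how many bytes to skip, which is why the
BOM does not appear in the decoded result.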
468
469
470.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
471
472 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
473 *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
474 trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
475 by four) as an error. Those bytes will not be decoded and the number of bytes
476 that have been decoded will be stored in *consumed*.
477
478
479.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
480
   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.
494
495 Return *NULL* if an exception was raised by the codec.
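
Emitting a surrogate pair as a single code point relies on the standard UTF-16
arithmetic. The following plain C sketch shows that calculation only; the
helper names are hypothetical and the code is independent of Python's encoder.

```c
#include <assert.h>
#include <stdint.h>

int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a high/low surrogate pair into the code point it encodes. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    /* Each surrogate carries 10 bits; the result is offset by 0x10000. */
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}
```

For instance, the pair ``0xD83D 0xDE00`` combines to U+1F600, which a UTF-32
encoder then writes as one four-byte unit.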
496
497
498.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
499
500 Return a Python string using the UTF-32 encoding in native byte order. The
501 string always starts with a BOM mark. Error handling is "strict". Return
502 *NULL* if an exception was raised by the codec.
503
504
505These are the UTF-16 codec APIs:
506
507.. % --- UTF-16 Codecs ------------------------------------------------------ */
508
509
510.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
511
   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".
515
516 If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
517 order::
518
519 *byteorder == -1: little endian
520 *byteorder == 0: native order
521 *byteorder == 1: big endian
522
523 and then switches if the first two bytes of the input data are a byte order mark
524 (BOM) and the specified byte order is native order. This BOM is not copied into
525 the resulting Unicode string. After completion, *\*byteorder* is set to the
526 current byte order at the end of input data.
527
528 If *byteorder* is *NULL*, the codec starts in native order mode.
529
530 Return *NULL* if an exception was raised by the codec.
531
532
533.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
534
535 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
536 *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
537 trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
538 split surrogate pair) as an error. Those bytes will not be decoded and the
539 number of bytes that have been decoded will be stored in *consumed*.
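
The two kinds of incomplete tail named above, an odd byte and a split surrogate
pair, can be detected with a few lines of plain C. This is a sketch under the
assumption that the byte order is already known; ``utf16_complete_prefix`` is a
hypothetical name and the function does not validate the rest of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Return how many leading bytes of s[0..size) can be decoded now.
   A dangling odd byte and a trailing high surrogate are held back. */
size_t utf16_complete_prefix(const unsigned char *s, size_t size,
                             int big_endian)
{
    size_t usable = size & ~(size_t)1;   /* drop a dangling odd byte */
    unsigned int last;

    assert(s != NULL || size == 0);
    if (usable == 0)
        return 0;

    /* Read the final complete 16-bit unit in the given byte order. */
    if (big_endian)
        last = ((unsigned int)s[usable - 2] << 8) | s[usable - 1];
    else
        last = ((unsigned int)s[usable - 1] << 8) | s[usable - 2];

    /* A high surrogate at the very end is half of a split pair. */
    if (last >= 0xD800 && last <= 0xDBFF)
        usable -= 2;
    return usable;
}
```

The value returned here plays the role of *consumed*: the caller keeps the
remaining bytes and retries once more input has arrived.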
540
541
542.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
543
   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.
558
559 Return *NULL* if an exception was raised by the codec.
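
How a single wide value turns into a surrogate pair can be shown with the
standard UTF-16 arithmetic. The plain C sketch below is illustrative only;
``utf16_units`` is a hypothetical helper, not the routine Python uses.

```c
#include <assert.h>
#include <stdint.h>

/* Write the UTF-16 code units for code point cp into out[] and
   return how many units were written (1 or 2). */
int utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);   /* surrogates are not code points */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* BMP: one unit, unchanged */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

A value inside the BMP such as U+20AC yields one unit, while U+1F600 yields the
pair ``0xD83D 0xDE00``.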
560
561
562.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
563
564 Return a Python string using the UTF-16 encoding in native byte order. The
565 string always starts with a BOM mark. Error handling is "strict". Return
566 *NULL* if an exception was raised by the codec.
567
568These are the "Unicode Escape" codec APIs:
569
570.. % --- Unicode-Escape Codecs ----------------------------------------------
571
572
573.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
574
575 Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
576 string *s*. Return *NULL* if an exception was raised by the codec.
577
578
579.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
580
581 Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
582 return a Python string object. Return *NULL* if an exception was raised by the
583 codec.
584
585
586.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
587
588 Encode a Unicode object using Unicode-Escape and return the result as Python
589 string object. Error handling is "strict". Return *NULL* if an exception was
590 raised by the codec.
591
592These are the "Raw Unicode Escape" codec APIs:
593
594.. % --- Raw-Unicode-Escape Codecs ------------------------------------------
595
596
597.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
598
599 Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
600 encoded string *s*. Return *NULL* if an exception was raised by the codec.
601
602
603.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
604
605 Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
606 and return a Python string object. Return *NULL* if an exception was raised by
607 the codec.
608
609
610.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
611
612 Encode a Unicode object using Raw-Unicode-Escape and return the result as
613 Python string object. Error handling is "strict". Return *NULL* if an exception
614 was raised by the codec.
615
These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals, and only these are accepted by the codecs during encoding.
618
619.. % --- Latin-1 Codecs -----------------------------------------------------
620
621
622.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
623
624 Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
625 *s*. Return *NULL* if an exception was raised by the codec.
626
627
628.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
629
630 Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
631 a Python string object. Return *NULL* if an exception was raised by the codec.
632
633
634.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
635
636 Encode a Unicode object using Latin-1 and return the result as Python string
637 object. Error handling is "strict". Return *NULL* if an exception was raised
638 by the codec.
639
640These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
641codes generate errors.
642
643.. % --- ASCII Codecs -------------------------------------------------------
644
645
646.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
647
648 Create a Unicode object by decoding *size* bytes of the ASCII encoded string
649 *s*. Return *NULL* if an exception was raised by the codec.
650
651
652.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
653
654 Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
655 Python string object. Return *NULL* if an exception was raised by the codec.
656
657
658.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
659
660 Encode a Unicode object using ASCII and return the result as Python string
661 object. Error handling is "strict". Return *NULL* if an exception was raised
662 by the codec.
663
664These are the mapping codec APIs:
665
666.. % --- Character Map Codecs -----------------------------------------------
667
This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
entries which map characters to different code points.
688
689
690.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
691
   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary mapping bytes, or a Unicode string, which is treated as a lookup
   table. Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
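
The lookup-table form of the mapping can be pictured as an array indexed by
byte value. The plain C sketch below assumes a table of 16-bit code points and
treats any byte that falls outside the table, or that maps to U+FFFE, as an
undefined mapping. ``charmap_decode`` and ``UNDEFINED_MAPPING`` are
hypothetical names, and the error is reported with a plain ``-1`` rather than
through an error handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDEFINED_MAPPING 0xFFFEu

/* Decode n bytes through a lookup table of table_len code points.
   Returns the number of code points written to out, or -1 when a
   byte has no defined mapping. */
long charmap_decode(const unsigned char *s, size_t n,
                    const uint16_t *table, size_t table_len,
                    uint32_t *out)
{
    size_t i;

    assert(table != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if ((size_t)s[i] >= table_len || table[s[i]] == UNDEFINED_MAPPING)
            return -1;                /* undefined mapping */
        out[i] = table[s[i]];
    }
    return (long)n;
}
```

A three-entry table ``{0x20AC, 0xFFFE, 0x0041}`` therefore decodes byte 0 to
U+20AC and byte 2 to U+0041, and rejects bytes 1 and 3.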
698
699
700.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
701
702 Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
703 *mapping* object and return a Python string object. Return *NULL* if an
704 exception was raised by the codec.
705
706
707.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
708
709 Encode a Unicode object using the given *mapping* object and return the result
710 as Python string object. Error handling is "strict". Return *NULL* if an
711 exception was raised by the codec.
712
The following codec API is special in that it maps Unicode to Unicode.
714
715
716.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
717
718 Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
719 character mapping *table* to it and return the resulting Unicode object. Return
720 *NULL* when an exception was raised by the codec.
721
   *table* must map Unicode ordinal integers to Unicode ordinal integers or None
   (causing deletion of the character).
724
725 Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
726 and sequences work well. Unmapped character ordinals (ones which cause a
727 :exc:`LookupError`) are left untouched and are copied as-is.
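
The three outcomes of a table lookup (replace, delete, or leave untouched) can
be sketched in plain C with a small array of pairs standing in for the mapping
object. ``struct translate_entry``, ``DELETE_CHAR`` and ``translate`` are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_CHAR (-1L)   /* plays the role of None in the table */

struct translate_entry {
    uint32_t from;          /* source ordinal */
    long to;                /* target ordinal, or DELETE_CHAR */
};

/* Translate size ordinals from s into out; returns the output length.
   out must have room for size elements. */
size_t translate(const uint32_t *s, size_t size,
                 const struct translate_entry *table, size_t entries,
                 uint32_t *out)
{
    size_t i, j, written = 0;

    assert(out != NULL);
    for (i = 0; i < size; i++) {
        long mapped = (long)s[i];     /* unmapped ordinals are copied as-is */

        for (j = 0; j < entries; j++) {
            if (table[j].from == s[i]) {
                mapped = table[j].to;
                break;
            }
        }
        if (mapped != DELETE_CHAR)
            out[written++] = (uint32_t)mapped;
    }
    return written;
}
```

With a table that maps ``'a'`` to ``'A'`` and deletes ``'b'``, the input
``abc`` becomes ``Ac``.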
728
729These are the MBCS codec APIs. They are currently only available on Windows and
730use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
731DBCS) is a class of encodings, not just one. The target encoding is defined by
732the user settings on the machine running the codec.
733
734.. % --- MBCS codecs for Windows --------------------------------------------
735
736
737.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
738
739 Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
740 Return *NULL* if an exception was raised by the codec.
741
742
743.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
744
   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.
749
750
.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.

.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

Unless stated otherwise, they return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If *maxsplit* is negative,
   no limit is set.  Separators are not included in the resulting list.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs.  It may be *NULL*, which indicates
   that the default error handling should be used.


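As an illustration of the mapping table described for
:cfunc:`PyUnicode_Translate`, the following minimal sketch builds a dictionary
table that maps ``a`` to ``A`` and deletes ``-``.  The helper name
``upper_a_drop_dash`` is hypothetical, and the sketch assumes a build where
integers are created with :cfunc:`PyLong_FromLong`.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: return a copy of *str* with 'a' replaced by 'A'
      and every '-' removed.  Returns a new reference, or NULL with an
      exception set. */
   static PyObject *
   upper_a_drop_dash(PyObject *str)
   {
       PyObject *table, *key = NULL, *value = NULL, *result = NULL;

       table = PyDict_New();
       if (table == NULL)
           return NULL;

       /* Ordinal of 'a' maps to ordinal of 'A'. */
       key = PyLong_FromLong('a');
       value = PyLong_FromLong('A');
       if (key == NULL || value == NULL ||
           PyDict_SetItem(table, key, value) < 0)
           goto done;
       Py_CLEAR(key);
       Py_CLEAR(value);

       /* Ordinal of '-' maps to None, which deletes the character. */
       key = PyLong_FromLong('-');
       if (key == NULL || PyDict_SetItem(table, key, Py_None) < 0)
           goto done;

       /* NULL selects the default error handling. */
       result = PyUnicode_Translate(str, table, NULL);

   done:
       Py_XDECREF(key);
       Py_XDECREF(value);
       Py_DECREF(table);
       return result;
   }

Characters without an entry in the table, such as ``b`` and ``n`` in
``"ban-ana"``, are copied unchanged.

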
.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


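:cfunc:`PyUnicode_Split` and :cfunc:`PyUnicode_Join` are often used together.
The sketch below, with the hypothetical helper name ``collapse_whitespace``,
splits at whitespace (a *NULL* separator, no split limit) and joins the pieces
with a single space.  It is one possible arrangement of the calls, not a
prescribed idiom.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: collapse runs of whitespace in *s* into single
      spaces and drop leading and trailing whitespace.  Returns a new
      reference, or NULL with an exception set. */
   static PyObject *
   collapse_whitespace(PyObject *s)
   {
       PyObject *parts, *space, *result;

       /* NULL separator: split at whitespace; negative maxsplit: no limit. */
       parts = PyUnicode_Split(s, NULL, -1);
       if (parts == NULL)
           return NULL;

       space = PyUnicode_FromString(" ");
       if (space == NULL) {
           Py_DECREF(parts);
           return NULL;
       }

       result = PyUnicode_Join(space, parts);
       Py_DECREF(space);
       Py_DECREF(parts);
       return result;
   }

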
.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise.  Return ``-1`` if an error occurred.


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


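Because :cfunc:`PyUnicode_Find` uses ``-1`` for "not found" and ``-2`` for
errors, callers need to keep the two cases apart.  The following sketch shows
one way to wrap it, together with a prefix test based on
:cfunc:`PyUnicode_Tailmatch`.  The helper names ``find_index`` and
``starts_with`` are hypothetical, and passing ``PY_SSIZE_T_MAX`` as *end* assumes
that out-of-range slice bounds are clipped to the string length.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: search forward for *sub* in the whole of *s*.
      On success return 0 and store the match index in *index (-1 if there
      is no match).  On error return -1 with an exception set. */
   static int
   find_index(PyObject *s, PyObject *sub, Py_ssize_t *index)
   {
       Py_ssize_t pos = PyUnicode_Find(s, sub, 0, PY_SSIZE_T_MAX, 1);

       if (pos == -2)
           return -1;          /* error, exception already set */
       *index = pos;           /* -1 here only means "not found" */
       return 0;
   }

   /* Hypothetical helper: return 1 if *s* starts with *prefix*, 0 if it
      does not, and -1 on error. */
   static int
   starts_with(PyObject *s, PyObject *prefix)
   {
       /* direction == -1 requests a prefix match. */
       return (int)PyUnicode_Tailmatch(s, prefix, 0, PY_SSIZE_T_MAX, -1);
   }

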
.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object.  *maxcount* == -1 means replace all
   occurrences.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, or ``1`` for less than, equal,
   and greater than, respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


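The three possible results of :cfunc:`PyUnicode_RichCompare` can be folded into
a plain C integer.  The sketch below uses a hypothetical helper named
``unicode_equal`` and assumes that the function returns a new reference to one
of the objects listed above; treating :const:`Py_NotImplemented` as a
:exc:`TypeError` is a choice made for this example only.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: return 1 if *left* == *right*, 0 if they differ,
      and -1 with an exception set on error. */
   static int
   unicode_equal(PyObject *left, PyObject *right)
   {
       PyObject *res = PyUnicode_RichCompare(left, right, Py_EQ);
       int equal;

       if (res == NULL)
           return -1;                      /* exception already set */
       if (res == Py_NotImplemented) {
           /* Unknown type combination: report it as an error here. */
           Py_DECREF(res);
           PyErr_SetString(PyExc_TypeError,
                           "operands cannot be compared as Unicode strings");
           return -1;
       }
       equal = (res == Py_True);
       Py_DECREF(res);
       return equal;
   }

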
.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.  The *args* argument must be a tuple.


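Since *args* must be a tuple, a convenient way to call
:cfunc:`PyUnicode_Format` from C is to build the tuple with
:cfunc:`Py_BuildValue`.  The following is a minimal sketch; the helper name
``describe_count`` and the format string are made up for the example.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: the C equivalent of
      "%s has %d items" % (name, count).
      Returns a new reference, or NULL with an exception set. */
   static PyObject *
   describe_count(const char *name, int count)
   {
       PyObject *format, *args, *result = NULL;

       format = PyUnicode_FromString("%s has %d items");
       if (format == NULL)
           return NULL;

       /* The parentheses in the format make Py_BuildValue return a tuple. */
       args = Py_BuildValue("(si)", name, count);
       if (args != NULL) {
           result = PyUnicode_Format(format, args);
           Py_DECREF(args);
       }
       Py_DECREF(format);
       return result;
   }

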
.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string.  ``-1`` is returned if
   there was an error.


.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object); otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :cfunc:`PyUnicode_FromString` and
   :cfunc:`PyUnicode_InternInPlace`, returning either a new Unicode string object
   that has been interned, or a new ("owned") reference to an earlier interned
   string object with the same value.

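The reference-count-neutral behaviour of :cfunc:`PyUnicode_InternInPlace` can
be seen in the sketch below, which spells out by hand roughly what
:cfunc:`PyUnicode_InternFromString` is described as doing.  The helper name
``make_interned_name`` is hypothetical.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: create a Unicode string from *name* and intern
      it.  Returns a new (owned) reference, or NULL with an exception set. */
   static PyObject *
   make_interned_name(const char *name)
   {
       PyObject *s = PyUnicode_FromString(name);

       if (s == NULL)
           return NULL;

       /* We own *s* before the call, so we own whatever *s* points to
          afterwards, whether it is the original object or an existing
          interned string with the same value. */
       PyUnicode_InternInPlace(&s);
       return s;
   }

Two calls with the same text are expected to yield the same object, so the
results can be compared by pointer.
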