.. highlightlang:: c
2
3.. _unicodeobjects:
4
5Unicode Objects and Codecs
6--------------------------
7
8.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
9
10Unicode Objects
11^^^^^^^^^^^^^^^
12
13These are the basic Unicode object types used for the Unicode implementation in
14Python:
15
16.. % --- Unicode Type -------------------------------------------------------
17
18
19.. ctype:: Py_UNICODE
20
21 This type represents the storage type which is used by Python internally as
22 basis for holding Unicode ordinals. Python's default builds use a 16-bit type
23 for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
24 possible to build a UCS4 version of Python (most recent Linux distributions come
25 with UCS4 builds of Python). These builds then use a 32-bit type for
26 :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
27 where :ctype:`wchar_t` is available and compatible with the chosen Python
28 Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
29 :ctype:`wchar_t` to enhance native platform compatibility. On all other
30 platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
31 short` (UCS2) or :ctype:`unsigned long` (UCS4).
32
33Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
34this in mind when writing extensions or interfaces.
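To make the difference between the two build variants concrete, the following
standalone sketch (plain C, not using the Python API) shows how one code point
would be stored as 16-bit units. The helper name ``to_utf16_units`` is
hypothetical, and the sketch assumes the input is a valid code point no larger
than U+10FFFF that is not itself a surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API: split a code point
 * into the 16-bit units a narrow (UCS2) build would store.  A wide (UCS4)
 * build would keep the same code point in a single 32-bit unit.
 * Returns the number of units written to out (1 or 2). */
static size_t to_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 payload bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Because the number of storage units per character can differ between the two
variants, code that indexes a :ctype:`Py_UNICODE` buffer directly should not
assume one unit per character.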
35
36
37.. ctype:: PyUnicodeObject
38
39 This subtype of :ctype:`PyObject` represents a Python Unicode object.
40
41
42.. cvar:: PyTypeObject PyUnicode_Type
43
44 This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
45 is exposed to Python code as ``str``.
46
47The following APIs are really C macros and can be used to do fast checks and to
48access internal read-only data of Unicode objects:
49
50
51.. cfunction:: int PyUnicode_Check(PyObject *o)
52
53 Return true if the object *o* is a Unicode object or an instance of a Unicode
54 subtype.
55
56
57.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
58
59 Return true if the object *o* is a Unicode object, but not an instance of a
60 subtype.
61
62
63.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
64
65 Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
66 checked).
67
68
69.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
70
71 Return the size of the object's internal buffer in bytes. *o* has to be a
72 :ctype:`PyUnicodeObject` (not checked).
73
74
75.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
76
77 Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
78 has to be a :ctype:`PyUnicodeObject` (not checked).
79
80
81.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
82
   Return a pointer to the internal buffer of the object.  *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList(void)
88
89 Clear the free list. Return the total number of freed items.
90
Unicode provides many different character properties. The most often needed ones
are available through these macros, which are mapped to C functions depending on
the Python configuration.
94
95.. % --- Unicode character properties ---------------------------------------
96
97
98.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
99
100 Return 1 or 0 depending on whether *ch* is a whitespace character.
101
102
103.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
104
105 Return 1 or 0 depending on whether *ch* is a lowercase character.
106
107
108.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
109
110 Return 1 or 0 depending on whether *ch* is an uppercase character.
111
112
113.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
114
115 Return 1 or 0 depending on whether *ch* is a titlecase character.
116
117
118.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
119
120 Return 1 or 0 depending on whether *ch* is a linebreak character.
121
122
123.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
124
125 Return 1 or 0 depending on whether *ch* is a decimal character.
126
127
128.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
129
130 Return 1 or 0 depending on whether *ch* is a digit character.
131
132
133.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
134
135 Return 1 or 0 depending on whether *ch* is a numeric character.
136
137
138.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
139
140 Return 1 or 0 depending on whether *ch* is an alphabetic character.
141
142
143.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
144
   Return 1 or 0 depending on whether *ch* is an alphanumeric character.


.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
149
150 Return 1 or 0 depending on whether *ch* is a printable character.
151 Nonprintable characters are those characters defined in the Unicode character
152 database as "Other" or "Separator", excepting the ASCII space (0x20) which is
153 considered printable. (Note that printable characters in this context are
154 those which should not be escaped when :func:`repr` is invoked on a string.
155 It has no bearing on the handling of strings written to :data:`sys.stdout` or
156 :data:`sys.stderr`.)
157
158
These APIs can be used for fast direct character conversions:
160
161
162.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
163
164 Return the character *ch* converted to lower case.
165
166
167.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
168
169 Return the character *ch* converted to upper case.
170
171
172.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
173
174 Return the character *ch* converted to title case.
175
176
177.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
178
179 Return the character *ch* converted to a decimal positive integer. Return
180 ``-1`` if this is not possible. This macro does not raise exceptions.
181
182
183.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
184
185 Return the character *ch* converted to a single digit integer. Return ``-1`` if
186 this is not possible. This macro does not raise exceptions.
187
188
189.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
190
191 Return the character *ch* converted to a double. Return ``-1.0`` if this is not
192 possible. This macro does not raise exceptions.
193
194To create Unicode objects and access their basic sequence properties, use these
195APIs:
196
197.. % --- Plain Py_UNICODE ---------------------------------------------------
198
199
200.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
201
202 Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u*
203 may be *NULL* which causes the contents to be undefined. It is the user's
204 responsibility to fill in the needed data. The buffer is copied into the new
205 object. If the buffer is not *NULL*, the return value might be a shared object.
206 Therefore, modification of the resulting Unicode object is only allowed when *u*
207 is *NULL*.
208
209
210.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
211
212 Create a Unicode Object from the char buffer *u*. The bytes will be interpreted
213 as being UTF-8 encoded. *u* may also be *NULL* which
214 causes the contents to be undefined. It is the user's responsibility to fill in
215 the needed data. The buffer is copied into the new object. If the buffer is not
216 *NULL*, the return value might be a shared object. Therefore, modification of
217 the resulting Unicode object is only allowed when *u* is *NULL*.
218
219
220.. cfunction:: PyObject *PyUnicode_FromString(const char *u)
221
   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.
224
225
226.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
227
228 Take a C :cfunc:`printf`\ -style *format* string and a variable number of
229 arguments, calculate the size of the resulting Python unicode string and return
230 a string with the values formatted into it. The variable arguments must be C
231 types and must correspond exactly to the format characters in the *format*
232 string. The following format characters are allowed:
233
234 .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
235 .. % because not all compilers support the %z width modifier -- we fake it
236 .. % when necessary via interpolating PY_FORMAT_SIZE_T.
237
   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`ascii`.                 |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+
299
300 An unrecognized format character causes all the rest of the format string to be
301 copied as-is to the result string, and any extra arguments discarded.
302
303
304.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
305
   Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.
308
309
310.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
311
312 Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
313 buffer, *NULL* if *unicode* is not a Unicode object.
314
315
316.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
317
318 Return the length of the Unicode object.
319
320
321.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
322
   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.
325
326 String and other char buffer compatible objects are decoded according to the
327 given encoding and using the error handling defined by errors. Both can be
328 *NULL* to have the interface use the default values (see the next section for
329 details).
330
331 All other objects, including Unicode objects, cause a :exc:`TypeError` to be
332 set.
333
334 The API returns *NULL* if there was an error. The caller is responsible for
335 decref'ing the returned objects.
336
337
338.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
339
340 Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
341 throughout the interpreter whenever coercion to Unicode is needed.
342
343If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
344Python can interface directly to this type using the following functions.
345Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
346the system's :ctype:`wchar_t`.
347
348.. % --- wchar_t support for platforms which support it ---------------------
349
350
351.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
352
   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Passing -1 as the size indicates that the function must itself compute the length,
   using wcslen.
   Return *NULL* on failure.
357
358
359.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
360
361 Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
362 *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
363 0-termination character). Return the number of :ctype:`wchar_t` characters
364 copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
365 string may or may not be 0-terminated. It is the responsibility of the caller
366 to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
367 required by the application.
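The termination caveat above suggests a simple caller-side pattern: reserve one
extra slot and write the terminator yourself. The sketch below is standalone C
and does not call the Python API; ``copy_wide`` is a hypothetical stand-in that
only imitates the contract described above (copy at most *size* characters,
no guaranteed terminator), and ``copy_wide_terminated`` is likewise an
illustrative name.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract described above: copy at most
 * size characters and do NOT guarantee a terminating 0.  Returns the
 * number of characters copied. */
static ptrdiff_t copy_wide(const wchar_t *src, size_t len,
                           wchar_t *w, size_t size)
{
    size_t n = len < size ? len : size;
    wmemcpy(w, src, n);
    return (ptrdiff_t)n;
}

/* Caller-side pattern: keep one slot free and terminate explicitly. */
static size_t copy_wide_terminated(const wchar_t *src, size_t len,
                                   wchar_t *buf, size_t bufsize)
{
    ptrdiff_t n;
    if (bufsize == 0)
        return 0;                       /* no room even for the terminator */
    n = copy_wide(src, len, buf, bufsize - 1);
    buf[n] = L'\0';
    return (size_t)n;
}
```

With the real function, the same idea would mean passing one less than the
buffer capacity as *size* and storing ``0`` at the returned index.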
368
369
370.. _builtincodecs:
371
372Built-in Codecs
373^^^^^^^^^^^^^^^
374
375Python provides a set of builtin codecs which are written in C for speed. All of
376these codecs are directly usable via the following functions.
377
378Many of the following APIs take two arguments encoding and errors. These
379parameters encoding and errors have the same semantics as the ones of the
380builtin unicode() Unicode object constructor.
381
382Setting encoding to *NULL* causes the default encoding to be used which is
383ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
384as the encoding for file names. This variable should be treated as read-only: On
385some systems, it will be a pointer to a static string, on others, it will change
386at run-time (such as when the application invokes setlocale).
387
388Error handling is set by errors which may also be set to *NULL* meaning to use
389the default handling defined for the codec. Default error handling for all
390builtin codecs is "strict" (:exc:`ValueError` is raised).
391
The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.
394
395These are the generic codec APIs:
396
397.. % --- Generic Codecs -----------------------------------------------------
398
399
400.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
401
402 Create a Unicode object by decoding *size* bytes of the encoded string *s*.
403 *encoding* and *errors* have the same meaning as the parameters of the same name
404 in the :func:`unicode` builtin function. The codec to be used is looked up
405 using the Python codec registry. Return *NULL* if an exception was raised by
406 the codec.
407
408
409.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
410
411 Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
412 string object. *encoding* and *errors* have the same meaning as the parameters
413 of the same name in the Unicode :meth:`encode` method. The codec to be used is
414 looked up using the Python codec registry. Return *NULL* if an exception was
415 raised by the codec.
416
417
418.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
419
420 Encode a Unicode object and return the result as Python string object.
421 *encoding* and *errors* have the same meaning as the parameters of the same name
422 in the Unicode :meth:`encode` method. The codec to be used is looked up using
423 the Python codec registry. Return *NULL* if an exception was raised by the
424 codec.
425
426These are the UTF-8 codec APIs:
427
428.. % --- UTF-8 Codecs -------------------------------------------------------
429
430
431.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
432
433 Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
434 *s*. Return *NULL* if an exception was raised by the codec.
435
436
437.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
438
439 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
440 *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
441 treated as an error. Those bytes will not be decoded and the number of bytes
442 that have been decoded will be stored in *consumed*.
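The role of *consumed* can be illustrated without the Python API. The
standalone sketch below computes how many leading bytes of a buffer form
complete UTF-8 sequences, which is the kind of count a stateful decoder reports
when the input ends in the middle of a character. The helper name
``utf8_complete_prefix`` is hypothetical, and the sketch only inspects lead
bytes; it makes no attempt to validate continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.  Returns how many
 * leading bytes of s form complete UTF-8 sequences; a truncated sequence
 * at the end is left out.  Assumption: only lead bytes are examined. */
static size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;
    while (i < size) {
        unsigned char c = s[i];
        size_t len;
        if (c < 0x80)
            len = 1;
        else if ((c & 0xE0) == 0xC0)
            len = 2;
        else if ((c & 0xF0) == 0xE0)
            len = 3;
        else if ((c & 0xF8) == 0xF0)
            len = 4;
        else
            len = 1;    /* invalid lead byte: a real decoder reports an error */
        if (len > size - i)
            break;      /* incomplete trailing sequence: stop before it */
        i += len;
    }
    return i;
}
```

A caller reading from a stream would typically keep the bytes past the returned
offset and prepend them to the next chunk before decoding again.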
443
444
445.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
446
447 Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
448 Python string object. Return *NULL* if an exception was raised by the codec.
449
450
451.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
452
453 Encode a Unicode object using UTF-8 and return the result as Python string
454 object. Error handling is "strict". Return *NULL* if an exception was raised
455 by the codec.
456
457These are the UTF-32 codec APIs:
458
459.. % --- UTF-32 Codecs ------------------------------------------------------ */
460
461
462.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
463
   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".
467
468 If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
469 order::
470
471 *byteorder == -1: little endian
472 *byteorder == 0: native order
473 *byteorder == 1: big endian
474
475 and then switches if the first four bytes of the input data are a byte order mark
476 (BOM) and the specified byte order is native order. This BOM is not copied into
477 the resulting Unicode string. After completion, *\*byteorder* is set to the
478 current byte order at the end of input data.
479
480 In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
481
482 If *byteorder* is *NULL*, the codec starts in native order mode.
483
484 Return *NULL* if an exception was raised by the codec.
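The byte order handling described above can be sketched in plain C. The helper
below is hypothetical (``utf32_check_bom`` is not a Python API name) and covers
only the BOM check: when the requested order is native (``0``) and the buffer
starts with a UTF-32 BOM, it switches the order and reports four bytes to skip.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.  If *byteorder is 0
 * (native) and the buffer starts with a UTF-32 BOM, set *byteorder to -1
 * (little endian) or 1 (big endian) and return 4, the number of bytes to
 * skip.  Otherwise leave *byteorder unchanged and return 0. */
static size_t utf32_check_bom(const unsigned char *s, size_t size,
                              int *byteorder)
{
    if (*byteorder != 0 || size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;                /* FF FE 00 00: little endian */
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;                 /* 00 00 FE FF: big endian */
        return 4;
    }
    return 0;
}
```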
485
486
487.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
488
489 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
490 *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
491 trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
492 by four) as an error. Those bytes will not be decoded and the number of bytes
493 that have been decoded will be stored in *consumed*.
494
495
496.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
497
498 Return a Python bytes object holding the UTF-32 encoded value of the Unicode
499 data in *s*. If *byteorder* is not ``0``, output is written according to the
500 following byte order::
501
502 byteorder == -1: little endian
503 byteorder == 0: native byte order (writes a BOM mark)
504 byteorder == 1: big endian
505
506 If byteorder is ``0``, the output string will always start with the Unicode BOM
507 mark (U+FEFF). In the other two modes, no BOM mark is prepended.
508
509 If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
510 as a single codepoint.
511
512 Return *NULL* if an exception was raised by the codec.
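The surrogate pair remark above amounts to the following arithmetic, shown here
as a standalone sketch. ``join_surrogates`` is a hypothetical helper, not a
Python API function, and returning ``0`` for "not a pair" is a convention chosen
only for this example.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.  If hi and lo form a
 * valid surrogate pair, return the single code point they encode;
 * otherwise return 0 to signal "not a pair". */
static uint32_t join_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    /* 10 bits from each surrogate, offset by 0x10000 */
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
                      (uint32_t)(lo - 0xDC00));
}
```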
513
514
515.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
516
517 Return a Python string using the UTF-32 encoding in native byte order. The
518 string always starts with a BOM mark. Error handling is "strict". Return
519 *NULL* if an exception was raised by the codec.
520
521
522These are the UTF-16 codec APIs:
523
524.. % --- UTF-16 Codecs ------------------------------------------------------ */
525
526
527.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
528
   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".
532
533 If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
534 order::
535
536 *byteorder == -1: little endian
537 *byteorder == 0: native order
538 *byteorder == 1: big endian
539
540 and then switches if the first two bytes of the input data are a byte order mark
541 (BOM) and the specified byte order is native order. This BOM is not copied into
542 the resulting Unicode string. After completion, *\*byteorder* is set to the
543 current byte order at the end of input data.
544
545 If *byteorder* is *NULL*, the codec starts in native order mode.
546
547 Return *NULL* if an exception was raised by the codec.
548
549
550.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
551
552 If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
553 *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
554 trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
555 split surrogate pair) as an error. Those bytes will not be decoded and the
556 number of bytes that have been decoded will be stored in *consumed*.
557
558
559.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
560
561 Return a Python string object holding the UTF-16 encoded value of the Unicode
562 data in *s*. If *byteorder* is not ``0``, output is written according to the
563 following byte order::
564
565 byteorder == -1: little endian
566 byteorder == 0: native byte order (writes a BOM mark)
567 byteorder == 1: big endian
568
569 If byteorder is ``0``, the output string will always start with the Unicode BOM
570 mark (U+FEFF). In the other two modes, no BOM mark is prepended.
571
   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.
575
576 Return *NULL* if an exception was raised by the codec.
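The *byteorder* convention for encoding can be sketched as follows. This is
standalone C rather than the actual encoder; ``utf16_write`` and
``host_is_little_endian`` are hypothetical names, and the input is assumed to be
already split into 16-bit units.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical helper, not part of the Python C API.  Writes n 16-bit
 * units to out using the byteorder convention described above and returns
 * the number of bytes written.  out must have room for 2 * n + 2 bytes. */
static size_t utf16_write(const uint16_t *units, size_t n, int byteorder,
                          unsigned char *out)
{
    int little = byteorder == 0 ? host_is_little_endian() : byteorder < 0;
    size_t pos = 0, i;

    if (byteorder == 0) {               /* native order: prepend U+FEFF */
        out[pos++] = little ? 0xFF : 0xFE;
        out[pos++] = little ? 0xFE : 0xFF;
    }
    for (i = 0; i < n; i++) {
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        unsigned char hi = (unsigned char)(units[i] >> 8);
        out[pos++] = little ? lo : hi;
        out[pos++] = little ? hi : lo;
    }
    return pos;
}
```

Only the native-order mode emits the BOM, so a reader of explicitly
little-endian or big-endian output has to know the byte order by other means.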
577
578
579.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
580
581 Return a Python string using the UTF-16 encoding in native byte order. The
582 string always starts with a BOM mark. Error handling is "strict". Return
583 *NULL* if an exception was raised by the codec.
584
585These are the "Unicode Escape" codec APIs:
586
587.. % --- Unicode-Escape Codecs ----------------------------------------------
588
589
590.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
591
592 Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
593 string *s*. Return *NULL* if an exception was raised by the codec.
594
595
596.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
597
598 Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
599 return a Python string object. Return *NULL* if an exception was raised by the
600 codec.
601
602
603.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
604
605 Encode a Unicode object using Unicode-Escape and return the result as Python
606 string object. Error handling is "strict". Return *NULL* if an exception was
607 raised by the codec.
608
609These are the "Raw Unicode Escape" codec APIs:
610
611.. % --- Raw-Unicode-Escape Codecs ------------------------------------------
612
613
614.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
615
616 Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
617 encoded string *s*. Return *NULL* if an exception was raised by the codec.
618
619
620.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
621
622 Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
623 and return a Python string object. Return *NULL* if an exception was raised by
624 the codec.
625
626
627.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
628
629 Encode a Unicode object using Raw-Unicode-Escape and return the result as
630 Python string object. Error handling is "strict". Return *NULL* if an exception
631 was raised by the codec.
632
These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals, and only these are accepted by the codecs during encoding.
635
636.. % --- Latin-1 Codecs -----------------------------------------------------
637
638
639.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
640
641 Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
642 *s*. Return *NULL* if an exception was raised by the codec.
643
644
645.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
646
647 Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
648 a Python string object. Return *NULL* if an exception was raised by the codec.
649
650
651.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
652
653 Encode a Unicode object using Latin-1 and return the result as Python string
654 object. Error handling is "strict". Return *NULL* if an exception was raised
655 by the codec.
656
657These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
658codes generate errors.
659
660.. % --- ASCII Codecs -------------------------------------------------------
661
662
663.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
664
665 Create a Unicode object by decoding *size* bytes of the ASCII encoded string
666 *s*. Return *NULL* if an exception was raised by the codec.
667
668
669.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
670
671 Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
672 Python string object. Return *NULL* if an exception was raised by the codec.
673
674
675.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
676
677 Encode a Unicode object using ASCII and return the result as Python string
678 object. Error handling is "strict". Return *NULL* if an exception was raised
679 by the codec.
680
681These are the mapping codec APIs:
682
683.. % --- Character Map Codecs -----------------------------------------------
684
685This codec is special in that it can be used to implement many different codecs
686(and this is in fact what was done to obtain most of the standard codecs
687included in the :mod:`encodings` package). The codec uses mapping to encode and
688decode characters.
689
690Decoding mappings must map single string characters to single Unicode
691characters, integers (which are then interpreted as Unicode ordinals) or None
692(meaning "undefined mapping" and causing an error).
693
694Encoding mappings must map single Unicode characters to single string
695characters, integers (which are then interpreted as Latin-1 ordinals) or None
696(meaning "undefined mapping" and causing an error).
697
The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain entries
that map characters to different code points.
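The "copy as-is on a failed lookup" rule can be illustrated with a small
standalone decoder. It does not use the Python API: the sparse table type
``charmap_entry`` and the function ``charmap_decode`` are hypothetical, and a
linear search stands in for the ``__getitem__`` lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sparse mapping entry: byte value -> Unicode ordinal. */
struct charmap_entry {
    unsigned char from;
    uint32_t to;
};

/* Hypothetical helper, not part of the Python C API.  Decodes n bytes into
 * out (room for n ordinals).  A byte with no entry in map is copied as-is,
 * i.e. interpreted as the Unicode ordinal with the same value. */
static void charmap_decode(const unsigned char *s, size_t n,
                           const struct charmap_entry *map, size_t map_len,
                           uint32_t *out)
{
    size_t i, j;
    for (i = 0; i < n; i++) {
        out[i] = s[i];                  /* default: identity mapping */
        for (j = 0; j < map_len; j++) {
            if (map[j].from == s[i]) {
                out[i] = map[j].to;     /* explicit entry overrides it */
                break;
            }
        }
    }
}
```

A table for a Latin-1 variant therefore needs entries only for the positions
where it departs from Latin-1.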
705
706
707.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
708
   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can be
   a dictionary mapping bytes, or a Unicode string which is treated as a lookup
   table. Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
715
716
717.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
718
719 Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
720 *mapping* object and return a Python string object. Return *NULL* if an
721 exception was raised by the codec.
722
723
724.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
725
726 Encode a Unicode object using the given *mapping* object and return the result
727 as Python string object. Error handling is "strict". Return *NULL* if an
728 exception was raised by the codec.
729
The following codec API is special in that it maps Unicode to Unicode.
731
732
733.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
734
735 Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
736 character mapping *table* to it and return the resulting Unicode object. Return
737 *NULL* when an exception was raised by the codec.
738
   The *table* object must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).
741
742 Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
743 and sequences work well. Unmapped character ordinals (ones which cause a
744 :exc:`LookupError`) are left untouched and are copied as-is.

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python bytes object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If *maxsplit* is negative,
   no limit is set. Separators are not included in the resulting list.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL*, which selects the
   default error handling.
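
   A short sketch of a dictionary table, assuming an initialized interpreter.
   The helper name ``translate_demo`` and the table contents are made up for
   illustration: ``a`` is mapped to ``A``, ``x`` is deleted, and every other
   character is copied unchanged::

      #include <Python.h>

      /* Hypothetical helper: translate `text` with the table
         {ord('a'): ord('A'), ord('x'): None}.
         Returns a new reference or NULL with an exception set. */
      static PyObject *
      translate_demo(PyObject *text)
      {
          PyObject *table, *result;

          table = Py_BuildValue("{i:i,i:O}", 'a', 'A', 'x', Py_None);
          if (table == NULL)
              return NULL;
          /* NULL selects the default error handling. */
          result = PyUnicode_Translate(text, table, NULL);
          Py_DECREF(table);
          return result;
      }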


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.
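
   Splitting and joining are often used together. The sketch below, with the
   hypothetical helper ``normalize_spaces``, splits on runs of whitespace and
   rejoins the pieces with single spaces. It assumes an initialized
   interpreter::

      #include <Python.h>

      /* Hypothetical helper: collapse every run of whitespace in `text`
         to one space and drop leading and trailing whitespace.
         Returns a new reference or NULL with an exception set. */
      static PyObject *
      normalize_spaces(PyObject *text)
      {
          PyObject *words, *sep, *result;

          /* sep == NULL: split at whitespace; maxsplit < 0: no limit. */
          words = PyUnicode_Split(text, NULL, -1);
          if (words == NULL)
              return NULL;
          sep = PyUnicode_FromString(" ");
          if (sep == NULL) {
              Py_DECREF(words);
              return NULL;
          }
          result = PyUnicode_Join(sep, words);
          Py_DECREF(sep);
          Py_DECREF(words);
          return result;
      }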


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.
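
   The two directions can be wrapped as prefix and suffix tests. The helper
   names ``starts_with`` and ``ends_with`` are illustrative, and passing
   ``PY_SSIZE_T_MAX`` as *end* relies on the index being clipped to the string
   length, as it is for slices::

      #include <Python.h>

      /* Hypothetical helpers: return 1 on a match, 0 on no match and
         -1 if an error occurred. */
      static int
      starts_with(PyObject *text, PyObject *prefix)
      {
          return (int)PyUnicode_Tailmatch(text, prefix, 0, PY_SSIZE_T_MAX, -1);
      }

      static int
      ends_with(PyObject *text, PyObject *suffix)
      {
          return (int)PyUnicode_Tailmatch(text, suffix, 0, PY_SSIZE_T_MAX, 1);
      }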


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.
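
   Because ``-1`` and ``-2`` mean different things, callers need to tell three
   outcomes apart. A minimal sketch, with the hypothetical helper
   ``find_forward`` and an initialized interpreter assumed::

      #include <Python.h>

      /* Hypothetical helper: search `text` for `needle` from the left.
         Returns 1 and stores the index in *pos on a match, 0 if there
         is no match, and -1 if an exception has been set. */
      static int
      find_forward(PyObject *text, PyObject *needle, Py_ssize_t *pos)
      {
          Py_ssize_t i = PyUnicode_Find(text, needle, 0, PY_SSIZE_T_MAX, 1);

          if (i == -2)
              return -1;      /* error, exception is set */
          if (i == -1)
              return 0;       /* not found, no exception */
          *pos = i;
          return 1;
      }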


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.
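
   For instance, a replace-all call might look like the sketch below. The
   helper name ``dashes_to_underscores`` is made up for illustration, and an
   initialized interpreter is assumed::

      #include <Python.h>

      /* Hypothetical helper: replace every "-" in `text` with "_".
         Returns a new reference or NULL with an exception set. */
      static PyObject *
      dashes_to_underscores(PyObject *text)
      {
          PyObject *dash, *under, *result = NULL;

          dash = PyUnicode_FromString("-");
          under = PyUnicode_FromString("_");
          if (dash != NULL && under != NULL)
              result = PyUnicode_Replace(text, dash, under, -1);
          Py_XDECREF(dash);
          Py_XDECREF(under);
          return result;
      }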


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)

   Compare a Unicode object, *uni*, with *string* and return -1, 0, 1 for less
   than, equal, and greater than, respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich-compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
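
   The result is an object, so it has to be checked and released. A minimal
   sketch of an equality test follows; the helper name ``unicode_equal`` is an
   assumption, and the result is treated as a new reference::

      #include <Python.h>

      /* Hypothetical helper: return 1 if `a` and `b` compare equal,
         0 if they do not (including the Py_NotImplemented case), and
         -1 if an exception was raised. */
      static int
      unicode_equal(PyObject *a, PyObject *b)
      {
          PyObject *res = PyUnicode_RichCompare(a, b, Py_EQ);
          int equal;

          if (res == NULL)
              return -1;
          equal = (res == Py_True);
          Py_DECREF(res);
          return equal;
      }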


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.
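
   As an illustration, the hypothetical helper ``describe`` below builds the
   argument tuple with :cfunc:`Py_BuildValue` and formats it. It assumes an
   initialized interpreter::

      #include <Python.h>

      /* Hypothetical helper: the C equivalent of
         "%s has %d items" % (name, count).
         Returns a new reference or NULL with an exception set. */
      static PyObject *
      describe(const char *name, int count)
      {
          PyObject *format, *args, *result = NULL;

          format = PyUnicode_FromString("%s has %d items");
          args = Py_BuildValue("(si)", name, count);   /* a 2-tuple */
          if (format != NULL && args != NULL)
              result = PyUnicode_Format(format, args);
          Py_XDECREF(format);
          Py_XDECREF(args);
          return result;
      }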


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.


.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place. The argument must be the address of a
   pointer variable pointing to a Python Unicode string object. If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :cfunc:`PyUnicode_FromString` and
   :cfunc:`PyUnicode_InternInPlace`, returning either a new Unicode string object
   that has been interned, or a new ("owned") reference to an earlier interned
   string object with the same value.

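
   One consequence of interning is that equal interned strings are the same
   object, so they can be compared by pointer. A small sketch, with the
   hypothetical helper ``interned_is_shared`` and an initialized interpreter
   assumed::

      #include <Python.h>

      /* Hypothetical helper: intern the same C string twice and report
         whether both calls produced the same object.  Returns 1 or 0,
         or -1 if an exception has been set.  Each call hands back an
         owned reference, so both are released here. */
      static int
      interned_is_shared(const char *s)
      {
          PyObject *a = PyUnicode_InternFromString(s);
          PyObject *b = PyUnicode_InternFromString(s);
          int same = -1;

          if (a != NULL && b != NULL)
              same = (a == b);
          Py_XDECREF(a);
          Py_XDECREF(b);
          return same;
      }
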