.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


These are the basic Unicode object types used for the Unicode implementation in
Python:

.. % --- Unicode Type -------------------------------------------------------


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   basis for holding Unicode ordinals. Python's default builds use a 16-bit type
   for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
   possible to build a UCS4 version of Python (most recent Linux distributions come
   with UCS4 builds of Python). These builds then use a 32-bit type for
   :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
   where :ctype:`wchar_t` is available and compatible with the chosen Python
   Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either
   :ctype:`unsigned short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


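The practical difference between the two build variants can be sketched in
plain C. The following is a stand-alone illustration, not part of the Python
API: the names ``narrow_unit`` and ``store_narrow`` are hypothetical, and the
code only models how a 16-bit storage type has to split an ordinal above
U+FFFF into a surrogate pair, while a 32-bit type would hold it in one unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Py_UNICODE; not a Python type. */
typedef uint16_t narrow_unit;

/* Store the code point cp in 16-bit units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value. */
size_t store_narrow(uint32_t cp, narrow_unit out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (narrow_unit)cp;          /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                         /* 20 bits remain */
    out[0] = (narrow_unit)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (narrow_unit)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Because the same text can occupy a different number of storage units in each
variant, buffers and lengths computed against one build are not meaningful to
the other.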
.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.

.. % --- Unicode character properties ---------------------------------------


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. % --- Plain Py_UNICODE ---------------------------------------------------


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data. The buffer is copied into the
   new object. If the buffer is not *NULL*, the return value might be a shared
   object. Therefore, modification of the resulting Unicode object is only allowed
   when *u* is *NULL*.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, or *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file ``wchar.h``,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.

.. % --- wchar_t support for platforms which support it ---------------------


.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of builtin codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the builtin :func:`unicode`
Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string; on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
the default handling defined for the codec. Default error handling for all
builtin codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.

These are the generic codec APIs:

.. % --- Generic Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` builtin function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.

These are the UTF-8 codec APIs:

.. % --- UTF-8 Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


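To make the notion of a "trailing incomplete UTF-8 byte sequence" concrete,
here is a minimal stand-alone sketch in plain C. It does not call the Python
API; ``utf8_complete_prefix`` is a hypothetical helper that only inspects the
tail of a buffer and reports how many bytes end on a sequence boundary, which
is the kind of value a stateful decoder could report through *consumed*. The
input is otherwise assumed to be well-formed.

```c
#include <stddef.h>

/* Number of leading bytes of s[0..n) that end on a UTF-8 sequence boundary.
 * Only the tail is inspected; anything that looks malformed is left for a
 * real decoder to diagnose, so n is returned unchanged in that case. */
size_t utf8_complete_prefix(const unsigned char *s, size_t n)
{
    size_t i = n;
    size_t back = 0;
    size_t need, have;
    unsigned char lead;

    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (s[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return n;                       /* no lead byte in the buffer */

    lead = s[i - 1];
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return n;                       /* not a lead byte */

    have = n - (i - 1);                 /* lead byte plus continuations seen */
    return have < need ? i - 1 : n;
}
```

A caller feeding data in chunks would keep the bytes after the returned offset
and prepend them to the next chunk.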
.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the UTF-32 codec APIs:

.. % --- UTF-32 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first four bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate
   pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


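The byte order rule described for the UTF-32 decoder can be modelled in a few
lines of plain C. This is a minimal sketch under stated assumptions, not the
decoder's actual implementation: ``utf32_bom_order`` and
``utf32_effective_order`` are hypothetical names, and the model only covers
the start-up decision (honour a BOM when the requested order is ``0``, and
skip it in that case).

```c
#include <stddef.h>

/* Returns -1 for a little-endian BOM, 1 for a big-endian BOM, and 0 if the
 * buffer does not start with a UTF-32 byte order mark. */
int utf32_bom_order(const unsigned char *s, size_t n)
{
    if (n < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00)
        return -1;
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
        return 1;
    return 0;
}

/* Model of the start-up rule: a BOM is only honoured (and skipped) when the
 * requested order is 0. Returns the order to decode with, where 0 still
 * means "native", and stores the number of bytes to skip in *skip. */
int utf32_effective_order(const unsigned char *s, size_t n,
                          int requested, size_t *skip)
{
    int bom = utf32_bom_order(s, n);

    *skip = 0;
    if (requested == 0 && bom != 0) {
        *skip = 4;
        return bom;
    }
    return requested;
}
```

With an explicit request of ``-1`` or ``1`` the model leaves a leading BOM in
place, so in this sketch it would be decoded like any other code point.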
.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


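The statement that surrogate pairs are output as a single code point can be
illustrated with a short stand-alone sketch in plain C. It is not taken from
the codec: ``join_surrogates`` is a hypothetical helper that turns a buffer of
16-bit units, as a narrow build would hold them, into 32-bit code points ready
to be written out as UTF-32. Unpaired surrogates are passed through unchanged
in this model.

```c
#include <stddef.h>
#include <stdint.h>

/* Collapse surrogate pairs in units[0..n) into single code points.
 * out must have room for n entries. Returns the number of code points
 * written. Unpaired surrogates are copied as they are. */
size_t join_surrogates(const uint16_t *units, size_t n, uint32_t *out)
{
    size_t written = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t cp = units[i];

        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            /* high surrogate followed by low surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            i++;
        }
        out[written++] = cp;
    }
    return written;
}
```

Each resulting code point then occupies exactly four bytes in the encoded
output, regardless of how many storage units it needed before.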
.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


These are the UTF-16 codec APIs:

.. % --- UTF-16 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first two bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size* and an :ctype:`int *`
      type for *consumed*. This might require changes in your code for
      properly supporting 64-bit systems.


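The two kinds of incomplete UTF-16 input named above (an odd number of bytes,
or a split surrogate pair) can be detected with a small amount of plain C.
The following is a sketch, not the codec's code: ``utf16le_complete_prefix`` is
a hypothetical helper, and it assumes little-endian input with any BOM already
removed.

```c
#include <stddef.h>

/* For a little-endian UTF-16 buffer, return how many bytes can be decoded
 * without cutting a code unit or a surrogate pair in half. */
size_t utf16le_complete_prefix(const unsigned char *s, size_t n)
{
    size_t usable = n & ~(size_t)1;     /* drop a trailing odd byte */

    if (usable >= 2) {
        unsigned int unit = s[usable - 2] |
                            ((unsigned int)s[usable - 1] << 8);
        /* A high surrogate at the very end is still waiting for its
         * low surrogate, so leave it for the next call. */
        if (unit >= 0xD800 && unit <= 0xDBFF)
            usable -= 2;
    }
    return usable;
}
```

The bytes past the returned offset are the ones a stateful decoder would leave
undecoded until more input arrives.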
.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


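The effect of the three *byteorder* values on the encoded bytes can be shown
with a minimal stand-alone model in plain C. None of the names below belong
to the Python API; ``utf16_encode_units`` is a hypothetical function that
serialises 16-bit units and, as described above, writes a BOM only when the
byte order is ``0`` (native).

```c
#include <stddef.h>
#include <stdint.h>

/* Detect the byte order of the machine running the code. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Write one 16-bit unit in the requested order; returns bytes written. */
static size_t put_unit(unsigned char *out, uint16_t unit, int little)
{
    out[0] = (unsigned char)(little ? unit & 0xFF : unit >> 8);
    out[1] = (unsigned char)(little ? unit >> 8 : unit & 0xFF);
    return 2;
}

/* Serialise n units into out, which must hold 2 * n + 2 bytes.
 * byteorder: -1 little endian, 1 big endian, 0 native order with a BOM.
 * Returns the number of bytes written. */
size_t utf16_encode_units(const uint16_t *units, size_t n,
                          int byteorder, unsigned char *out)
{
    int little = byteorder == 0 ? host_is_little_endian() : byteorder < 0;
    size_t pos = 0;
    size_t i;

    if (byteorder == 0)
        pos += put_unit(out + pos, 0xFEFF, little);   /* BOM */
    for (i = 0; i < n; i++)
        pos += put_unit(out + pos, units[i], little);
    return pos;
}
```

A reader that sees the BOM bytes ``FF FE`` knows the rest is little endian,
and ``FE FF`` signals big endian, which is why the mark is only needed when
the order is not fixed in advance.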
.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

These are the "Unicode Escape" codec APIs:

.. % --- Unicode-Escape Codecs ----------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the "Raw Unicode Escape" codec APIs:

.. % --- Raw-Unicode-Escape Codecs ------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the mapping codec APIs:

.. % --- Character Map Codecs -----------------------------------------------

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.


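The decoding rules above (mapped value, "undefined mapping", and copy as-is
when the lookup fails) can be sketched in plain C. This is a simplified model
under stated assumptions, not the codec itself: the mapping is a small array
of hypothetical ``map_entry`` records instead of a Python object, and
``MAP_UNDEFINED`` stands in for a ``None`` entry.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_UNDEFINED (-1)          /* models a None entry */

struct map_entry {
    unsigned char key;              /* input byte */
    int32_t value;                  /* Unicode ordinal or MAP_UNDEFINED */
};

/* Decode n bytes of s through map into out (room for n ordinals).
 * Returns the number of ordinals written, or -1 when a byte hits an
 * undefined mapping. Bytes without an entry are copied as they are. */
long charmap_decode(const unsigned char *s, size_t n,
                    const struct map_entry *map, size_t map_len,
                    uint32_t *out)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        int32_t value = s[i];       /* default when the lookup fails */

        for (j = 0; j < map_len; j++) {
            if (map[j].key == s[i]) {
                value = map[j].value;
                break;
            }
        }
        if (value == MAP_UNDEFINED)
            return -1;              /* "undefined mapping" is an error */
        out[i] = (uint32_t)value;
    }
    return (long)n;
}
```

Because unmapped bytes fall through unchanged, the table in this model only
needs entries for the positions where the encoding differs from Latin-1.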
704.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
705
706 Create a Unicode object by decoding *size* bytes of the encoded string *s* using
707 the given *mapping* object. Return *NULL* if an exception was raised by the
708 codec. If *mapping* is *NULL* latin-1 decoding will be done. Else it can be a
709 dictionary mapping byte or a unicode string, which is treated as a lookup table.
710 Byte values greater that the length of the string and U+FFFE "characters" are
711 treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.
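
   The table semantics described above can be modeled in plain C as follows.
   This is a standalone sketch that does not call the Python C API;
   ``tr_entry``, ``translate`` and ``TR_DELETE`` are hypothetical names, and the
   table is assumed to be a small array of (from, to) pairs::

      #include <assert.h>
      #include <stddef.h>

      #define TR_DELETE (-1L)   /* models a table value of None */

      typedef struct {
          long from;   /* input ordinal */
          long to;     /* output ordinal, or TR_DELETE */
      } tr_entry;

      /* Translate n ordinals from src into dst (which must have room for n
         entries) and return the number of ordinals written. */
      size_t
      translate(const tr_entry *table, size_t table_len,
                const long *src, size_t n, long *dst)
      {
          size_t i, j, out = 0;
          assert(src != NULL && dst != NULL);
          for (i = 0; i < n; i++) {
              long result = src[i];          /* unmapped: copied as-is */
              for (j = 0; j < table_len; j++) {
                  if (table[j].from == src[i]) {
                      result = table[j].to;
                      break;
                  }
              }
              if (result != TR_DELETE)       /* None: drop the character */
                  dst[out++] = result;
          }
          return out;
      }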

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.
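
   The effect on *consumed* can be pictured with a plain C model. This is a
   sketch under loud assumptions, not the Win32 converter: ``is_lead_byte`` and
   ``complete_prefix`` are hypothetical names, and the lead-byte range used
   here is invented for illustration, since real code pages define their own::

      #include <assert.h>
      #include <stddef.h>

      /* Hypothetical lead-byte test; actual code pages differ. */
      int
      is_lead_byte(unsigned char b)
      {
          return b >= 0x81 && b <= 0xFE;
      }

      /* Return how many of the first size bytes form complete characters. */
      size_t
      complete_prefix(const unsigned char *s, size_t size)
      {
          size_t i = 0;
          assert(s != NULL || size == 0);
          while (i < size) {
              if (is_lead_byte(s[i])) {
                  if (i + 1 >= size)
                      break;   /* trailing lead byte: leave it undecoded */
                  i += 2;      /* lead byte plus its trail byte */
              }
              else {
                  i += 1;      /* single-byte character */
              }
          }
          return i;
      }

   A caller reading data in chunks would keep the bytes after the returned
   offset and prepend them to the next chunk.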

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL*, which indicates
   that the default error handling should be used.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches *str*[*start*:*end*] at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.
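
   The meaning of *direction* can be modeled on plain C strings as follows.
   This is a standalone sketch, not the Unicode implementation: ``tailmatch``
   is a hypothetical helper, it works on bytes rather than Unicode objects, and
   it assumes ``0 <= start <= end <= strlen(str)``, so it has no error return::

      #include <assert.h>
      #include <string.h>

      /* direction == -1: prefix match; direction == 1: suffix match. */
      int
      tailmatch(const char *str, const char *substr,
                size_t start, size_t end, int direction)
      {
          size_t len = strlen(substr);
          assert(start <= end && end <= strlen(str));
          assert(direction == -1 || direction == 1);
          if (len > end - start)
              return 0;            /* substr is longer than the slice */
          if (direction == -1)     /* compare at the start of the slice */
              return memcmp(str + start, substr, len) == 0;
          /* compare at the end of the slice */
          return memcmp(str + end - len, substr, len) == 0;
      }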

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in *str*[*start*:*end*] using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.
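
   What "non-overlapping" means can be shown with a plain C model. This is a
   standalone sketch on C strings; ``count_nonoverlapping`` is a hypothetical
   helper that assumes a non-empty *substr* and ignores the *start* and *end*
   arguments::

      #include <assert.h>
      #include <string.h>

      /* Count occurrences of substr in str, scanning left to right. */
      size_t
      count_nonoverlapping(const char *str, const char *substr)
      {
          size_t n = 0, len = strlen(substr);
          const char *p = str;
          assert(len > 0);
          while ((p = strstr(p, substr)) != NULL) {
              n++;
              p += len;   /* skip the whole match so matches cannot overlap */
          }
          return n;
      }

   In this model ``"aa"`` occurs twice in ``"aaaa"``, not three times, because
   each match consumes the characters it covers.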

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.