.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is
   also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a 32-bit
   type for :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On
   platforms where :ctype:`wchar_t` is available and compatible with the chosen
   Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.
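
To make the difference between the two build variants concrete: a 16-bit
storage unit cannot hold a code point above U+FFFF directly, so such a code
point has to be represented as a surrogate pair. The following is a minimal
sketch in plain C that does not use the Python headers. The name
``split_code_point`` is a hypothetical helper chosen for illustration and is
not part of the C API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Split a Unicode code point into 16-bit storage units.
   Returns the number of units written to out (1 or 2). */
int
split_code_point(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        /* Fits into a single 16-bit unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Outside the BMP: encode as high and low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For example, U+0041 fits in one unit, while U+1F600 yields the two units
0xD83D and 0xDE00. Code that assumes one storage unit per character therefore
behaves differently on the two build variants.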


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data. The buffer is copied into the
   new object. If the buffer is not *NULL*, the return value might be a shared
   object. Therefore, modification of the resulting Unicode object is only allowed
   when *u* is *NULL*.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file ``wchar.h``,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.


wchar_t Support
"""""""""""""""

wchar_t support for platforms which support it:

.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or ``-1`` in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.
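
Because the copied string may not be 0-terminated, a common caller-side pattern
is to keep one element of the buffer in reserve and write the terminator
explicitly. The sketch below illustrates that pattern in plain C without the
Python headers. ``copy_wide_unterminated`` is a hypothetical stand-in that only
mimics the contract described above (copy at most *size* characters, add no
terminator, return the count); ``copy_wide_terminated`` is likewise an
illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract described above: copy at most
   size characters into w, add no terminator, return the number copied. */
ptrdiff_t
copy_wide_unterminated(const wchar_t *src, size_t src_len,
                       wchar_t *w, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    size_t i;
    for (i = 0; i < n; i++)
        w[i] = src[i];
    return (ptrdiff_t)n;
}

/* Caller-side pattern: keep one slot of the buffer in reserve and write
   the terminator explicitly.  capacity counts wchar_t elements.
   Returns the number of characters copied, or -1 if there is no room
   for even the terminator. */
ptrdiff_t
copy_wide_terminated(const wchar_t *src, size_t src_len,
                     wchar_t *buf, size_t capacity)
{
    ptrdiff_t n;
    assert(buf != NULL);
    if (capacity == 0)
        return -1;
    n = copy_wide_unterminated(src, src_len, buf, capacity - 1);
    buf[n] = L'\0';
    return n;
}
```

With a buffer of four elements, a five-character source is truncated to three
characters plus the terminator, so the result is always safe to pass to
functions that expect a 0-terminated wide string.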


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the built-in :func:`unicode`
Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes ``setlocale``).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
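
The idea behind *consumed* can be illustrated without the Python headers. The
following plain C sketch computes how many leading bytes of a buffer form
complete UTF-8 sequences, so that a truncated sequence at the end can be kept
for the next chunk. ``utf8_complete_prefix`` is a hypothetical helper written
for this illustration; it is a simplification that only looks at lead bytes
and is not the algorithm used by the codec itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
   Return how many leading bytes of s form complete UTF-8 sequences.
   Only a truncated sequence at the end is held back; malformed bytes
   are counted one at a time and left for a real decoder to report. */
size_t
utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t pos = 0;
    assert(s != NULL || size == 0);
    while (pos < size) {
        unsigned char lead = s[pos];
        size_t need;
        if (lead < 0x80)
            need = 1;                 /* ASCII */
        else if ((lead & 0xE0) == 0xC0)
            need = 2;                 /* two-byte sequence */
        else if ((lead & 0xF0) == 0xE0)
            need = 3;                 /* three-byte sequence */
        else if ((lead & 0xF8) == 0xF0)
            need = 4;                 /* four-byte sequence */
        else
            need = 1;                 /* stray continuation or invalid lead */
        if (need > size - pos)
            break;                    /* incomplete trailing sequence */
        pos += need;
    }
    return pos;
}
```

For the five bytes ``'h' C3 A9 E2 82`` the function returns 3: the letter and
the two-byte sequence are complete, while the last two bytes start a
three-byte sequence whose final byte has not arrived yet.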


.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate
   pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
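
The byte order rules above can be sketched in plain C, independently of the
Python headers. ``utf32_start_byteorder`` is a hypothetical helper written for
this illustration, not a function of the C API. It only models the decision
made at the start of the input: whether a BOM is honoured and how many bytes
are dropped from the output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
   byteorder follows the convention above: -1 little endian, 0 native,
   1 big endian.  Returns the byte order to decode with and stores in
   *skip the number of BOM bytes to drop from the output (0 or 4). */
int
utf32_start_byteorder(const unsigned char *s, size_t size,
                      int byteorder, size_t *skip)
{
    assert(skip != NULL);
    *skip = 0;
    if (byteorder == 0 && size >= 4) {
        if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
            *skip = 4;
            return -1;      /* BOM stored little endian */
        }
        if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
            *skip = 4;
            return 1;       /* BOM stored big endian */
        }
    }
    return byteorder;       /* explicit order requested, or no BOM found */
}
```

When the caller passes ``-1`` or ``1``, the helper never skips anything, which
mirrors the rule that a BOM is copied to the output in those modes.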


.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6
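
The remark about surrogate pairs can be made concrete with a small plain C
sketch that does not use the Python headers. It joins each valid surrogate
pair in a buffer of 16-bit units into one code point, which is the value a
UTF-32 encoder would then write out. ``units_to_code_points`` is a
hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Convert 16-bit storage units to code points, joining each valid
   surrogate pair into one code point.  out must have room for n
   entries.  Returns the number of code points written. */
size_t
units_to_code_points(const uint16_t *s, size_t n, uint32_t *out)
{
    size_t i = 0, count = 0;
    assert(out != NULL);
    while (i < n) {
        uint32_t ch = s[i++];
        /* High surrogate followed by a low surrogate: combine them. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i < n
            && s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            ch = 0x10000 + ((ch - 0xD800) << 10) + (uint32_t)(s[i] - 0xDC00);
            i++;
        }
        out[count++] = ch;
    }
    return count;
}
```

The three units ``0041 D83D DE00`` become the two code points U+0041 and
U+1F600. A lone surrogate is passed through unchanged in this sketch; how a
real codec treats it depends on its error handling.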


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size* and an :ctype:`int *`
      type for *consumed*. This might require changes in your code for
      properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.
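
The *byteorder* convention for encoding can be sketched in plain C without the
Python headers. ``utf16_write_units`` is a hypothetical helper for this
illustration only. It writes 16-bit units in the requested order and, in native
mode, prepends the BOM; it assumes the input already consists of 16-bit units
and does not model error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Write the 16-bit units in s to out using the byteorder convention
   above.  out needs room for 2 * (n + 1) bytes.
   Returns the number of bytes written. */
size_t
utf16_write_units(const uint16_t *s, size_t n, int byteorder,
                  unsigned char *out)
{
    const uint16_t probe = 1;
    /* Nonzero when the first byte in memory is the low byte. */
    int native_little = *(const unsigned char *)&probe == 1;
    int little = byteorder == -1 || (byteorder == 0 && native_little);
    size_t i, pos = 0;

    assert(out != NULL);
    if (byteorder == 0) {
        /* Native order: prepend the BOM U+FEFF. */
        out[pos++] = little ? 0xFF : 0xFE;
        out[pos++] = little ? 0xFE : 0xFF;
    }
    for (i = 0; i < n; i++) {
        unsigned char lo = (unsigned char)(s[i] & 0xFF);
        unsigned char hi = (unsigned char)(s[i] >> 8);
        out[pos++] = little ? lo : hi;
        out[pos++] = little ? hi : lo;
    }
    return pos;
}
```

Encoding the single unit ``0041`` produces ``41 00`` for little endian and
``00 41`` for big endian. In native mode the output is four bytes long because
the BOM comes first.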


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.
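
Since Latin-1 is exactly the first 256 Unicode ordinals, encoding is a range
check followed by a narrowing copy. The plain C sketch below shows this without
the Python headers. ``latin1_encode_strict`` is a hypothetical helper for
illustration; it reports the first offending position instead of raising an
exception, which is an assumption of this sketch rather than the behaviour of
the codec functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Encode n code points as Latin-1 into out (room for n bytes).
   Returns -1 on success, or the index of the first code point that is
   outside the range 0..255 and therefore cannot be encoded. */
ptrdiff_t
latin1_encode_strict(const uint32_t *s, size_t n, unsigned char *out)
{
    size_t i;
    assert(out != NULL);
    for (i = 0; i < n; i++) {
        if (s[i] > 0xFF)
            return (ptrdiff_t)i;    /* not representable in Latin-1 */
        out[i] = (unsigned char)s[i];
    }
    return -1;
}
```

The code points U+0048 and U+00E9 encode to the bytes ``48 E9``, while U+20AC
(the euro sign) is rejected because it lies outside the first 256 ordinals.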


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Character Map Codecs
""""""""""""""""""""

These are the mapping codec APIs:

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses a mapping to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a LookupError, the character is copied as-is,
meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.
731
732
.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise
   *mapping* can be a dictionary that maps bytes, or a Unicode string, which is
   treated as a lookup table. Byte values greater than the length of the string
   and U+FFFE "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

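The table rules for translation (map to another ordinal, map to None to delete,
or leave unmapped to copy as-is) can be modelled on plain arrays. The sketch
below is only an illustration of those rules and is not the CPython
implementation: ``model_translate`` and ``MODEL_DELETE`` are hypothetical names,
the table is assumed to be a sequence indexed by ordinal, and an index past its
end stands in for a lookup that raises :exc:`LookupError`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel standing in for a table entry of None. */
#define MODEL_DELETE (-1L)

/* Model of the translation rules on plain arrays of ordinals.
 * table[i] holds the replacement ordinal for ordinal i, or MODEL_DELETE.
 * Ordinals outside the table are treated as unmapped and copied unchanged.
 * Returns the number of ordinals written to out, which must have room
 * for at least n entries. */
size_t
model_translate(const long *in, size_t n,
                const long *table, size_t table_len, long *out)
{
    size_t i, written = 0;

    assert(in != NULL && table != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        long ch = in[i];
        if (ch < 0 || (size_t)ch >= table_len) {
            out[written++] = ch;         /* unmapped: copy as-is */
        }
        else if (table[ch] != MODEL_DELETE) {
            out[written++] = table[ch];  /* mapped to another ordinal */
        }
        /* otherwise the entry is "None": the character is dropped */
    }
    return written;
}
```

With a three-entry table ``{10, MODEL_DELETE, 2}``, the input ordinals
``0, 1, 2, 7`` come out as ``10, 2, 7``: ordinal 1 is deleted and ordinal 7 is
copied because it has no table entry.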
MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Methods & Slots
"""""""""""""""

.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If *maxsplit* is negative,
   no limit is set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
   use the default error handling.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches *str*[*start*:*end*] at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


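The meaning of *direction* in a tail match can be illustrated with a small
model on byte strings. This is a sketch of the documented behaviour only, not
the CPython implementation: ``model_tailmatch`` is a hypothetical name, and the
model assumes ``0 <= start <= end <= str_len`` instead of adjusting the slice
bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of a tail match on str[start:end].
 * direction == -1 checks for a prefix, direction == 1 for a suffix.
 * Returns 1 on a match and 0 otherwise. */
int
model_tailmatch(const char *str, size_t str_len,
                const char *sub, size_t sub_len,
                size_t start, size_t end, int direction)
{
    assert(start <= end && end <= str_len);
    if (sub_len > end - start)
        return 0;                       /* substring longer than the slice */
    if (direction > 0)                  /* suffix: compare at the slice end */
        return memcmp(str + end - sub_len, sub, sub_len) == 0;
    return memcmp(str + start, sub, sub_len) == 0;  /* prefix */
}
```

For the string ``"foobar"`` with the slice ``[2:5]`` (that is, ``"oba"``), a
prefix match for ``"ob"`` and a suffix match for ``"oba"`` both succeed.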
.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in *str*[*start*:*end*] using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.


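What "non-overlapping" means for counting can be shown with a short model on
byte strings. This is an illustrative sketch, not the CPython implementation:
``model_count`` is a hypothetical name, the whole string is searched rather than
a slice, and an empty substring is excluded by assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of counting non-overlapping occurrences of sub in str.
 * After a match the scan resumes past the matched text, so matched
 * regions never share characters. */
size_t
model_count(const char *str, size_t str_len,
            const char *sub, size_t sub_len)
{
    size_t i = 0, count = 0;

    assert(sub_len > 0);
    while (i + sub_len <= str_len) {
        if (memcmp(str + i, sub, sub_len) == 0) {
            count++;
            i += sub_len;   /* skip the whole match */
        }
        else {
            i++;
        }
    }
    return count;
}
```

Counting ``"aa"`` in ``"aaaa"`` gives 2 under this rule, not 3, because the
match starting at index 1 would overlap the first one.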
.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``. The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.