Blame - Doc/c-api/unicode.rst - platform/external/python/cpython3

Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1	.. highlightlang:: c
				2
				3	.. _unicodeobjects:
				4
				5	Unicode Objects and Codecs
				6	--------------------------
				7
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9
				10	Unicode Objects
				11	^^^^^^^^^^^^^^^
				12
				13	These are the basic Unicode object types used for the Unicode implementation in
				14	Python:
				15
				16	.. % --- Unicode Type -------------------------------------------------------
				17
				18
				19	.. ctype:: Py_UNICODE
				20
				21	This type represents the storage type which is used by Python internally as
				22	basis for holding Unicode ordinals. Python's default builds use a 16-bit type
				23	for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
				24	possible to build a UCS4 version of Python (most recent Linux distributions come
				25	with UCS4 builds of Python). These builds then use a 32-bit type for
				26	:ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
				27	where :ctype:`wchar_t` is available and compatible with the chosen Python
				28	Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
				29	:ctype:`wchar_t` to enhance native platform compatibility. On all other
				30	platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
				31	short` (UCS2) or :ctype:`unsigned long` (UCS4).
				32
				33	Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
				34	this in mind when writing extensions or interfaces.
				35
				36
				37	.. ctype:: PyUnicodeObject
				38
				39	This subtype of :ctype:`PyObject` represents a Python Unicode object.
				40
				41
				42	.. cvar:: PyTypeObject PyUnicode_Type
				43
				44	This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
				45	is exposed to Python code as ``str``.
				46
				47	The following APIs are really C macros and can be used to do fast checks and to
				48	access internal read-only data of Unicode objects:
				49
				50
				51	.. cfunction:: int PyUnicode_Check(PyObject *o)
				52
				53	Return true if the object o is a Unicode object or an instance of a Unicode
				54	subtype.
				55
				56
				57	.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
				58
				59	Return true if the object o is a Unicode object, but not an instance of a
				60	subtype.
				61
				62
				63	.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
				64
				65	Return the size of the object. o has to be a :ctype:`PyUnicodeObject` (not
				66	checked).
				67
				68
				69	.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
				70
				71	Return the size of the object's internal buffer in bytes. o has to be a
				72	:ctype:`PyUnicodeObject` (not checked).
				73
				74
				75	.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
				76
				77	Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. o
				78	has to be a :ctype:`PyUnicodeObject` (not checked).
				79
				80
				81	.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
				82
				83	Return a pointer to the internal buffer of the object. o has to be a
				84	:ctype:`PyUnicodeObject` (not checked).
				85
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	86
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame^]	87	.. cfunction:: int PyUnicode_ClearFreeList()
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	88
				89	Clear the free list. Return the total number of freed items.
				90
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame^]	91
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	92	Unicode provides many different character properties. The most often needed ones
				93	are available through these macros which are mapped to C functions depending on
				94	the Python configuration.
				95
				96	.. % --- Unicode character properties ---------------------------------------
				97
				98
				99	.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
				100
				101	Return 1 or 0 depending on whether ch is a whitespace character.
				102
				103
				104	.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
				105
				106	Return 1 or 0 depending on whether ch is a lowercase character.
				107
				108
				109	.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
				110
				111	Return 1 or 0 depending on whether ch is an uppercase character.
				112
				113
				114	.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
				115
				116	Return 1 or 0 depending on whether ch is a titlecase character.
				117
				118
				119	.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
				120
				121	Return 1 or 0 depending on whether ch is a linebreak character.
				122
				123
				124	.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
				125
				126	Return 1 or 0 depending on whether ch is a decimal character.
				127
				128
				129	.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
				130
				131	Return 1 or 0 depending on whether ch is a digit character.
				132
				133
				134	.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
				135
				136	Return 1 or 0 depending on whether ch is a numeric character.
				137
				138
				139	.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
				140
				141	Return 1 or 0 depending on whether ch is an alphabetic character.
				142
				143
				144	.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
				145
				146	Return 1 or 0 depending on whether ch is an alphanumeric character.
				147
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	148
				149	.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
				150
				151	Return 1 or 0 depending on whether ch is a printable character.
				152	Nonprintable characters are those characters defined in the Unicode character
				153	database as "Other" or "Separator", excepting the ASCII space (0x20) which is
				154	considered printable. (Note that printable characters in this context are
				155	those which should not be escaped when :func:`repr` is invoked on a string.
				156	It has no bearing on the handling of strings written to :data:`sys.stdout` or
				157	:data:`sys.stderr`.)
				158
				159
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	160	These APIs can be used for fast direct character conversions:
				161
				162
				163	.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
				164
				165	Return the character ch converted to lower case.
				166
				167
				168	.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
				169
				170	Return the character ch converted to upper case.
				171
				172
				173	.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
				174
				175	Return the character ch converted to title case.
				176
				177
				178	.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
				179
				180	Return the character ch converted to a decimal positive integer. Return
				181	``-1`` if this is not possible. This macro does not raise exceptions.
				182
				183
				184	.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
				185
				186	Return the character ch converted to a single digit integer. Return ``-1`` if
				187	this is not possible. This macro does not raise exceptions.
				188
				189
				190	.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
				191
				192	Return the character ch converted to a double. Return ``-1.0`` if this is not
				193	possible. This macro does not raise exceptions.
				194
				195	To create Unicode objects and access their basic sequence properties, use these
				196	APIs:
				197
				198	.. % --- Plain Py_UNICODE ---------------------------------------------------
				199
				200
				201	.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
				202
				203	Create a Unicode Object from the Py_UNICODE buffer u of the given size. u
				204	may be NULL which causes the contents to be undefined. It is the user's
				205	responsibility to fill in the needed data. The buffer is copied into the new
				206	object. If the buffer is not NULL, the return value might be a shared object.
				207	Therefore, modification of the resulting Unicode object is only allowed when u
				208	is NULL.
				209
				210
				211	.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
				212
				213	Create a Unicode Object from the char buffer u. The bytes will be interpreted
				214	as being UTF-8 encoded. u may also be NULL which
				215	causes the contents to be undefined. It is the user's responsibility to fill in
				216	the needed data. The buffer is copied into the new object. If the buffer is not
				217	NULL, the return value might be a shared object. Therefore, modification of
				218	the resulting Unicode object is only allowed when u is NULL.
				219
				220
				221	.. cfunction:: PyObject* PyUnicode_FromString(const char *u)
				222
				223	Create a Unicode object from a UTF-8 encoded null-terminated char buffer
				224	u.
				225
				226
				227	.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
				228
				229	Take a C :cfunc:`printf`\ -style format string and a variable number of
				230	arguments, calculate the size of the resulting Python unicode string and return
				231	a string with the values formatted into it. The variable arguments must be C
				232	types and must correspond exactly to the format characters in the format
				233	string. The following format characters are allowed:
				234
				235	.. % The descriptions for %zd and %zu are wrong, but the truth is complicated
				236	.. % because not all compilers support the %z width modifier -- we fake it
				237	.. % when necessary via interpolating PY_FORMAT_SIZE_T.
				238
				239	+-------------------+---------------------+--------------------------------+
				240	| Format Characters | Type                | Comment                        |
				241	+===================+=====================+================================+
				242	| :attr:`%%`        | n/a                 | The literal % character.       |
				243	+-------------------+---------------------+--------------------------------+
				244	| :attr:`%c`        | int                 | A single character,            |
				245	|                   |                     | represented as a C int.        |
				246	+-------------------+---------------------+--------------------------------+
				247	| :attr:`%d`        | int                 | Exactly equivalent to          |
				248	|                   |                     | ``printf("%d")``.              |
				249	+-------------------+---------------------+--------------------------------+
				250	| :attr:`%u`        | unsigned int        | Exactly equivalent to          |
				251	|                   |                     | ``printf("%u")``.              |
				252	+-------------------+---------------------+--------------------------------+
				253	| :attr:`%ld`       | long                | Exactly equivalent to          |
				254	|                   |                     | ``printf("%ld")``.             |
				255	+-------------------+---------------------+--------------------------------+
				256	| :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
				257	|                   |                     | ``printf("%lu")``.             |
				258	+-------------------+---------------------+--------------------------------+
				259	| :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
				260	|                   |                     | ``printf("%zd")``.             |
				261	+-------------------+---------------------+--------------------------------+
				262	| :attr:`%zu`       | size_t              | Exactly equivalent to          |
				263	|                   |                     | ``printf("%zu")``.             |
				264	+-------------------+---------------------+--------------------------------+
				265	| :attr:`%i`        | int                 | Exactly equivalent to          |
				266	|                   |                     | ``printf("%i")``.              |
				267	+-------------------+---------------------+--------------------------------+
				268	| :attr:`%x`        | int                 | Exactly equivalent to          |
				269	|                   |                     | ``printf("%x")``.              |
				270	+-------------------+---------------------+--------------------------------+
				271	| :attr:`%s`        | char\*              | A null-terminated C character  |
				272	|                   |                     | array.                         |
				273	+-------------------+---------------------+--------------------------------+
				274	| :attr:`%p`        | void\*              | The hex representation of a C  |
				275	|                   |                     | pointer. Mostly equivalent to  |
				276	|                   |                     | ``printf("%p")`` except that   |
				277	|                   |                     | it is guaranteed to start with |
				278	|                   |                     | the literal ``0x`` regardless  |
				279	|                   |                     | of what the platform's         |
				280	|                   |                     | ``printf`` yields.             |
				281	+-------------------+---------------------+--------------------------------+
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	282	| :attr:`%A`        | PyObject\*          | The result of calling          |
				283	|                   |                     | :func:`ascii`.                 |
				284	+-------------------+---------------------+--------------------------------+
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	285	| :attr:`%U`        | PyObject\*          | A unicode object.              |
				286	+-------------------+---------------------+--------------------------------+
				287	| :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
				288	|                   |                     | NULL) and a null-terminated    |
				289	|                   |                     | C character array as a second  |
				290	|                   |                     | parameter (which will be used, |
				291	|                   |                     | if the first parameter is      |
				292	|                   |                     | NULL).                         |
				293	+-------------------+---------------------+--------------------------------+
				294	| :attr:`%S`        | PyObject\*          | The result of calling          |
Benjamin Peterson	e866206	2009-03-08 23:51:13 +0000	[diff] [blame]	295	|                   |                     | :func:`PyObject_Str`.          |
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	296	+-------------------+---------------------+--------------------------------+
				297	| :attr:`%R`        | PyObject\*          | The result of calling          |
				298	|                   |                     | :func:`PyObject_Repr`.         |
				299	+-------------------+---------------------+--------------------------------+
				300
				301	An unrecognized format character causes all the rest of the format string to be
				302	copied as-is to the result string, and any extra arguments discarded.
				303
				304
				305	.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
				306
				307	Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
				308	arguments.
				309
				310
				311	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
				312
				313	Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
				314	buffer, NULL if unicode is not a Unicode object.
				315
				316
				317	.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
				318
				319	Return the length of the Unicode object.
				320
				321
				322	.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
				323
				324	Coerce an encoded object obj to a Unicode object and return a reference with
				325	incremented refcount.
				326
				327	String and other char buffer compatible objects are decoded according to the
				328	given encoding and using the error handling defined by errors. Both can be
				329	NULL to have the interface use the default values (see the next section for
				330	details).
				331
				332	All other objects, including Unicode objects, cause a :exc:`TypeError` to be
				333	set.
				334
				335	The API returns NULL if there was an error. The caller is responsible for
				336	decref'ing the returned objects.
				337
				338
				339	.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
				340
				341	Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
				342	throughout the interpreter whenever coercion to Unicode is needed.
				343
				344	If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
				345	Python can interface directly to this type using the following functions.
				346	Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
				347	the system's :ctype:`wchar_t`.
				348
				349	.. % --- wchar_t support for platforms which support it ---------------------
				350
				351
				352	.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
				353
				354	Create a Unicode object from the :ctype:`wchar_t` buffer w of the given size.
Martin v. Löwis	790465f	2008-04-05 20:41:37 +0000	[diff] [blame]	355	Passing -1 as the size indicates that the function must itself compute the length,
				356	using wcslen.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	357	Return NULL on failure.
				358
				359
				360	.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
				361
				362	Copy the Unicode object contents into the :ctype:`wchar_t` buffer w. At most
				363	size :ctype:`wchar_t` characters are copied (excluding a possibly trailing
				364	0-termination character). Return the number of :ctype:`wchar_t` characters
				365	copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
				366	string may or may not be 0-terminated. It is the responsibility of the caller
				367	to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
				368	required by the application.
				369
				370
				371	.. _builtincodecs:
				372
				373	Built-in Codecs
				374	^^^^^^^^^^^^^^^
				375
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame^]	376	Python provides a set of built-in codecs which are written in C for speed. All of
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	377	these codecs are directly usable via the following functions.
				378
				379	Many of the following APIs take two arguments, encoding and errors. These
				380	parameters have the same semantics as the ones of the same name in the
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame^]	381	built-in :func:`str` string object constructor.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	382
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	383	Setting encoding to NULL causes the default encoding to be used
				384	which is ASCII. The file system calls should use
				385	:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
				386	variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
				387	variable should be treated as read-only: On some systems, it will be a
				388	pointer to a static string, on others, it will change at run-time
				389	(such as when the application invokes setlocale).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	390
				391	Error handling is set by errors which may also be set to NULL meaning to use
				392	the default handling defined for the codec. Default error handling for all
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame^]	393	built-in codecs is "strict" (:exc:`ValueError` is raised).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	394
				395	The codecs all use a similar interface. For simplicity, only deviations from the
				396	following generic ones are documented.
				397
				398	These are the generic codec APIs:
				399
				400	.. % --- Generic Codecs -----------------------------------------------------
				401
				402
				403	.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
				404
				405	Create a Unicode object by decoding size bytes of the encoded string s.
				406	encoding and errors have the same meaning as the parameters of the same name
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame^]	407	in the :func:`str` built-in function. The codec to be used is looked up
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	408	using the Python codec registry. Return NULL if an exception was raised by
				409	the codec.
				410
				411
				412	.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
				413
				414	Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	415	bytes object. encoding and errors have the same meaning as the
				416	parameters of the same name in the Unicode :meth:`encode` method. The codec
				417	to be used is looked up using the Python codec registry. Return NULL if an
				418	exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	419
				420
				421	.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
				422
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	423	Encode a Unicode object and return the result as Python bytes object.
				424	encoding and errors have the same meaning as the parameters of the same
				425	name in the Unicode :meth:`encode` method. The codec to be used is looked up
				426	using the Python codec registry. Return NULL if an exception was raised by
				427	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	428
				429	These are the UTF-8 codec APIs:
				430
				431	.. % --- UTF-8 Codecs -------------------------------------------------------
				432
				433
				434	.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
				435
				436	Create a Unicode object by decoding size bytes of the UTF-8 encoded string
				437	s. Return NULL if an exception was raised by the codec.
				438
				439
				440	.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
				441
				442	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
				443	consumed is not NULL, trailing incomplete UTF-8 byte sequences will not be
				444	treated as an error. Those bytes will not be decoded and the number of bytes
				445	that have been decoded will be stored in consumed.
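
The idea of leaving a trailing incomplete sequence undecoded can be sketched in plain C. This is only an illustration under stated assumptions, not CPython's decoder: ``utf8_complete_prefix`` is a hypothetical helper that looks at lead bytes only and does not validate continuation bytes. Its return value plays the role of the count stored in consumed.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the Python C API): return how many
   leading bytes of s[0..size) belong to complete UTF-8 sequences,
   judging only by each lead byte and the number of bytes available. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;
    while (i < size) {
        unsigned char c = s[i];
        size_t need;
        if (c < 0x80)                need = 1;  /* ASCII */
        else if ((c & 0xE0) == 0xC0) need = 2;  /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) need = 3;  /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) need = 4;  /* 11110xxx */
        else                         need = 1;  /* invalid lead byte: left to the error handler */
        if (size - i < need)
            break;                              /* trailing sequence is incomplete */
        i += need;
    }
    return i;
}
```

A caller reading from a stream would decode only the first ``utf8_complete_prefix(buf, n)`` bytes and keep the remainder for the next chunk.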
				446
				447
				448	.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				449
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	450	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
				451	return a Python bytes object. Return NULL if an exception was raised by
				452	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	453
				454
				455	.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
				456
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	457	Encode a Unicode object using UTF-8 and return the result as Python bytes
				458	object. Error handling is "strict". Return NULL if an exception was
				459	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	460
				461	These are the UTF-32 codec APIs:
				462
				463	.. % --- UTF-32 Codecs ------------------------------------------------------ */
				464
				465
				466	.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				467
				468	Decode size bytes from the UTF-32 encoded buffer s and return the
				469	corresponding Unicode object. errors (if non-NULL) defines the error
				470	handling. It defaults to "strict".
				471
				472	If byteorder is non-NULL, the decoder starts decoding using the given byte
				473	order::
				474
				475	*byteorder == -1: little endian
				476	*byteorder == 0: native order
				477	*byteorder == 1: big endian
				478
				479	and then switches if the first four bytes of the input data are a byte order mark
				480	(BOM) and the specified byte order is native order. This BOM is not copied into
				481	the resulting Unicode string. After completion, ``*byteorder`` is set to the
				482	current byte order at the end of input data.
				483
				484	In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
				485
				486	If byteorder is NULL, the codec starts in native order mode.
				487
				488	Return NULL if an exception was raised by the codec.
				489
				490
				491	.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				492
				493	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
				494	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
				495	trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
				496	by four) as an error. Those bytes will not be decoded and the number of bytes
				497	that have been decoded will be stored in consumed.
				498
				499
				500	.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				501
				502	Return a Python bytes object holding the UTF-32 encoded value of the Unicode
				503	data in s. If byteorder is not ``0``, output is written according to the
				504	following byte order::
				505
				506	byteorder == -1: little endian
				507	byteorder == 0: native byte order (writes a BOM mark)
				508	byteorder == 1: big endian
				509
				510	If byteorder is ``0``, the output string will always start with the Unicode BOM
				511	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				512
				513	If Py_UNICODE_WIDE is not defined, surrogate pairs will be output
				514	as a single codepoint.
				515
				516	Return NULL if an exception was raised by the codec.
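
The byte order rules above can be illustrated with a small standalone sketch. It is not the codec's implementation: ``put_u32`` and ``utf32_encode`` are hypothetical names, and the caller is assumed to supply an output buffer of at least ``4 * (n + 1)`` bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store one 32-bit value at out[0..3]. */
static void put_u32(unsigned char *out, uint32_t v, int little)
{
    for (int i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)(v >> shift);
    }
}

/* Encode n code points as UTF-32. byteorder: -1 little endian,
   0 native order with a leading BOM, 1 big endian.
   Returns the number of bytes written to out. */
size_t utf32_encode(const uint32_t *cp, size_t n, int byteorder,
                    unsigned char *out)
{
    const uint16_t probe = 1;   /* first byte is 1 on a little-endian host */
    int little = (byteorder == -1) ||
                 (byteorder == 0 && *(const unsigned char *)&probe == 1);
    size_t pos = 0;

    if (byteorder == 0) {       /* native order: prepend U+FEFF */
        put_u32(out, 0xFEFF, little);
        pos = 4;
    }
    for (size_t i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, cp[i], little);
    return pos;
}
```

With ``byteorder`` set to ``0`` the output is four bytes longer than in the other two modes, because of the BOM.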
				517
				518
				519	.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
				520
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	521	Return a Python byte string using the UTF-32 encoding in native byte
				522	order. The string always starts with a BOM mark. Error handling is "strict".
				523	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	524
				525
				526	These are the UTF-16 codec APIs:
				527
				528	.. % --- UTF-16 Codecs ------------------------------------------------------ */
				529
				530
				531	.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				532
				533	Decode size bytes from the UTF-16 encoded buffer s and return the
				534	corresponding Unicode object. errors (if non-NULL) defines the error
				535	handling. It defaults to "strict".
				536
				537	If byteorder is non-NULL, the decoder starts decoding using the given byte
				538	order::
				539
				540	*byteorder == -1: little endian
				541	*byteorder == 0: native order
				542	*byteorder == 1: big endian
				543
				544	and then switches if the first two bytes of the input data are a byte order mark
				545	(BOM) and the specified byte order is native order. This BOM is not copied into
				546	the resulting Unicode string. After completion, ``*byteorder`` is set to the
				547	current byte order at the end of input data.
				548
				549	If byteorder is NULL, the codec starts in native order mode.
				550
				551	Return NULL if an exception was raised by the codec.
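
The BOM handling described above can be sketched as follows. This is a minimal illustration, not the decoder itself: ``utf16_check_bom`` is a hypothetical helper, and it consults the BOM only when the requested order is native (``0``), as the text states.

```c
#include <stddef.h>

/* Hypothetical helper. *byteorder uses the convention above:
   -1 little endian, 0 native order, 1 big endian.
   Returns the number of leading bytes to skip (0 or 2) and updates
   *byteorder when a BOM decides the order. */
size_t utf16_check_bom(const unsigned char *s, size_t size, int *byteorder)
{
    if (*byteorder == 0 && size >= 2) {
        if (s[0] == 0xFF && s[1] == 0xFE) {   /* U+FEFF stored little endian */
            *byteorder = -1;
            return 2;
        }
        if (s[0] == 0xFE && s[1] == 0xFF) {   /* U+FEFF stored big endian */
            *byteorder = 1;
            return 2;
        }
    }
    return 0;   /* no BOM consulted: keep the caller's byte order */
}
```

Decoding would then start at ``s + skip``, so the BOM never reaches the resulting string.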
				552
				553
				554	.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				555
				556	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
				557	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
				558	trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
				559	split surrogate pair) as an error. Those bytes will not be decoded and the
				560	number of bytes that have been decoded will be stored in consumed.
				561
				562
				563	.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				564
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	565	Return a Python bytes object holding the UTF-16 encoded value of the Unicode
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	566	data in s. If byteorder is not ``0``, output is written according to the
				567	following byte order::
				568
				569	byteorder == -1: little endian
				570	byteorder == 0: native byte order (writes a BOM mark)
				571	byteorder == 1: big endian
				572
				573	If byteorder is ``0``, the output string will always start with the Unicode BOM
				574	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				575
				576	If Py_UNICODE_WIDE is defined, a single :ctype:`Py_UNICODE` value may get
				577	represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
				578	value is interpreted as a UCS-2 character.
				579
				580	Return NULL if an exception was raised by the codec.
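
How a single value above U+FFFF becomes a surrogate pair follows from the UTF-16 definition. The sketch below shows the arithmetic only; ``utf16_units`` is a hypothetical helper and not part of the API documented here.

```c
#include <stdint.h>

/* Hypothetical helper: write the UTF-16 code units for one code point
   into out and return how many units were written (1 or 2). */
int utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}
```

For example, U+1F600 is split into the units ``0xD83D`` and ``0xDE00``.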
				581
				582
				583	.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
				584
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	585	Return a Python byte string using the UTF-16 encoding in native byte
				586	order. The string always starts with a BOM mark. Error handling is "strict".
				587	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	588
				589	These are the "Unicode Escape" codec APIs:
				590
				591	.. % --- Unicode-Escape Codecs ----------------------------------------------
				592
				593
				594	.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				595
				596	Create a Unicode object by decoding size bytes of the Unicode-Escape encoded
				597	string s. Return NULL if an exception was raised by the codec.
				598
				599
				600	.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
				601
				602	Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
				603	return a Python bytes object. Return NULL if an exception was raised by the
				604	codec.
				605
				606
				607	.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
				608
				609	Encode a Unicode object using Unicode-Escape and return the result as a Python
				610	bytes object. Error handling is "strict". Return NULL if an exception was
				611	raised by the codec.
				612
				613	These are the "Raw Unicode Escape" codec APIs:
				614
				615	.. % --- Raw-Unicode-Escape Codecs ------------------------------------------
				616
				617
				618	.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				619
				620	Create a Unicode object by decoding size bytes of the Raw-Unicode-Escape
				621	encoded string s. Return NULL if an exception was raised by the codec.
				622
				623
				624	.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				625
				626	Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
				627	and return a Python bytes object. Return NULL if an exception was raised by
				628	the codec.
				629
				630
				631	.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
				632
				633	Encode a Unicode object using Raw-Unicode-Escape and return the result as a
				634	Python bytes object. Error handling is "strict". Return NULL if an exception
				635	was raised by the codec.
				636
				637	These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
				638	ordinals and only these are accepted by the codecs during encoding.
				639
				640	.. % --- Latin-1 Codecs -----------------------------------------------------
				641
				642
				643	.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
				644
				645	Create a Unicode object by decoding size bytes of the Latin-1 encoded string
				646	s. Return NULL if an exception was raised by the codec.
				647
				648
				649	.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				650
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	651	Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
				652	return a Python bytes object. Return NULL if an exception was raised by
				653	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	654
				655
				656	.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
				657
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	658	Encode a Unicode object using Latin-1 and return the result as a Python bytes
				659	object. Error handling is "strict". Return NULL if an exception was
				660	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	661
				662	These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
				663	codes generate errors.
				664
				665	.. % --- ASCII Codecs -------------------------------------------------------
				666
				667
				668	.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
				669
				670	Create a Unicode object by decoding *size* bytes of the ASCII encoded string
				671	*s*. Return NULL if an exception was raised by the codec.
				672
				673
				674	.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				675
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	676	Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
				677	return a Python bytes object. Return NULL if an exception was raised by
				678	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	679
				680
				681	.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
				682
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	683	Encode a Unicode object using ASCII and return the result as a Python bytes
				684	object. Error handling is "strict". Return NULL if an exception was
				685	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	686
				687	These are the mapping codec APIs:
				688
				689	.. % --- Character Map Codecs -----------------------------------------------
				690
				691	This codec is special in that it can be used to implement many different codecs
				692	(and this is in fact what was done to obtain most of the standard codecs
				693	included in the :mod:`encodings` package). The codec uses mappings to encode and
				694	decode characters.
				695
				696	Decoding mappings must map single string characters to single Unicode
				697	characters, integers (which are then interpreted as Unicode ordinals) or None
				698	(meaning "undefined mapping" and causing an error).
				699
				700	Encoding mappings must map single Unicode characters to single string
				701	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				702	(meaning "undefined mapping" and causing an error).
				703
				704	The mapping objects provided need only support the :meth:`__getitem__` mapping
				705	interface.
				706
				707	If a character lookup fails with a LookupError, the character is copied as-is,
				708	meaning that its ordinal value is interpreted as a Unicode ordinal (when
				709	decoding) or a Latin-1 ordinal (when encoding). Because of this, mappings only
				710	need to contain entries for characters that map to different code points.
				711
				712
				713	.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				714
				715	Create a Unicode object by decoding *size* bytes of the encoded string *s* using
				716	the given *mapping* object. Return NULL if an exception was raised by the
				717	codec. If *mapping* is NULL, Latin-1 decoding is done. Otherwise it can be a
				718	dictionary mapping byte values, or a Unicode string that is treated as a lookup
				719	table. Byte values greater than the length of the string and U+FFFE
				720	"characters" are treated as "undefined mapping".
				721
				722
				723	.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				724
				725	Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
				726	*mapping* object and return a Python bytes object. Return NULL if an
				727	exception was raised by the codec.
				728
				729
				730	.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				731
				732	Encode a Unicode object using the given *mapping* object and return the result
				733	as a Python bytes object. Error handling is "strict". Return NULL if an
				734	exception was raised by the codec.
				735
				736	The following codec API is special in that it maps Unicode to Unicode.
				737
				738
				739	.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				740
				741	Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
				742	character mapping table to it and return the resulting Unicode object. Return
				743	NULL when an exception was raised by the codec.
				744
				745	The mapping table must map Unicode ordinal integers to Unicode ordinal
				746	integers or None (causing deletion of the character).
				747
				748	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				749	and sequences work well. Unmapped character ordinals (ones which cause a
				750	:exc:`LookupError`) are left untouched and are copied as-is.
				751
Jeroen Ruigrok van der Werven	47a7d70	2009-04-27 05:43:17 +0000	[diff] [blame]	752
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	753	These are the MBCS codec APIs. They are currently only available on Windows and
				754	use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
				755	DBCS) is a class of encodings, not just one. The target encoding is defined by
				756	the user settings on the machine running the codec.
				757
				758	.. % --- MBCS codecs for Windows --------------------------------------------
				759
				760
				761	.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				762
				763	Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
				764	Return NULL if an exception was raised by the codec.
				765
				766
				767	.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				768
				769	If *consumed* is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
				770	*consumed* is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode a
				771	trailing lead byte, and the number of bytes that have been decoded will be
				772	stored in *consumed*.
				773
				774
				775	.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				776
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	777	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
				778	a Python bytes object. Return NULL if an exception was raised by the
				779	codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	780
				781
				782	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				783
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	784	Encode a Unicode object using MBCS and return the result as a Python bytes
				785	object. Error handling is "strict". Return NULL if an exception was
				786	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	787
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	788	For decoding file names and other environment strings, :cdata:`Py_FileSystemDefaultEncoding`
				789	should be used as the encoding, and ``"surrogateescape"`` should be used as the error
				790	handler. For encoding file names during argument parsing, the ``O&`` converter should
				791	be used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
				792
				793	.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
				794
				795	Convert *obj* into *result*, using the file system encoding and the ``surrogateescape``
				796	error handler. *result* must be a ``PyObject*``, yielding a bytes or bytearray object
				797	which must be released when it is no longer used.
				798
				799	.. versionadded:: 3.1
				800
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	801	.. % --- Methods & Slots ----------------------------------------------------
				802
				803
				804	.. _unicodemethodsandslots:
				805
				806	Methods and Slot Functions
				807	^^^^^^^^^^^^^^^^^^^^^^^^^^
				808
				809	The following APIs are capable of handling Unicode objects and strings on input
				810	(we refer to them as strings in the descriptions) and return Unicode objects or
				811	integers as appropriate.
				812
				813	They all return NULL or ``-1`` if an exception occurs.
				814
				815
				816	.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
				817
				818	Concatenate two strings, giving a new Unicode string.
				819
				820
				821	.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				822
				823	Split a string giving a list of Unicode strings. If *sep* is NULL, splitting
				824	will be done at all whitespace substrings. Otherwise, splits occur at the given
				825	separator. At most *maxsplit* splits will be done. If negative, no limit is
				826	set. Separators are not included in the resulting list.
				827
				828
				829	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				830
				831	Split a Unicode string at line breaks, returning a list of Unicode strings.
				832	CRLF is considered to be one line break. If *keepend* is 0, the line break
				833	characters are not included in the resulting strings.
				834
				835
				836	.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				837
				838	Translate a string by applying a character mapping table to it and return the
				839	resulting Unicode object.
				840
				841	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				842	or None (causing deletion of the character).
				843
				844	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				845	and sequences work well. Unmapped character ordinals (ones which cause a
				846	:exc:`LookupError`) are left untouched and are copied as-is.
				847
				848	*errors* has the usual meaning for codecs. It may be NULL, which means that
				849	the default error handling is used.
				850
				851
				852	.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				853
				854	Join a sequence of strings using the given separator and return the resulting
				855	Unicode string.
				856
				857
				858	.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				859
				860	Return 1 if *substr* matches ``str[start:end]`` at the given tail end
				861	(*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
				862	0 otherwise. Return ``-1`` if an error occurred.
				863
				864
				865	.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				866
				867	Return the first position of *substr* in ``str[start:end]`` using the given
				868	*direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
				869	backward search). The return value is the index of the first match; a value of
				870	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				871	occurred and an exception has been set.
				872
				873
				874	.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				875
				876	Return the number of non-overlapping occurrences of *substr* in
				877	``str[start:end]``. Return ``-1`` if an error occurred.
				878
				879
				880	.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				881
				882	Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
				883	return the resulting Unicode object. *maxcount* == -1 means replace all
				884	occurrences.
				885
				886
				887	.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				888
				889	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				890	respectively.
				891
				892
Benjamin Peterson	c22ed14	2008-07-01 19:12:34 +0000	[diff] [blame]	893	.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)
				894
				895	Compare a unicode object, *uni*, with *string* and return -1, 0, 1 for less
				896	than, equal, and greater than, respectively.
				897
				898
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	899	.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				900
				901	Rich compare two unicode strings and return one of the following:
				902
				903	* ``NULL`` in case an exception was raised
				904	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				905	* :const:`Py_NotImplemented` in case the type combination is unknown
				906
				907	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				908	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				909	with a :exc:`UnicodeDecodeError`.
				910
				911	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				912	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				913
				914
				915	.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				916
				917	Return a new string object from *format* and *args*; this is analogous to
				918	``format % args``. The *args* argument must be a tuple.
				919
				920
				921	.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				922
				923	Check whether *element* is contained in *container* and return true or false
				924	accordingly.
				925
				926	*element* has to coerce to a one element Unicode string. ``-1`` is returned if
				927	there was an error.
				928
				929
				930	.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
				931
				932	Intern the argument *\*string* in place. The argument must be the address of a
				933	pointer variable pointing to a Python unicode string object. If there is an
				934	existing interned string that is the same as *\*string*, it sets *\*string* to
				935	it (decrementing the reference count of the old string object and incrementing
				936	the reference count of the interned string object), otherwise it leaves
				937	*\*string* alone and interns it (incrementing its reference count).
				938	(Clarification: even though there is a lot of talk about reference counts, think
				939	of this function as reference-count-neutral; you own the object after the call
				940	if and only if you owned it before the call.)
				941
				942
				943	.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
				944
				945	A combination of :cfunc:`PyUnicode_FromString` and
				946	:cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
				947	that has been interned, or a new ("owned") reference to an earlier interned
				948	string object with the same value.
				949